[GitHub] spark pull request #22857: [SPARK-25860][SQL] Replace Literal(null, _) with ...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22857#discussion_r238489445
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala 
---
@@ -31,14 +31,14 @@ import org.apache.spark.scheduler.{SparkListener, 
SparkListenerJobEnd}
 import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.expressions.Uuid
 import org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation
-import org.apache.spark.sql.catalyst.plans.logical.{Filter, 
OneRowRelation, Union}
+import org.apache.spark.sql.catalyst.plans.logical.{OneRowRelation, Union}
--- End diff --

Yea, also it's an unrelated import cleanup. That should be discouraged because 
it might make backporting / reverting more difficult, and sometimes those 
changes confuse readers.


---




[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Skips Python resource limit on Win...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23055
  
The last comment was a minor one about the doc - actually the whole point was 
minor. It is related to Windows.


---




[GitHub] spark issue #23200: [SPARK-26033][SPARK-26034][PYTHON][FOLLOW-UP] Small clea...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23200
  
Thank you @srowen and @BryanCutler 


---




[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23202#discussion_r238199960
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 ---
@@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends 
Serializable {
   compatibleType(typeSoFar, 
tryParseDecimal(field)).getOrElse(StringType)
 case DoubleType => tryParseDouble(field)
 case TimestampType => tryParseTimestamp(field)
+case DateType => tryParseDate(field)
--- End diff --

I see. Can we try date first above? I was wondering if there was a reason not 
to try date first.


---




[GitHub] spark pull request #23203: [SPARK-26252][PYTHON] Add support to run specific...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23203#discussion_r238173745
  
--- Diff: python/run-tests-with-coverage ---
@@ -50,8 +50,6 @@ export SPARK_CONF_DIR="$COVERAGE_DIR/conf"
 # This environment variable enables the coverage.
 export COVERAGE_PROCESS_START="$FWDIR/.coveragerc"
 
-# If you'd like to run a specific unittest class, you could do such as
-# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
 ./run-tests "$@"
--- End diff --

BTW, it works with the coverage script as well; manually tested.


---




[GitHub] spark issue #23203: [SPARK-26252][PYTHON] Add support to run specific unitte...

2018-12-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23203
  
cc @cloud-fan, @dongjoon-hyun, @icexelloss, @BryanCutler, @viirya (with whom I 
talked about this before).


---




[GitHub] spark pull request #23203: [SPARK-26252][PYTHON] Add support to run specific...

2018-12-03 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23203

[SPARK-26252][PYTHON] Add support to run specific unittests and/or doctests 
in python/run-tests script

## What changes were proposed in this pull request?

This PR proposes to add a developer option, `--testnames`, to our testing 
script to allow running a specific set of unittests and doctests.


**1. Run unittests in the class**

```
./run-tests --testnames 'pyspark.sql.tests.test_arrow ArrowTests'
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python tests: ['pyspark.sql.tests.test_arrow 
ArrowTests']
Starting test(python2.7): pyspark.sql.tests.test_arrow ArrowTests
Starting test(pypy): pyspark.sql.tests.test_arrow ArrowTests
Finished test(python2.7): pyspark.sql.tests.test_arrow ArrowTests (14s)
Finished test(pypy): pyspark.sql.tests.test_arrow ArrowTests (14s) ... 22 
tests were skipped
Tests passed in 14 seconds

Skipped tests in pyspark.sql.tests.test_arrow ArrowTests with pypy:
test_createDataFrame_column_name_encoding 
(pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be 
installed; however, it was not found.'
test_createDataFrame_does_not_modify_input 
(pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be 
installed; however, it was not found.'
test_createDataFrame_fallback_disabled 
(pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be 
installed; however, it was not found.'
test_createDataFrame_fallback_enabled 
(pyspark.sql.tests.test_arrow.ArrowTests) ... skipped
...
```

**2. Run single unittest in the class.**

```
./run-tests --testnames 'pyspark.sql.tests.test_arrow 
ArrowTests.test_null_conversion'
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python tests: ['pyspark.sql.tests.test_arrow 
ArrowTests.test_null_conversion']
Starting test(pypy): pyspark.sql.tests.test_arrow 
ArrowTests.test_null_conversion
Starting test(python2.7): pyspark.sql.tests.test_arrow 
ArrowTests.test_null_conversion
Finished test(pypy): pyspark.sql.tests.test_arrow 
ArrowTests.test_null_conversion (0s) ... 1 tests were skipped
Finished test(python2.7): pyspark.sql.tests.test_arrow 
ArrowTests.test_null_conversion (8s)
Tests passed in 8 seconds

Skipped tests in pyspark.sql.tests.test_arrow 
ArrowTests.test_null_conversion with pypy:
test_null_conversion (pyspark.sql.tests.test_arrow.ArrowTests) ... 
skipped 'Pandas >= 0.19.2 must be installed; however, it was not found.'
```

**3. Run doctests in single PySpark module.**

```
./run-tests --testnames 'pyspark.sql.dataframe'
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python tests: ['pyspark.sql.dataframe']
Starting test(pypy): pyspark.sql.dataframe
Starting test(python2.7): pyspark.sql.dataframe
Finished test(python2.7): pyspark.sql.dataframe (47s)
Finished test(pypy): pyspark.sql.dataframe (48s)
Tests passed in 48 seconds
```

Of course, you can mix them:

```
./run-tests --testnames 'pyspark.sql.tests.test_arrow 
ArrowTests,pyspark.sql.dataframe'
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python tests: ['pyspark.sql.tests.test_arrow 
ArrowTests', 'pyspark.sql.dataframe']
Starting test(pypy): pyspark.sql.dataframe
Starting test(pypy): pyspark.sql.tests.test_arrow ArrowTests
Starting test(python2.7): pyspark.sql.dataframe
Starting test(python2.7): pyspark.sql.tests.test_arrow ArrowTests
Finished test(pypy): pyspark.sql.tests.test_arrow ArrowTests (0s) ... 22 
tests were skipped
Finished test(python2.7): pyspark.sql.tests.test_arrow ArrowTests (18s)
Finished test(python2.7): pyspark.sql.dataframe (50s)
Finished test(pypy): pyspark.sql.dataframe (52s)
Tests passed in 52 seconds

Skipped tests in pyspark.sql.tests.test_arrow ArrowTests with pypy:
test_createDataFrame_column_name_encoding 
(pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be 
installed; however, it was not found.'
test_createDataFrame_does_not_modify_input 
(pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pandas >= 0.19.2 must be 
installed; however, it was not found.'
test_createDataFrame_fallback_disabled 
(pyspark.sql.tests.test_arrow.ArrowTests) ... skipped 'Pan

[GitHub] spark pull request #23201: [SPARK-26246][SQL] Infer date and timestamp types...

2018-12-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23201#discussion_r238141831
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala
 ---
@@ -121,7 +121,18 @@ private[sql] class JsonInferSchema(options: 
JSONOptions) extends Serializable {
 DecimalType(bigDecimal.precision, bigDecimal.scale)
 }
 decimalTry.getOrElse(StringType)
-  case VALUE_STRING => StringType
+  case VALUE_STRING =>
+val stringValue = parser.getText
+if ((allCatch opt 
options.timestampFormat.parse(stringValue)).isDefined) {
--- End diff --

I haven't tested this myself, but I think it has the same problem 
(https://github.com/apache/spark/pull/23202#discussion_r238141702)


---




[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV

2018-12-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23202#discussion_r238141702
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 ---
@@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends 
Serializable {
   compatibleType(typeSoFar, 
tryParseDecimal(field)).getOrElse(StringType)
 case DoubleType => tryParseDouble(field)
 case TimestampType => tryParseTimestamp(field)
+case DateType => tryParseDate(field)
--- End diff --

I mean, IIRC, if the pattern is, for instance, `yyyy-MM-dd`, then 2010-10-10 and 
also 2018-12-02T21:04:00.123567 are parsed as dates, because the current parsing 
library only checks whether the beginning of the string matches and ignores the rest.

So, if we try date first, it will work for the default pattern, but if we use 
some weird patterns, it wouldn't work again.

I was thinking we could fix it if we use `DateTimeFormatter`.
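
A minimal, self-contained sketch of this behaviour (illustrative only, not Spark's actual inference code): a lenient legacy-style parse accepts the timestamp string as a date because it only consumes the matching prefix, while a strict `java.time.DateTimeFormatter` parse rejects the trailing text.

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter

import scala.util.Try

object DateParsePrefixDemo {
  def main(args: Array[String]): Unit = {
    val samples = Seq("2010-10-10", "2018-12-02T21:04:00.123567")

    // Lenient, legacy-style parsing: only the leading part of the string that
    // matches the pattern is consumed, so the timestamp also "parses" as a date.
    val lenient = new SimpleDateFormat("yyyy-MM-dd")
    samples.foreach { s =>
      println(s"SimpleDateFormat  $s -> ${Try(lenient.parse(s)).isSuccess}")           // true, true
    }

    // Strict parsing via java.time: unparsed trailing text fails the parse,
    // so only the plain date string is accepted.
    val strict = DateTimeFormatter.ofPattern("yyyy-MM-dd")
    samples.foreach { s =>
      println(s"DateTimeFormatter $s -> ${Try(LocalDate.parse(s, strict)).isSuccess}") // true, false
    }
  }
}
```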


---




[GitHub] spark pull request #23202: [SPARK-26248][SQL] Infer date type from CSV

2018-12-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23202#discussion_r238141062
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala
 ---
@@ -98,6 +100,7 @@ class CSVInferSchema(options: CSVOptions) extends 
Serializable {
   compatibleType(typeSoFar, 
tryParseDecimal(field)).getOrElse(StringType)
 case DoubleType => tryParseDouble(field)
 case TimestampType => tryParseTimestamp(field)
+case DateType => tryParseDate(field)
--- End diff --

The problem here is that it looks a bit odd that we try the date type later. IIRC 
the root cause is related to the date parsing library. Couldn't we try date first 
if we switch the parsing library? I thought that was in progress.


---




[GitHub] spark issue #23200: [SPARK-26033][SPARK-26034][PYTHON][FOLLOW-UP] Small clea...

2018-12-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23200
  
retest this please


---




[GitHub] spark issue #23200: [SPARK-26033][SPARK-26034][PYTHON][FOLLOW-UP] Small clea...

2018-12-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23200
  
cc @BryanCutler - it's a small minor followup.


---




[GitHub] spark pull request #23200: [SPARK-26033][SPARK-26034][PYTHON][FOLLOW-UP] Sma...

2018-12-02 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23200

[SPARK-26033][SPARK-26034][PYTHON][FOLLOW-UP] Small cleanup and 
deduplication in ml/mllib tests

## What changes were proposed in this pull request?

This PR is a small follow-up that moves some logic and functions into a 
smaller, localized scope, and deduplicates code.

## How was this patch tested?

Manually tested. Jenkins tests as well.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark 
followup-SPARK-26034-SPARK-26033

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23200.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23200


commit cbe13bf2fdc99f033746f5f3ff67c73fe7f27200
Author: Hyukjin Kwon 
Date:   2018-12-02T10:00:18Z

Small cleanup and deduplication




---




[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Skips Python resource limit on Win...

2018-12-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23055
  
Thanks, @rdblue.

Merged to master and branch-2.4.


---




[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...

2018-12-02 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23173
  
Merged to master.


---




[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-12-02 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23173#discussion_r238091149
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
 
   private var univocityGenerator: Option[UnivocityGenerator] = None
 
-  override def write(row: InternalRow): Unit = {
-val gen = univocityGenerator.getOrElse {
-  val charset = Charset.forName(params.charset)
-  val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
-  val newGen = new UnivocityGenerator(dataSchema, os, params)
-  univocityGenerator = Some(newGen)
-  newGen
-}
+  if (params.headerFlag) {
+val gen = getGen()
+gen.writeHeaders()
+  }
 
+  private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+val charset = Charset.forName(params.charset)
+val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
+val newGen = new UnivocityGenerator(dataSchema, os, params)
+univocityGenerator = Some(newGen)
+newGen
+  }
+
+  override def write(row: InternalRow): Unit = {
+val gen = getGen()
--- End diff --

OK. Shouldn't be a big deal.


---




[GitHub] spark issue #23192: [SPARK-26221][SQL] Add queryId to IncrementalExecution

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23192
  
Hey Reynold, it looks like the JIRA should have been `SPARK-26241`.


---




[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23173#discussion_r238064338
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
 
   private var univocityGenerator: Option[UnivocityGenerator] = None
 
-  override def write(row: InternalRow): Unit = {
-val gen = univocityGenerator.getOrElse {
-  val charset = Charset.forName(params.charset)
-  val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
-  val newGen = new UnivocityGenerator(dataSchema, os, params)
-  univocityGenerator = Some(newGen)
-  newGen
-}
+  if (params.headerFlag) {
+val gen = getGen()
+gen.writeHeaders()
+  }
 
+  private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+val charset = Charset.forName(params.charset)
+val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
+val newGen = new UnivocityGenerator(dataSchema, os, params)
+univocityGenerator = Some(newGen)
+newGen
+  }
+
+  override def write(row: InternalRow): Unit = {
+val gen = getGen()
--- End diff --

I see the problem. OrcFileFormat uses a flag approach. For instance:

```
private var isGeneratorInitiated = false

lazy val univocityGenerator = {
  isGeneratorInitiated = true
  val charset = Charset.forName(params.charset)
  val os = CodecStreams.createOutputStreamWriter(context, new Path(path), charset)
  new UnivocityGenerator(dataSchema, os, params)
}

// In close(), only touch the generator if it was actually created:
if (isGeneratorInitiated) {
  univocityGenerator.close()
}
```

Should be okay to stick to it.


---




[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23173#discussion_r238058669
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
 
   private var univocityGenerator: Option[UnivocityGenerator] = None
 
-  override def write(row: InternalRow): Unit = {
-val gen = univocityGenerator.getOrElse {
-  val charset = Charset.forName(params.charset)
-  val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
-  val newGen = new UnivocityGenerator(dataSchema, os, params)
-  univocityGenerator = Some(newGen)
-  newGen
-}
+  if (params.headerFlag) {
+val gen = getGen()
+gen.writeHeaders()
+  }
 
+  private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+val charset = Charset.forName(params.charset)
+val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
+val newGen = new UnivocityGenerator(dataSchema, os, params)
+univocityGenerator = Some(newGen)
+newGen
+  }
+
+  override def write(row: InternalRow): Unit = {
+val gen = getGen()
--- End diff --

Ah, it's `getOrElse`. Okay, but can we still simplify this logic? It looks a 
bit confusing. For instance, I think we can do this with a lazy val.


---




[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23173#discussion_r238058594
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala
 ---
@@ -171,15 +171,21 @@ private[csv] class CsvOutputWriter(
 
   private var univocityGenerator: Option[UnivocityGenerator] = None
 
-  override def write(row: InternalRow): Unit = {
-val gen = univocityGenerator.getOrElse {
-  val charset = Charset.forName(params.charset)
-  val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
-  val newGen = new UnivocityGenerator(dataSchema, os, params)
-  univocityGenerator = Some(newGen)
-  newGen
-}
+  if (params.headerFlag) {
+val gen = getGen()
+gen.writeHeaders()
+  }
 
+  private def getGen(): UnivocityGenerator = univocityGenerator.getOrElse {
+val charset = Charset.forName(params.charset)
+val os = CodecStreams.createOutputStreamWriter(context, new 
Path(path), charset)
+val newGen = new UnivocityGenerator(dataSchema, os, params)
+univocityGenerator = Some(newGen)
+newGen
+  }
+
+  override def write(row: InternalRow): Unit = {
+val gen = getGen()
--- End diff --

Wait ... is this going to create a `UnivocityGenerator` for each record?


---




[GitHub] spark issue #23173: [SPARK-26208][SQL] add headers to empty csv files when h...

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23173
  
Looks fine to me too.


---




[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23184#discussion_r238058149
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala 
---
@@ -225,4 +225,10 @@ private[sql] object SQLUtils extends Logging {
 }
 sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
   }
+
+  def createArrayType(elementType: DataType): ArrayType = 
DataTypes.createArrayType(elementType)
+
+  def createArrayType(elementType: Column): ArrayType = {
+new ArrayType(ExprUtils.evalTypeExpr(elementType.expr), true)
--- End diff --

Oops, it looks like it doesn't, actually. `elementType.expr.nullable` returns the 
nullability of the expression (like `schema_of_json` or a string literal), not 
anything related to its input schema.
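
A tiny sketch of that point using Catalyst's `Literal` directly (illustrative only, not the code under review): the nullability of the schema expression is about the literal value itself, not about the schema it describes.

```scala
import org.apache.spark.sql.catalyst.expressions.Literal

object SchemaExprNullabilitySketch {
  def main(args: Array[String]): Unit = {
    // The schema argument arrives as an expression, e.g. a string literal.
    val schemaExpr = Literal("city STRING, year INT")

    // Prints "false": the literal itself is non-null, which says nothing about
    // whether the parsed values (or the array's elements) can be null.
    println(schemaExpr.nullable)
  }
}
```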


---




[GitHub] spark pull request #23178: [SPARK-26216][SQL] Do not use case class as publi...

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23178#discussion_r238057361
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -38,114 +38,106 @@ import org.apache.spark.sql.types.DataType
  * @since 1.3.0
  */
 @Stable
--- End diff --

yea actually I was wondering about the same thing.


---




[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23184#discussion_r238057238
  
--- Diff: R/pkg/R/functions.R ---
@@ -2254,40 +2255,48 @@ setMethod("date_format", signature(y = "Column", x 
= "character"),
 column(jc)
   })
 
+setClassUnion("characterOrstructTypeOrColumn", c("character", 
"structType", "Column"))
+
 #' @details
 #' \code{from_json}: Parses a column containing a JSON string into a 
Column of \code{structType}
 #' with the specified \code{schema} or array of \code{structType} if 
\code{as.json.array} is set
 #' to \code{TRUE}. If the string is unparseable, the Column will contain 
the value NA.
 #'
 #' @rdname column_collection_functions
 #' @param as.json.array indicating if input string is JSON array of 
objects or a single object.
-#' @aliases from_json from_json,Column,characterOrstructType-method
+#' @aliases from_json from_json,Column,characterOrstructTypeOrColumn-method
 #' @examples
 #'
 #' \dontrun{
 #' df2 <- sql("SELECT named_struct('date', cast('2000-01-01' as date)) as 
d")
 #' df2 <- mutate(df2, d2 = to_json(df2$d, dateFormat = 'dd/MM/'))
 #' schema <- structType(structField("date", "string"))
 #' head(select(df2, from_json(df2$d2, schema, dateFormat = 'dd/MM/')))
-
 #' df2 <- sql("SELECT named_struct('name', 'Bob') as people")
 #' df2 <- mutate(df2, people_json = to_json(df2$people))
 #' schema <- structType(structField("name", "string"))
 #' head(select(df2, from_json(df2$people_json, schema)))
-#' head(select(df2, from_json(df2$people_json, "name STRING")))}
+#' head(select(df2, from_json(df2$people_json, "name STRING")))
+#' head(select(df2, from_json(df2$people_json, 
schema_of_json(head(df2)$people_json}
 #' @note from_json since 2.2.0
-setMethod("from_json", signature(x = "Column", schema = 
"characterOrstructType"),
+setMethod("from_json", signature(x = "Column", schema = 
"characterOrstructTypeOrColumn"),
   function(x, schema, as.json.array = FALSE, ...) {
 if (is.character(schema)) {
-  schema <- structType(schema)
+  jschema <- structType(schema)$jobj
+} else if (class(schema) == "structType") {
+  jschema <- schema$jobj
+} else {
+  jschema <- schema@jc
 }
 
 if (as.json.array) {
-  jschema <- 
callJStatic("org.apache.spark.sql.types.DataTypes",
- "createArrayType",
- schema$jobj)
-} else {
-  jschema <- schema$jobj
+  # This case is R-specifically different. Unlike Scala and 
Python side,
--- End diff --

If so, the provided schema is wrapped in an Array. The test cases are here: 
https://github.com/apache/spark/pull/23184/files#diff-d4011863c8b176830365b2f224a84bf2R1707


---




[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23184#discussion_r238057208
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala 
---
@@ -225,4 +225,10 @@ private[sql] object SQLUtils extends Logging {
 }
 sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
   }
+
+  def createArrayType(elementType: DataType): ArrayType = 
DataTypes.createArrayType(elementType)
+
+  def createArrayType(elementType: Column): ArrayType = {
+new ArrayType(ExprUtils.evalTypeExpr(elementType.expr), true)
--- End diff --

Yup, that sounds more correct.


---




[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23184#discussion_r238057161
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala 
---
@@ -225,4 +225,10 @@ private[sql] object SQLUtils extends Logging {
 }
 sparkSession.sessionState.catalog.listTables(db).map(_.table).toArray
   }
+
+  def createArrayType(elementType: DataType): ArrayType = 
DataTypes.createArrayType(elementType)
--- End diff --

Yea, I remember that. I thought this case is a bit different from the other 
instances, actually. It reduces the code complexity on the R side because the R 
side directly calls these overloaded methods. For instance, currently it's 
called like:

```r
jschema <-  callJStatic("org.apache.spark.sql.api.r.SQLUtils",
"createArrayType",
jschema)
```

but if we remove those, it should be like:

```r
if (class(schema) == "dataType") {
  jschema <- callJStatic("org.apache.spark.sql.types.DataTypes",
 "createArrayType",
 schema$jobj)
} else {
  jschema <- callJStatic("org.apache.spark.sql.api.r.SQLUtils",
 "createArrayType",
 schema$jobj)
}
```

Let me try to remove this one anyway.


---




[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23184#discussion_r238056981
  
--- Diff: R/pkg/R/functions.R ---
@@ -2254,40 +2255,48 @@ setMethod("date_format", signature(y = "Column", x 
= "character"),
 column(jc)
   })
 
+setClassUnion("characterOrstructTypeOrColumn", c("character", 
"structType", "Column"))
--- End diff --

Yup, I agree. Would you mind if I do this separately? I roughly checked with 
`grep`, and it looks like:

```
./pkg/R/DataFrame.R:setClassUnion("characterOrstructType", c("character", 
"structType"))
./pkg/R/DataFrame.R:setClassUnion("numericOrcharacter", c("numeric", 
"character"))
./pkg/R/DataFrame.R:setClassUnion("characterOrColumn", c("character", 
"Column"))
./pkg/R/DataFrame.R:setClassUnion("numericOrColumn", c("numeric", "Column"))
```


---




[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23184#discussion_r238056921
  
--- Diff: R/pkg/R/functions.R ---
@@ -202,8 +202,9 @@ NULL
 #'  \itemize{
 #'  \item \code{from_json}: a structType object to use as the 
schema to use
 #'  when parsing the JSON string. Since Spark 2.3, the 
DDL-formatted string is
-#'  also supported for the schema.
-#'  \item \code{from_csv}: a DDL-formatted string
+#'  also supported for the schema. Since Spark 3.0, 
\code{schema_of_json} or
+#'  a string literal can also be accepted.
--- End diff --

Basically yea, the same. The only difference is that it also allows a string 
literal (not only a string). Let me try to clarify it.


---




[GitHub] spark issue #23151: [SPARK-26180][CORE][TEST] Reuse withTempDir function to ...

2018-12-01 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23151
  
Merged to master.


---




[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Skips Python resource limit on Win...

2018-11-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23055
  
@vanzin and @rdblue, I updated the doc because it doesn't sound wrong to me. 
But, for clarification, we shouldn't really document that we support something 
that's not tested (in particular a case like the one above, where a failure was 
found). Also, IMHO, it's better to keep it simple when there's a Windows issue, 
in terms of maintenance, since not many Windows maintainers exist in Spark.

The main functionality of that configuration is limiting resources, if I'm 
not mistaken; the allocation in other modes is secondary. Technically, someone 
should test it and fix the doc to show how it works in another PR.


---




[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Skips Python resource limit...

2018-11-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r238048542
  
--- Diff: docs/configuration.md ---
@@ -190,6 +190,8 @@ of the most common options to set are:
 and it is up to the application to avoid exceeding the overhead memory 
space
 shared with other non-JVM processes. When PySpark is run in YARN or 
Kubernetes, this memory
 is added to executor resource requests.
+
+NOTE: This configuration is not supported on Windows.
--- End diff --

> Python memory usage may not be limited on platforms that do not support 
resource limiting, such as Windows

Let me change it to this.


---




[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Skips Python resource limit...

2018-11-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r238048453
  
--- Diff: docs/configuration.md ---
@@ -190,6 +190,8 @@ of the most common options to set are:
 and it is up to the application to avoid exceeding the overhead memory 
space
 shared with other non-JVM processes. When PySpark is run in YARN or 
Kubernetes, this memory
 is added to executor resource requests.
+
+NOTE: This configuration is not supported on Windows.
--- End diff --

I would say it's just supported or unsupported, though, rather than making it 
complicated. I'm not 100% sure how it works on Windows with other modes. For 
instance, the extra memory will probably be allocated via YARN, but it looks like 
no one has tested it before. Technically, it does not work in local mode either.


---




[GitHub] spark issue #20788: [SPARK-23647][PYTHON][SQL] Adds more types for hint in p...

2018-11-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20788
  
retest this please


---




[GitHub] spark issue #23151: [SPARK-26180][CORE][TEST] Reuse withTempDir function to ...

2018-11-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23151
  
retest this please


---




[GitHub] spark issue #23185: [MINOR][Docs] Fix typos

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23185
  
Thanks for skimming the whole doc. cc @srowen.


---




[GitHub] spark issue #23185: [MINOR][Docs] Fix typos

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23185
  
ok to test


---




[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23162
  
A few nit comments, because I thought we should avoid negative comparisons; 
however, let me leave it to @srowen.


---




[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23162#discussion_r237747341
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -430,8 +430,8 @@ package object config {
   .doc("The chunk size in bytes during writing out the bytes of 
ChunkedByteBuffer.")
   .bytesConf(ByteUnit.BYTE)
   .checkValue(_ <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH,
-"The chunk size during writing out the bytes of" +
-" ChunkedByteBuffer should not larger than Int.MaxValue - 15.")
+"The chunk size during writing out the bytes of ChunkedByteBuffer 
should" +
+  s" not be greater than 
${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
--- End diff --

not be greater than -> less than or equal to
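
For context, a hypothetical config entry (not one that exists in Spark) showing the reworded message in place, assuming the internal `ConfigBuilder` API used in the diff:

```scala
package org.apache.spark.example // hypothetical package inside the spark namespace

import org.apache.spark.internal.config.ConfigBuilder
import org.apache.spark.network.util.ByteUnit
import org.apache.spark.unsafe.array.ByteArrayMethods

object ExampleConfig {
  // The check is an inclusive upper bound, so the message says "less than or equal to".
  val EXAMPLE_CHUNK_SIZE = ConfigBuilder("spark.example.chunkSize")
    .doc("Example chunk size in bytes.")
    .bytesConf(ByteUnit.BYTE)
    .checkValue(_ <= ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH,
      "The chunk size should be less than or equal to " +
        s"${ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH}.")
    .createWithDefault(64 * 1024 * 1024)
}
```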


---




[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23162#discussion_r237746963
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -513,7 +513,7 @@ package object config {
 "is written in unsafe shuffle writer. In KiB unless otherwise 
specified.")
   .bytesConf(ByteUnit.KiB)
   .checkValue(v => v > 0 && v <= 
ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH / 1024,
-s"The buffer size must be greater than 0 and less than" +
+s"The buffer size must be positive and not greater than" +
--- End diff --

not greater than -> less than or equal to


---




[GitHub] spark pull request #23162: [MINOR][DOC] Correct some document description er...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23162#discussion_r237747015
  
--- Diff: 
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -503,7 +503,7 @@ package object config {
 "made in creating intermediate shuffle files.")
   .bytesConf(ByteUnit.KiB)
   .checkValue(v => v > 0 && v <= 
ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH / 1024,
-s"The file buffer size must be greater than 0 and less than" +
+s"The file buffer size must be positive and not greater than" +
--- End diff --

not greater than -> less than or equal to


---




[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23151#discussion_r237746374
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala ---
@@ -1134,39 +1130,40 @@ class SparkSubmitSuite
 val hadoopConf = new Configuration()
 updateConfWithFakeS3Fs(hadoopConf)
 
-val tmpDir = Utils.createTempDir()
-val pyFile = File.createTempFile("tmpPy", ".egg", tmpDir)
+withTempDir { tmpDir =>
+  val pyFile = File.createTempFile("tmpPy", ".egg", tmpDir)
 
-val args = Seq(
-  "--class", UserClasspathFirstTest.getClass.getName.stripPrefix("$"),
-  "--name", "testApp",
-  "--master", "yarn",
-  "--deploy-mode", "client",
-  "--py-files", s"s3a://${pyFile.getAbsolutePath}",
-  "spark-internal"
-)
+  val args = Seq(
+"--class", 
UserClasspathFirstTest.getClass.getName.stripPrefix("$"),
+"--name", "testApp",
+"--master", "yarn",
+"--deploy-mode", "client",
+"--py-files", s"s3a://${pyFile.getAbsolutePath}",
+"spark-internal"
+  )
 
-val appArgs = new SparkSubmitArguments(args)
-val (_, _, conf, _) = submit.prepareSubmitEnvironment(appArgs, conf = 
Some(hadoopConf))
+  val appArgs = new SparkSubmitArguments(args)
+  val (_, _, conf, _) = submit.prepareSubmitEnvironment(appArgs, conf 
= Some(hadoopConf))
 
-conf.get(PY_FILES.key) should be (s"s3a://${pyFile.getAbsolutePath}")
-conf.get("spark.submit.pyFiles") should (startWith("/"))
+  conf.get(PY_FILES.key) should be(s"s3a://${pyFile.getAbsolutePath}")
--- End diff --

Ditto. Technically it would be better to use assert and avoid the infix 
notation, but I think we don't have to do it here.
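
As a toy illustration of the two styles (hypothetical suite and values, assuming the ScalaTest 3.0-style `FunSuite`/`Matchers` imports Spark used at the time):

```scala
import org.scalatest.{FunSuite, Matchers}

class AssertionStyleSuite extends FunSuite with Matchers {
  test("infix matcher notation, as in the diff") {
    val path = "s3a://bucket/tmpPy.egg"
    path should be ("s3a://bucket/tmpPy.egg")
  }

  test("plain assert, the style suggested above") {
    val path = "s3a://bucket/tmpPy.egg"
    assert(path === "s3a://bucket/tmpPy.egg")
  }
}
```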


---




[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23151#discussion_r237746228
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala ---
@@ -985,37 +985,38 @@ class SparkSubmitSuite
 val hadoopConf = new Configuration()
 updateConfWithFakeS3Fs(hadoopConf)
 
-val tmpDir = Utils.createTempDir()
-val file = File.createTempFile("tmpFile", "", tmpDir)
-val pyFile = File.createTempFile("tmpPy", ".egg", tmpDir)
-val mainResource = File.createTempFile("tmpPy", ".py", tmpDir)
-val tmpJar = TestUtils.createJarWithFiles(Map("test.resource" -> 
"USER"), tmpDir)
-val tmpJarPath = s"s3a://${new File(tmpJar.toURI).getAbsolutePath}"
+withTempDir { tmpDir =>
+  val file = File.createTempFile("tmpFile", "", tmpDir)
+  val pyFile = File.createTempFile("tmpPy", ".egg", tmpDir)
+  val mainResource = File.createTempFile("tmpPy", ".py", tmpDir)
+  val tmpJar = TestUtils.createJarWithFiles(Map("test.resource" -> 
"USER"), tmpDir)
+  val tmpJarPath = s"s3a://${new File(tmpJar.toURI).getAbsolutePath}"
 
-val args = Seq(
-  "--class", UserClasspathFirstTest.getClass.getName.stripPrefix("$"),
-  "--name", "testApp",
-  "--master", "yarn",
-  "--deploy-mode", "client",
-  "--jars", tmpJarPath,
-  "--files", s"s3a://${file.getAbsolutePath}",
-  "--py-files", s"s3a://${pyFile.getAbsolutePath}",
-  s"s3a://$mainResource"
+  val args = Seq(
+"--class", 
UserClasspathFirstTest.getClass.getName.stripPrefix("$"),
+"--name", "testApp",
+"--master", "yarn",
+"--deploy-mode", "client",
+"--jars", tmpJarPath,
+"--files", s"s3a://${file.getAbsolutePath}",
+"--py-files", s"s3a://${pyFile.getAbsolutePath}",
+s"s3a://$mainResource"
   )
 
-val appArgs = new SparkSubmitArguments(args)
-val (_, _, conf, _) = submit.prepareSubmitEnvironment(appArgs, conf = 
Some(hadoopConf))
+  val appArgs = new SparkSubmitArguments(args)
+  val (_, _, conf, _) = submit.prepareSubmitEnvironment(appArgs, conf 
= Some(hadoopConf))
 
-// All the resources should still be remote paths, so that YARN client 
will not upload again.
-conf.get("spark.yarn.dist.jars") should be (tmpJarPath)
--- End diff --

I wouldn't change those spaces alone, though. Let's leave them as they were.


---




[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22939
  
@felixcheung, I tested the case where the user passes in a column that is not a 
literal string, and it shows the results below:

```
> json <- '{"name":"Bob"}'
> df <- sql("SELECT * FROM range(1)")
> head(select(df, schema_of_json(df$id)))
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.sql.AnalysisException: cannot resolve 
'schema_of_json(`id`)' due to data type mismatch: The input json should be a 
string literal and not null; however, got `id`.;;
'Project [schema_of_json(id#0L) AS schema_of_json(id)#2]
+- Project [id#0L]
   +- Range (0, 1, step=1, splits=None)

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
...
```

```
> csv <- "Amsterdam,2018"
> df <- sql("SELECT * FROM range(1)")
> head(select(df, schema_of_csv(df$id)))
Error in handleErrors(returnStatus, conn) :
  org.apache.spark.sql.AnalysisException: cannot resolve 
'schema_of_csv(`id`)' due to data type mismatch: The input csv should be a 
string literal and not null; however, got `id`.;;
'Project [schema_of_csv(id#3L) AS schema_of_csv(id)#5]
+- Project [id#3L]
   +- Range (0, 1, step=1, splits=None)

at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
...
```



---




[GitHub] spark issue #23184: [SPARK-26227][R] from_[csv|json] should accept schema_of...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23184
  
cc @felixcheung, @viirya and @MaxGekk 


---




[GitHub] spark pull request #23184: [SPARK-26227][R] from_[csv|json] should accept sc...

2018-11-29 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/23184

[SPARK-26227][R] from_[csv|json] should accept schema_of_[csv|json] in R API

## What changes were proposed in this pull request?

**1. Document `from_csv(..., schema_of_csv(...))` support:**

```R
csv <- "Amsterdam,2018"
df <- sql(paste0("SELECT '", csv, "' as csv"))
head(select(df, from_csv(df$csv, schema_of_csv(csv))))
```

```
from_csv(csv)
1 Amsterdam, 2018
```


**2. Allow `from_json(..., schema_of_json(...))`**

Before:


```R
df2 <- sql("SELECT named_struct('name', 'Bob') as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
head(select(df2, from_json(df2$people_json, schema_of_json(head(df2)$people_json))))
```

```
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘from_json’ for 
signature ‘"Column", "Column"’
```


After:

```R
df2 <- sql("SELECT named_struct('name', 'Bob') as people")
df2 <- mutate(df2, people_json = to_json(df2$people))
head(select(df2, from_json(df2$people_json, schema_of_json(head(df2)$people_json))))
```

```
  from_json(people_json)
1Bob
```

**3. (While I'm here) Allow `structType` as schema for `from_csv` support 
to match with `from_json`.**

Before:

```R
csv <- "Amsterdam,2018"
df <- sql(paste0("SELECT '", csv, "' as csv"))
head(select(df, from_csv(df$csv, structType("city STRING, year INT"))))
```

```
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘from_csv’ for 
signature ‘"Column", "structType"’
```

After:

```R
csv <- "Amsterdam,2018"
df <- sql(paste0("SELECT '", csv, "' as csv"))
head(select(df, from_csv(df$csv, structType("city STRING, year INT"))))
```

```
from_csv(csv)
1 Amsterdam, 2018
```
    

    
## How was this patch tested?

Manually tested and unittests were added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-26227-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23184


commit 193d68856769f945349449469ad6e536449ec5f0
Author: Hyukjin Kwon 
Date:   2018-11-30T03:12:00Z

from_[csv|json] should accept schema_of_[csv|json] in R API




---




[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22939
  
Thank you, @felixcheung for approving this.


---




[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r237725742
  
--- Diff: python/pyspark/worker.py ---
@@ -22,7 +22,12 @@
 import os
 import sys
 import time
-import resource
+# 'resource' is a Unix specific module.
+has_resource_module = True
+try:
+import resource
+except ImportError:
+has_resource_module = False
--- End diff --

OK, let me remove that check here for now. I think we can discuss the whole 
problem of how it's implemented later.


---




[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22939
  
Merged to master.


---




[GitHub] spark pull request #22939: [SPARK-25446][R] Add schema_of_json() and schema_...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22939#discussion_r237727671
  
--- Diff: R/pkg/R/functions.R ---
@@ -2230,6 +2237,32 @@ setMethod("from_json", signature(x = "Column", 
schema = "characterOrstructType")
 column(jc)
   })
 
+#' @details
+#' \code{schema_of_json}: Parses a JSON string and infers its schema in 
DDL format.
+#'
+#' @rdname column_collection_functions
+#' @aliases schema_of_json schema_of_json,characterOrColumn-method
+#' @examples
+#'
+#' \dontrun{
+#' json <- '{"name":"Bob"}'
+#' df <- sql("SELECT * FROM range(1)")
+#' head(select(df, schema_of_json(json)))}
+#' @note schema_of_json since 3.0.0
+setMethod("schema_of_json", signature(x = "characterOrColumn"),
+  function(x, ...) {
+if (class(x) == "character") {
+  col <- callJStatic("org.apache.spark.sql.functions", "lit", 
x)
+} else {
+  col <- x@jc
--- End diff --

Yea, I agree. It will throw an analysis exception in that case. I also 
sympathize with the concerns here, and we're somewhat unclear about this - so I 
just wanted to keep it restricted for now.

I'm going to open another PR related to this as a follow-up (for 
https://github.com/apache/spark/pull/22939#issuecomment-435672008). While I am 
there, I will test the case where the user passes in a column that is not a 
literal string.


---




[GitHub] spark issue #23182: Config change followup to [SPARK-26177] Automated format...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23182
  
retest this please


---




[GitHub] spark pull request #23177: [SPARK-26212][Build][test-maven] Upgrade maven ve...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23177#discussion_r237719224
  
--- Diff: pom.xml ---
@@ -114,7 +114,7 @@
 1.8
 ${java.version}
 ${java.version}
-3.5.4
+3.6.0
--- End diff --

@kiszk, I think we should update here as well 
https://github.com/apache/spark/blob/master/docs/building-spark.md#apache-maven 
(see https://github.com/apache/spark/pull/22781)


---




[GitHub] spark issue #23177: [SPARK-26212][Build][test-maven] Upgrade maven version t...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23177
  
adding @dongjoon-hyun 


---




[GitHub] spark issue #23178: [SPARK-26216][SQL] Do not use case class as public API (...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23178
  
+1 as well


---




[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r237716807
  
--- Diff: python/pyspark/worker.py ---
@@ -22,7 +22,12 @@
 import os
 import sys
 import time
-import resource
+# 'resource' is a Unix specific module.
+has_resource_module = True
+try:
+import resource
+except ImportError:
+has_resource_module = False
--- End diff --

Just for consistency, and to avoid the case where the value is set on the JVM 
side but works differently on the Python side.


---




[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r237716592
  
--- Diff: python/pyspark/worker.py ---
@@ -22,7 +22,12 @@
 import os
 import sys
 import time
-import resource
+# 'resource' is a Unix specific module.
+has_resource_module = True
+try:
+import resource
+except ImportError:
+has_resource_module = False
--- End diff --

> That is completely different. Because that affects how the Java side 
talks to the Python side.

Not completely different, though. Ideally we could have one script that handles 
both cases. It deals with OS-specific differences on the Python side.


---




[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r237716012
  
--- Diff: python/pyspark/worker.py ---
@@ -22,7 +22,12 @@
 import os
 import sys
 import time
-import resource
+# 'resource' is a Unix specific module.
+has_resource_module = True
+try:
+import resource
+except ImportError:
+has_resource_module = False
--- End diff --

Yea, so the check can be done on the Python side and the JVM side. Yes, I am 
explicitly disabling it on Windows. I think the proper assumption is that we 
note it doesn't work if it's not tested. If someone finds it working, then that 
patch can explicitly enable it with proper documentation.


---




[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r237715420
  
--- Diff: python/pyspark/worker.py ---
@@ -22,7 +22,12 @@
 import os
 import sys
 import time
-import resource
+# 'resource' is a Unix specific module.
+has_resource_module = True
+try:
+import resource
+except ImportError:
+has_resource_module = False
--- End diff --

What I meant by "What do you mean 'Python can't fail'?" was a quote of 
`Python should not fail if resource cannot be imported` from @rdblue.

I left the last comment before seeing your comment, @vanzin. Let me leave an 
answer for that soon.



---




[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r237714583
  
--- Diff: python/pyspark/worker.py ---
@@ -22,7 +22,12 @@
 import os
 import sys
 import time
-import resource
+# 'resource' is a Unix specific module.
+has_resource_module = True
+try:
+import resource
+except ImportError:
+has_resource_module = False
--- End diff --

Can I add some more people for input? Let me just follow the majority. cc 
@ueshin, @BryanCutler.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r237714288
  
--- Diff: python/pyspark/worker.py ---
@@ -22,7 +22,12 @@
 import os
 import sys
 import time
-import resource
+# 'resource' is a Unix specific module.
+has_resource_module = True
+try:
+import resource
+except ImportError:
+has_resource_module = False
--- End diff --

For instance, forking in Python is not possible on Windows, so we do that 
control on the JVM side.
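
Just to illustrate the OS dependence (a quick sketch, not Spark code): `os.fork` 
simply does not exist on Windows, so any fork-based worker handling has to be 
decided up front.

```python
import os
import platform

# os.fork is only provided on Unix-like systems; on Windows the attribute is
# absent entirely, so fork-based worker reuse cannot even be attempted there.
print(platform.system(), "supports os.fork:", hasattr(os, "fork"))
```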


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r237713255
  
--- Diff: python/pyspark/worker.py ---
@@ -22,7 +22,12 @@
 import os
 import sys
 import time
-import resource
+# 'resource' is a Unix specific module.
+has_resource_module = True
+try:
+import resource
+except ImportError:
+has_resource_module = False
--- End diff --

-1 for adding only the Python check from my side. I don't understand why it's 
controversial either. Python can't fail, because `resource` exists on all 
Unix-based systems; the only exception case is Windows, which we already know 
about on the JVM side.

Configuration control is better done on the JVM side when possible, and I don't 
think we should let this feature work differently on Windows; this should be 
tested before we document that we support it.
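
For context, the whole point of the `resource` module here is that the worker 
uses it to cap its own memory. A minimal sketch of that logic (not the exact 
worker.py code; the env var name is an assumption about what the JVM side 
exports):

```python
import os

# Unix-only module, guarded exactly as in the diff above.
has_resource_module = True
try:
    import resource
except ImportError:
    has_resource_module = False

# Assumed env var name for the per-worker memory setting from the JVM side.
memory_limit_mb = int(os.environ.get("PYSPARK_EXECUTOR_MEMORY_MB", "-1"))

if memory_limit_mb > 0 and has_resource_module:
    try:
        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        new_limit = memory_limit_mb * 1024 * 1024
        if soft == resource.RLIM_INFINITY or new_limit < soft:
            # Cap the worker's address space; exceeding it then surfaces as a
            # Python MemoryError instead of the container being killed.
            resource.setrlimit(resource.RLIMIT_AS, (new_limit, new_limit))
    except (resource.error, OSError, ValueError) as e:
        print("Failed to set memory limit: {0}".format(e))
```

On Windows none of this can run, which is why the question is only where (JVM or 
worker) that fact gets checked.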


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23173: [SPARK-26208][SQL] add headers to empty csv files...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23173#discussion_r237711250
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala
 ---
@@ -57,6 +57,9 @@ abstract class OutputWriterFactory extends Serializable {
  * executor side.  This instance is used to persist rows to this single 
output file.
  */
 abstract class OutputWriter {
+  /** Initializes before writing any rows. Invoked on executor size. */
+  def init(): Unit = {}
--- End diff --

I don't think we need to add `init` only because of the one case where CSV needs it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22979: [SPARK-25977][SQL] Parsing decimals from CSV using local...

2018-11-29 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22979
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r237021410
  
--- Diff: python/pyspark/worker.py ---
@@ -22,7 +22,12 @@
 import os
 import sys
 import time
-import resource
+# 'resource' is a Unix specific module.
+has_resource_module = True
+try:
+import resource
+except ImportError:
+has_resource_module = False
--- End diff --

@vanzin, I'm going to remove this flag. Does it then look okay to you?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r236926113
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

I don't think I'm the only one, though.

> Why is this code needed? 

I explained it above multiple times. See above.

> it's not doing anything useful if you keep the python check around

See above. I wanted to delete them, but added them per a review comment.

>  The JVM doesn't understand exactly what Python supports of not, it's 
better to let the python code decide that.

Not really. The `resource` module is a Python built-in module that exists on 
Unix-based systems. It just does not exist on Windows.

> You say we should disable the feature on Windows. The python-side changes 
already do that. 

I explained it above. See the first change I proposed, 2d3315a. It relies on 
the environment variable.

> We should not remove the extra memory requested from the resource manager 
just because you're running on Windows - you'll still need that memory, you'll 
just get a different error message if you end up using more than you requested.

Yea, I know it would probably work. My question is: has it ever been tested? One 
failure case was found, and it looks a bit odd that we document it as working. 
It's not even tested; shall we keep it simple rather than make it work 
differently until it's tested?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23145: [MINOR][Docs][WIP] Fix Typos

2018-11-27 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23145
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r236692833
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

BTW, I don't get why we should argue like this over this one. Isn't it a 
minor point?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r236512196
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

I mean, even if the allocation succeeds in Yarn but the Python worker has no 
control over it, what's the point?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r236494667
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

I mean .. what I am disabling here is setting the Python memory limit on 
Windows; it looks like the Yarn side can still allocate more, but that's not tested.

I'm thinking we should ideally disable the whole feature on Windows, but I 
couldn't test it on Windows either, because I don't have a Windows Yarn cluster (I 
only have one Windows machine with a dev environment).

I was trying to at least document that it doesn't work, because it's going 
to behave differently from other OSes given that Windows lacks the `resource` module 
(as for the current status, PySpark apps that use Python workers do not work at all 
on Windows, and we don't necessarily document that it works on Windows). I think 
it's more correct to say it does not work because it's not tested (or at least 
simply say it's not guaranteed on Windows).

The reasons I prefer to check it on the JVM side instead are:
1. It's quite usual to check on the JVM side when possible and keep 
the worker simple. It could make the worker coupled to the JVM, but it's already 
strongly coupled.

2. If the Yarn code can also be tested, and if it doesn't work, it should also 
be disabled on the JVM side.
If we can find a workaround on Windows, we can fix the Python worker and 
enable it.

3. If it's checked on the JVM side ahead of time, it avoids one inconsistent state: 
`memoryMb` being enabled on the JVM side while Python skips it (see the sketch below).
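
To make point 3 concrete, here is a rough worker-side sketch (env var name 
assumed) of the state being argued about: the JVM exports the setting, but the 
worker silently skips it when `resource` is missing.

```python
import os

# Assumed env var name for the memory setting exported by the JVM side.
requested_mb = int(os.environ.get("PYSPARK_EXECUTOR_MEMORY_MB", "-1"))

try:
    import resource  # noqa: F401
    can_limit = True
except ImportError:  # e.g. Windows
    can_limit = False

if requested_mb > 0 and not can_limit:
    # The inconsistent state: the JVM believes the limit is in effect while
    # the worker cannot apply it at all.
    print("memory limit of %d MB requested but cannot be applied" % requested_mb)
```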



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r236488396
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

@vanzin, I mean the first change I did at 
https://github.com/apache/spark/pull/23055/commits/2d3315a7dab429abc4d9ef5ed7f8f5484e8421f1


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r236487153
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

> The point Ryan and I are making is that there's no point in having this 
check in the Java code, because the Python code already handles it. It doesn't 
work on Windows, and the updated Python code handles that.

My point is that we should remove the Python condition.

Also, I wonder whether this whole feature itself, introduced at 
https://github.com/apache/spark/pull/21977, works or not, since it looks like it 
was never tested.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23117: [WIP][SPARK-7721][INFRA] Run and generate test coverage ...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23117
  
> does this add time to running the tests?

Given my local tests, the time difference looked like a slight increase .. I want to 
see how it works in Jenkins ..


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23117: [WIP][SPARK-7721][INFRA] Run and generate test co...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23117#discussion_r236486298
  
--- Diff: dev/run-tests.py ---
@@ -434,6 +434,63 @@ def run_python_tests(test_modules, parallelism):
 run_cmd(command)
 
 
+def run_python_tests_with_coverage(test_modules, parallelism):
+set_title_and_block("Running PySpark tests with coverage report", 
"BLOCK_PYSPARK_UNIT_TESTS")
+
+command = [os.path.join(SPARK_HOME, "python", 
"run-tests-with-coverage")]
+if test_modules != [modules.root]:
+command.append("--modules=%s" % ','.join(m.name for m in 
test_modules))
+command.append("--parallelism=%i" % parallelism)
+run_cmd(command)
+post_python_tests_results()
+
+
+def post_python_tests_results():
+if "SPARK_TEST_KEY" not in os.environ:
+print("[error] 'SPARK_TEST_KEY' environment variable was not set. 
Unable to post"
+  "PySpark coverage results.")
+sys.exit(1)
--- End diff --

@shaneknapp, can you add another environment variable that indicates the PR 
builder and spark-master-test-sbt-hadoop-2.7, where we're going to run Python 
coverage? I can check it and explicitly enable coverage only under that condition.

True, if the condition below (which I checked before at #17669):

```python
(os.environ.get("AMPLAB_JENKINS_BUILD_PROFILE", "") == "hadoop2.7"
 and os.environ.get("SPARK_BRANCH", "") == "master"
 and os.environ.get("AMPLAB_JENKINS", "") == "true"
 and os.environ.get("AMPLAB_JENKINS_BUILD_TOOL", "") == "sbt")
```

is `True` in a Jenkins build or in another user's environment, it might cause some 
problems (even though it looks quite unlikely).

Similarly, if `AMPLAB_JENKINS` is set in the environment of users who 
run the tests locally, it wouldn't work anyway, though.
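
What I have in mind is roughly the following (a sketch only; the variable name is 
hypothetical and would be whatever Jenkins ends up exporting):

```python
import os

def is_coverage_build():
    # Hypothetical flag name; an explicit variable set only on the PR builder
    # and spark-master-test-sbt-hadoop-2.7 jobs would be checked here instead
    # of inferring the condition from build profiles.
    return os.environ.get("SPARK_TEST_PYTHON_COVERAGE_BUILD", "") == "true"

if is_coverage_build():
    print("Would run PySpark tests with coverage and post the results.")
else:
    print("Would run the regular PySpark tests.")
```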


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r236484520
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

To be honest, I'm skeptical about whether it really works on Windows.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r236484322
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

Thanks, Vanzin. However, it really brings complexity on the other hand. 
Maintaining code for Windows is actually quite costly. We can simply make it 
clear-cut whether it works or not by disabling it on Windows, rather than making 
the feature complicated by introducing another state.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r236481476
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

> The allocation on the JVM side is still the correct behavior.

Also, has it ever been tested on Windows?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r236479633
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

The benefit is, as I said, less complexity and an explicit 
condition/documentation. How are we going to document this then? That it's 
automatically disabled when the `resource` package is not found in the specified 
Python executable on the executor side?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r236479186
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

If more conditions to skip this are found, then we should handle them 
later on the JVM side if possible. If not, we can discuss it there later. If it's 
absolutely required to skip on the worker side, we can do that later and remove 
the JVM-side check.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r236478702
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -74,8 +74,13 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
   private val reuseWorker = conf.getBoolean("spark.python.worker.reuse", 
true)
   // each python worker gets an equal part of the allocation. the worker 
pool will grow to the
   // number of concurrent tasks, which is determined by the number of 
cores in this executor.
-  private val memoryMb = conf.get(PYSPARK_EXECUTOR_MEMORY)
+  private val memoryMb = if (Utils.isWindows) {
--- End diff --

Strictly speaking, we should move it into the JVM rather than adding more controls 
in the workers. Otherwise we will end up sending a bunch of data and environment 
variables into the Python workers. It already sends the environment variable that 
indicates this configuration is set, and I was thinking what we should do 
is disable it under a certain condition rather than checking for the package and 
skipping on the Python worker side.

I would rather remove the Python side's check then.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23111: [SPARK-26148][PYTHON][TESTS] Increases default paralleli...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23111
  
Ah, it just increases the number of Python threads that run each suite. 
Shouldn't be a big deal. I'm keeping my eyes on the build.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23102: [SPARK-26137][CORE] Use Java system property "fil...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23102#discussion_r236341207
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala 
---
@@ -19,11 +19,11 @@ package org.apache.spark.deploy
 
 import java.io.File
 import java.net.URI
+import java.util.regex.Pattern
 
 import org.apache.commons.lang3.StringUtils
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileSystem, Path}
-
--- End diff --

The previous import style was correct. It complied with the project's style.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23102: [SPARK-26137][CORE] Use Java system property "fil...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23102#discussion_r236341337
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala 
---
@@ -61,11 +61,13 @@ private[deploy] object DependencyUtils extends Logging {
   hadoopConf: Configuration,
   secMgr: SecurityManager): String = {
 val targetDir = Utils.createTempDir()
+val fileSeparator = Pattern.quote(File.separator)
--- End diff --

Mind elaborating on why this should be quoted?
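
For context, the likely reason is that on Windows `File.separator` is a 
backslash, which is a regex metacharacter, so an unquoted separator is an invalid 
split pattern. An analogous illustration in Python (only an analogy, not the code 
under review; `re.escape` plays the role of `Pattern.quote`):

```python
import re

win_sep = "\\"                     # what File.separator is on Windows
path = "C:\\tmp\\jars\\foo.jar"

try:
    re.split(win_sep, path)        # a lone backslash is an invalid pattern
except re.error as e:
    print("unquoted separator fails:", e)

print(re.split(re.escape(win_sep), path))  # ['C:', 'tmp', 'jars', 'foo.jar']
```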


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22939: [SPARK-25446][R] Add schema_of_json() and schema_of_csv(...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22939
  
Hey @felixcheung, have you found some time to take a look at this, please?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23055
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pyspark.me...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23055
  
Let me get this in in a few days if there are no more comments.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23117: [WIP][SPARK-7721][INFRA] Run and generate test coverage ...

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23117
  
Hey @shaneknapp, have you found some time to take a look at this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23145: [MINOR][Docs] "a R interpreter" -> "an R interpreter"

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23145
  
It's okay, but mind taking another look at the documentation and fixing other 
typos while we are here? I'm pretty sure there are more.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load

2018-11-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23130#discussion_r236233074
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -388,7 +388,7 @@ case class FileSourceScanExec(
 logInfo(s"Planning with ${bucketSpec.numBuckets} buckets")
 val filesGroupedToBuckets =
   selectedPartitions.flatMap { p =>
-p.files.map { f =>
+p.files.filter(_.getLen > 0).map { f =>
--- End diff --

It's a non-critical path in terms of performance. Should be okay.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23113: [SPARK-26019][PYTHON] Fix race condition in accumulators...

2018-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23113
  
Adding @rxin and @gatorsmile fyi. We'll discuss this separately per policy.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23113: [SPARK-26019][PYTHON] Fix race condition in accumulators...

2018-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23113
  
Typo... auth_token ...  Zeppelin supports it BTW.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23129: [MINOR] Update all DOI links to preferred resolver

2018-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23129
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load

2018-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23130#discussion_r236077780
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala ---
@@ -142,4 +143,15 @@ class SaveLoadSuite extends DataSourceTest with 
SharedSQLContext with BeforeAndA
   assert(e.contains(s"Partition column `$unknown` not found in schema 
$schemaCatalog"))
 }
   }
+
+  test("skip empty files in load") {
+withTempDir { dir =>
+  val path = dir.getCanonicalPath
+  Files.write(Paths.get(path, "empty"), Array[Byte]())
--- End diff --

@MaxGekk, not a big deal but let's use `Array.empty[Byte]`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23111: [SPARK-26148][PYTHON][TESTS] Increases default paralleli...

2018-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23111
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23113: [SPARK-26019][PYTHON] Fix race condition in accumulators...

2018-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23113
  
Also, it doesn't look like a race condition or anything now, then. Please fix 
the PR description and JIRA accordingly. It looks potentially confusing.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23113: [SPARK-26019][PYTHON] Fix race condition in accumulators...

2018-11-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23113
  
auto token should be set to use Spark. If that's not set, it's a misuse of 
Spark - we should throw an exception. It's weird that we specifically allow it 
here without the token.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18304: [SPARK-21098] Set lineseparator csv multiline and csv wr...

2018-11-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18304
  
@danielvdende, the newlines are now automatically detected. It should not be an 
issue anymore.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
@MaxGekk, thanks for working on this one.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
Last changes were only doc changes. Let me get this in.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...

2018-11-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/23080
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


