[GitHub] spark issue #21618: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21618
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2953/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21618: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21618
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

2018-09-08 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21618#discussion_r216147915
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -656,6 +656,25 @@ object SQLConf {
   .intConf
   .createWithDefault(1)
 
+  val PARALLEL_GET_GLOBBED_PATH_THRESHOLD =
+buildConf("spark.sql.sources.parallelGetGlobbedPath.threshold")
+  .doc("The maximum number of subfiles or directories allowed after a globbed path " +
+"expansion.")
+  .intConf
+  .checkValue(threshold => threshold >= 0, "The maximum number of subfiles or directories " +
--- End diff --

Maybe we should keep this public? The parallel expansion is only enabled when 
the number of threads is > 0.


---




[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

2018-09-08 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21618#discussion_r216147921
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -724,4 +726,37 @@ object DataSource extends Logging {
  """.stripMargin)
 }
   }
+
+  /**
+   * Return all paths represented by the wildcard string.
+   * This will be done in main thread by default while the value of config
+   * `spark.sql.sources.parallelGetGlobbedPath.numThreads` > 0, a local thread
+   * pool will expand the globbed paths.
--- End diff --

Thanks, done in 1319cd3.
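
For readers following the thread, the behavior the doc comment describes (expand on the calling thread by default, use a local thread pool only when the configured thread count is above zero) can be sketched roughly as below. This is an illustrative Python sketch, not the Scala code in the PR; `expand_globs`, `num_threads`, and the pluggable `expand` argument are hypothetical names chosen for the example.

```python
import glob
from concurrent.futures import ThreadPoolExecutor


def expand_globs(patterns, num_threads=0, expand=glob.glob):
    """Expand each glob pattern; use a thread pool only when num_threads > 0.

    `expand` is a hypothetical hook (defaults to glob.glob) so the listing
    call can be swapped out, e.g. for a remote file system client.
    """
    if num_threads <= 0:
        # Default: expand every pattern on the calling thread.
        expanded = [expand(p) for p in patterns]
    else:
        # Opt-in: a local pool expands the patterns concurrently.
        # pool.map returns results in the order of the input patterns.
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            expanded = list(pool.map(expand, patterns))
    # Flatten while keeping the order of the input patterns.
    return [path for paths in expanded for path in paths]
```

Both branches return the same paths in the same order; only the place where the listing work runs differs.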


---




[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

2018-09-08 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21618#discussion_r216147919
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -724,4 +726,37 @@ object DataSource extends Logging {
  """.stripMargin)
 }
   }
+
+  /**
+   * Return all paths represented by the wildcard string.
+   * This will be done in main thread by default while the value of config
+   * `spark.sql.sources.parallelGetGlobbedPath.numThreads` > 0, a local thread
+   * pool will expand the globbed paths.
+   */
+  private def getGlobbedPaths(
--- End diff --

Thanks, that's more clear, done in 1319cd3.


---




[GitHub] spark issue #21618: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21618
  
**[Test build #95843 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95843/testReport)**
 for PR 21618 at commit 
[`1319cd3`](https://github.com/apache/spark/commit/1319cd38d24c24adccc1c318001814da58a3b22b).


---




[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

2018-09-08 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21618#discussion_r216147889
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -724,4 +726,37 @@ object DataSource extends Logging {
  """.stripMargin)
 }
   }
+
+  /**
+   * Return all paths represented by the wildcard string.
+   * This will be done in main thread by default while the value of config
+   * `spark.sql.sources.parallelGetGlobbedPath.numThreads` > 0, a local thread
+   * pool will expand the globbed paths.
+   */
+  private def getGlobbedPaths(
+  sparkSession: SparkSession,
--- End diff --

Thanks for the advice, done in 1319cd3.


---




[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

2018-09-08 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21618#discussion_r216147887
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1557,6 +1576,15 @@ class SQLConf extends Serializable with Logging {
   def parallelPartitionDiscoveryParallelism: Int =
 getConf(SQLConf.PARALLEL_PARTITION_DISCOVERY_PARALLELISM)
 
+  def parallelGetGlobbedPathThreshold: Int =
+getConf(SQLConf.PARALLEL_GET_GLOBBED_PATH_THRESHOLD)
+
+  def parallelGetGlobbedPathNumThreads: Int =
+getConf(SQLConf.PARALLEL_GET_GLOBBED_PATH_NUM_THREADS)
+
+  def parallelGetGlobbedPathEnabled: Boolean =
--- End diff --

Thanks, done in 1319cd3.


---




[GitHub] spark pull request #22369: [SPARK-25072][DOC] Update migration guide for beh...

2018-09-08 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/22369#discussion_r216147674
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1901,6 +1901,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
 ## Upgrading From Spark SQL 2.3.0 to 2.3.1 and above
 
   - As of version 2.3.1 Arrow functionality, including `pandas_udf` and 
`toPandas()`/`createDataFrame()` with `spark.sql.execution.arrow.enabled` set 
to `True`, has been marked as experimental. These are still evolving and not 
currently recommended for use in production.
+  - In version 2.3.1 and earlier, it is possible for PySpark to create a 
Row object by providing more value than column number through the customized 
Row class. Since Spark 2.3.3, Spark will confirm value length is less or equal 
than column length in PySpark. See 
[SPARK-25072](https://issues.apache.org/jira/browse/SPARK-25072) for details.
--- End diff --

Maybe say `..by providing more values than number of fields through a 
customized Row class. As of Spark 2.3.3, PySpark will raise a ValueError if the 
number of values is more than the number of fields. See...`


---




[GitHub] spark pull request #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.mem...

2018-09-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22298


---




[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...

2018-09-08 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/22298
  
Merged to master (i.e. 3). It's not a bug fix but I _think_ we should 
consider this for backport to 2.4 since it's arguably the second half of a 
feature that's in 2.4, but it doesn't backport cleanly as is, so maybe another 
PR just for the 2.4 branch.


---




[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row

2018-09-08 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/22140
  
@gatorsmile it seemed like a straightforward bug to me. Rows with extra 
values lead to incorrect output and exceptions when used in `DataFrames`, so it 
did not seem like there was any possibility this would break existing code. For 
example:

```
In [1]: MyRow = Row('a','b')

In [2]: print(MyRow(1,2,3))
Row(a=1, b=2)

In [3]: spark.createDataFrame([MyRow(1,2,3)])
Out[3]: DataFrame[a: bigint, b: bigint]

In [4]: spark.createDataFrame([MyRow(1,2,3)]).show()
18/09/08 21:55:48 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 2 fields are required while 3 values are provided.

In [5]: spark.createDataFrame([MyRow(1,2,3)], schema="x: int, y: int").show()

ValueError: Length of object (3) does not match with length of fields (2)
```
Maybe I was too hasty with backporting and this needed some discussion. Do 
you know of a use case that this change would break?
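
To make the intended behavior concrete, here is a minimal pure-Python sketch of a row factory that rejects extra values up front instead of failing later. It is not PySpark's `Row` implementation; `make_row_class` and the error text are hypothetical and only illustrate the kind of length check being discussed.

```python
def make_row_class(*fields):
    """Return a constructor for rows with the given field names.

    Hypothetical stand-in for a customized Row class; rows are plain dicts.
    """
    def build(*values):
        # Reject extra values at construction time rather than silently
        # dropping them or failing later when the row is used.
        if len(values) > len(fields):
            raise ValueError(
                "expected at most %d values for fields %s, got %d"
                % (len(fields), fields, len(values)))
        # Fewer values than fields is still accepted in this sketch.
        return dict(zip(fields, values))
    return build
```

With this check, `make_row_class('a', 'b')(1, 2, 3)` fails immediately with a `ValueError`, which mirrors the error already shown for the explicit-schema case above.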





---




[GitHub] spark issue #22370: don't link to deprecated function

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22370
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #22370: don't link to deprecated function

2018-09-08 Thread MichaelChirico
GitHub user MichaelChirico opened a pull request:

https://github.com/apache/spark/pull/22370

don't link to deprecated function

Seems misleading to (without qualification) link to a deprecated function

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MichaelChirico/spark patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22370.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22370


commit e8b0d6333a1c09787e1c37a6f91eb895dee8fa72
Author: Michael Chirico 
Date:   2018-09-09T05:12:27Z

don't link to deprecated function

Seems misleading to (without qualification) link to a deprecated function




---




[GitHub] spark issue #22369: [SPARK-25072][DOC] Update migration guide for behavior c...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22369
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95842/
Test PASSed.


---




[GitHub] spark issue #22369: [SPARK-25072][DOC] Update migration guide for behavior c...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22369
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22369: [SPARK-25072][DOC] Update migration guide for behavior c...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22369
  
**[Test build #95842 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95842/testReport)**
 for PR 22369 at commit 
[`d257a38`](https://github.com/apache/spark/commit/d257a38c647b45a9e83a2bdbbd2814f1b3fc5d56).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22369: [SPARK-25072][DOC] Update migration guide for behavior c...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22369
  
**[Test build #95842 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95842/testReport)**
 for PR 22369 at commit 
[`d257a38`](https://github.com/apache/spark/commit/d257a38c647b45a9e83a2bdbbd2814f1b3fc5d56).


---




[GitHub] spark issue #22368: [SPARK-25368][SQL] Incorrect predicate pushdown returns ...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22368
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2952/
Test PASSed.


---




[GitHub] spark issue #22369: [SPARK-25072][DOC] Update migration guide for behavior c...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22369
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22369: [SPARK-25072][DOC] Update migration guide for behavior c...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22369
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2951/
Test PASSed.


---




[GitHub] spark issue #22368: [SPARK-25368][SQL] Incorrect predicate pushdown returns ...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22368
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row

2018-09-08 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/22140
  
```
@xuanyuanking Could you please update the document?
```
#22369. Thanks for the reminder, I'll pay more attention to this in future work.


---




[GitHub] spark pull request #22369: [SPARK-25072][DOC] Update migration guide for beh...

2018-09-08 Thread xuanyuanking
GitHub user xuanyuanking opened a pull request:

https://github.com/apache/spark/pull/22369

[SPARK-25072][DOC] Update migration guide for behavior change

## What changes were proposed in this pull request?

Update the document for the behavior change in PySpark Row creation.

## How was this patch tested?

Existing UT.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xuanyuanking/spark SPARK-25072-DOC

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22369.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22369


commit d257a38c647b45a9e83a2bdbbd2814f1b3fc5d56
Author: Yuanjian Li 
Date:   2018-09-09T04:26:23Z

Update doc for SPARK-25072




---




[GitHub] spark issue #22368: [SPARK-25368][SQL] Incorrect predicate pushdown returns ...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22368
  
**[Test build #95841 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95841/testReport)**
 for PR 22368 at commit 
[`865e0af`](https://github.com/apache/spark/commit/865e0af572edad7fd775c25e317055ffa0df2a08).


---




[GitHub] spark pull request #22368: [SPARK-25368][SQL] Incorrect predicate pushdown r...

2018-09-08 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/22368

[SPARK-25368][SQL] Incorrect predicate pushdown returns wrong result

## What changes were proposed in this pull request?
How to reproduce:
```scala
val df1 = spark.createDataFrame(Seq(
   (1, 1)
)).toDF("a", "b").withColumn("c", lit(null).cast("int"))
val df2 = df1.union(df1).withColumn("d", spark_partition_id).filter($"c".isNotNull)
df2.show

// +---+---+----+---+
// |  a|  b|   c|  d|
// +---+---+----+---+
// |  1|  1|null|  0|
// |  1|  1|null|  1|
// +---+---+----+---+
```
`filter($"c".isNotNull)` was changed to `(null <=> c#10)` before 
https://github.com/apache/spark/pull/19201, but it has been changed to 
`(c#10 = null)` since https://github.com/apache/spark/pull/20155. This PR 
reverts it to `(null <=> c#10)` to fix this issue.
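
The difference between the two predicate forms comes down to SQL NULL semantics: `=` yields an unknown result when either side is NULL, while the null-safe `<=>` always yields true or false. A small Python model of that, with `None` standing in for NULL, is sketched below. The helper names are hypothetical and this is not Catalyst code.

```python
def sql_eq(left, right):
    """SQL `=`: the result is unknown (None) if either side is NULL."""
    if left is None or right is None:
        return None
    return left == right


def sql_null_safe_eq(left, right):
    """SQL `<=>`: always True or False, and NULL <=> NULL is True."""
    if left is None or right is None:
        return left is None and right is None
    return left == right


def sql_filter(rows, predicate):
    """Keep rows whose predicate is exactly True; unknown does not match."""
    return [row for row in rows if predicate(row) is True]
```

In this model, filtering on `sql_eq(row["c"], None)` never keeps any row, whereas `sql_null_safe_eq(row["c"], None)` keeps exactly the rows where `c` is NULL, so the two forms are not interchangeable.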

## How was this patch tested?

unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-25368

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22368.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22368


commit 86b9b7892c94be68145453f9519e35a3574fe568
Author: Yuming Wang 
Date:   2018-09-09T03:46:18Z

Fix SPARK-25368

commit 865e0af572edad7fd775c25e317055ffa0df2a08
Author: Yuming Wang 
Date:   2018-09-09T04:22:29Z

Fix InferFiltersFromConstraintsSuite test error




---




[GitHub] spark issue #22010: [SPARK-21436][CORE] Take advantage of known partitioner ...

2018-09-08 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/22010
  
Actually @holdenk is this change even correct? RDD.distinct is not key 
based. It is based on the value of the elements in RDD. Even if `numPartitions 
== partitions.length`, it doesn't mean the RDD is hash partitioned this way.

Consider this RDD:

Partition 1: 1, 2, 3
Partition 2: 1, 2, 3

rdd.distinct() should return 1, 2, 3

with your change it'd still return 1, 2, 3, 1, 2, 3.
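
The counterexample can be reproduced with plain Python lists standing in for partitions. This is only a sketch of the argument, not Spark code; both function names are hypothetical.

```python
def distinct_per_partition(partitions):
    """Deduplicate inside each partition only, with no shuffle between them."""
    out = []
    for part in partitions:
        seen = set()
        for x in part:
            if x not in seen:
                seen.add(x)
                out.append(x)
    return out


def distinct_global(partitions):
    """Deduplicate across all partitions, which is what distinct() must do."""
    seen = set()
    out = []
    for part in partitions:
        for x in part:
            if x not in seen:
                seen.add(x)
                out.append(x)
    return out
```

For `[[1, 2, 3], [1, 2, 3]]`, the per-partition version still contains every element twice, while the global version contains each element once.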



---




[GitHub] spark pull request #22010: [SPARK-21436][CORE] Take advantage of known parti...

2018-09-08 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22010#discussion_r216145892
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -396,7 +396,26 @@ abstract class RDD[T: ClassTag](
* Return a new RDD containing the distinct elements in this RDD.
*/
   def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): 
RDD[T] = withScope {
-map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
+partitioner match {
--- End diff --

you can just create a new MapPartitionsRDD with preservesPartitioning set 
to true, can't you?
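
As a rough illustration of when a partition-preserving, per-partition pass is enough: if the elements are already partitioned by a hash of the element value itself, equal elements are guaranteed to share a partition, so deduplicating inside each partition gives a globally distinct result without moving data. The Python sketch below models that assumption; `hash_partition` and `dedupe_within_partitions` are hypothetical names, not Spark APIs.

```python
def hash_partition(values, num_partitions):
    """Place each value in the partition chosen by its own hash."""
    partitions = [[] for _ in range(num_partitions)]
    for v in values:
        partitions[hash(v) % num_partitions].append(v)
    return partitions


def dedupe_within_partitions(partitions):
    """Drop duplicates inside each partition, leaving the layout unchanged."""
    # dict.fromkeys keeps the first occurrence of each element, in order.
    return [list(dict.fromkeys(part)) for part in partitions]
```

The guarantee holds only under the stated assumption; for arbitrarily partitioned data the per-partition pass can leave duplicates behind.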


---




[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...

2018-09-08 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22366
  
Would it be better to add a description to `docs/sql-programming-guide.md`?


---




[GitHub] spark issue #22360: [MINOR][ML] Remove `BisectingKMeansModel.setDistanceMeas...

2018-09-08 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/22360
  
Do we need to set `distanceMeasure` again for the parent model?
When the parent model is created, it will use the same `distanceMeasure` as the 
one used in training.


---




[GitHub] spark issue #22367: [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty stri...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22367
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95840/
Test FAILed.


---




[GitHub] spark issue #22367: [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty stri...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22367
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22367: [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty stri...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22367
  
**[Test build #95840 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95840/testReport)**
 for PR 22367 at commit 
[`7eac385`](https://github.com/apache/spark/commit/7eac385568c78735bb7743cfcfa234c4bea97fb0).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...

2018-09-08 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/22357
  
Thanks! @mallman 

For the first query, I think the query plan produced by your WIP patch is 
not correct. We don't need to read the `company:struct` from `employer:struct`.

For the second, your WIP patch doesn't push down the `IsNotNull(employer)` 
predicate into the `FileScan` node.

Those are the important differences I have noticed so far.


---




[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22366
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95839/
Test PASSed.


---




[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22366
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22366
  
**[Test build #95839 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95839/testReport)**
 for PR 22366 at commit 
[`f950845`](https://github.com/apache/spark/commit/f9508458d1963e83c7fc23106dc4cb2f1f491524).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22367: [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty stri...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22367
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

2018-09-08 Thread MaxGekk
Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/22234
  
@gatorsmile @HyukjinKwon Please, take a look at #22367 


---




[GitHub] spark issue #22367: [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty stri...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22367
  
**[Test build #95840 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95840/testReport)**
 for PR 22367 at commit 
[`7eac385`](https://github.com/apache/spark/commit/7eac385568c78735bb7743cfcfa234c4bea97fb0).


---




[GitHub] spark pull request #22367: [SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix emp...

2018-09-08 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/22367

[SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty string being parsed as 
null when nullValue is set.

## What changes were proposed in this pull request?

In this PR, I propose a new CSV option `emptyValue` and an update to the SQL 
Migration Guide that describes how to restore the previous behavior, in which 
empty strings were not written at all. Since Spark 2.4, empty strings are saved 
as `""` to distinguish them from saved `null`s.
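
The distinction can be sketched with a small Python formatter. This is a simplified illustration, not the Spark CSV writer: `format_csv_field` and `format_csv_row` are hypothetical helpers, quoting and escaping of other values is ignored, and it assumes that setting the empty-value option to an unquoted empty string corresponds to the previous behavior.

```python
def format_csv_field(value, empty_value='""', null_value=""):
    """Render one field, keeping empty strings distinct from nulls."""
    if value is None:
        return null_value
    if value == "":
        return empty_value
    return str(value)


def format_csv_row(values, **options):
    """Join rendered fields with commas (no quoting or escaping here)."""
    return ",".join(format_csv_field(v, **options) for v in values)
```

With the defaults, the row `["a", "", None]` is rendered as `a,"",`, so the empty string and the null stay distinguishable; with `empty_value=""` it is rendered as `a,,` and the two can no longer be told apart.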

## How was this patch tested?

It was tested by `CSVSuite` and the new tests added in PR #22234.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 csv-empty-value-2.4

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22367.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22367


commit 465ed7a6011bd0437c7f88cb4c18ecea68cb60ac
Author: Mario Molina 
Date:   2018-08-25T17:42:03Z

Configurable empty values when reading/writing CSV files

commit 48e143d43a876afc4f0099bf7079130d74ebe855
Author: Mario Molina 
Date:   2018-08-26T23:29:32Z

Adding tests

commit 70e217146962186a391227f1417cf79c5e81c380
Author: Mario Molina 
Date:   2018-08-26T23:33:55Z

Changing emptyValue order arg in streaming.py

commit 8665f93c442915dc23a40ffb3c958a097dec34c5
Author: Mario Molina 
Date:   2018-08-27T02:03:41Z

Changing emptyValue order arg in set_opts

commit 867c6de34673bbc877e0e26e8c0d662e038e2946
Author: Maxim Gekk 
Date:   2018-09-08T20:40:41Z

Added comments for parameters

commit e0cb879f3bc28f66e19d049ed0ee6dc33fc5922c
Author: Maxim Gekk 
Date:   2018-09-08T21:02:21Z

Updating the migration guide

commit e23098c5a6322ab3cff851b37889163c9bd09491
Author: Mario Molina 
Date:   2018-08-26T23:28:34Z

Changing order in args for emptyValue

commit 732ec78c8d376bad0cc8897b1da48a56448590fb
Author: Maxim Gekk 
Date:   2018-09-08T21:11:56Z

Revert some checking

commit 7eac385568c78735bb7743cfcfa234c4bea97fb0
Author: Maxim Gekk 
Date:   2018-09-08T21:14:13Z

Revert unneeded changes




---




[GitHub] spark issue #22353: [SPARK-25357][SQL] Abbreviated simpleString in DataSourc...

2018-09-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22353
  
@LantaoJin. Please check the following example in the Spark UI, specifically 
the hover text on `Scan parquet`.
```scala
scala> spark.range(2).repartition(1).write.mode("overwrite").parquet("/tmp/1")
scala> spark.read.parquet("/tmp/1/*").count
```


---




[GitHub] spark issue #22363: [SPARK-25375][SQL][TEST] Reenable qualified perm. functi...

2018-09-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22363
  
Thank you, @gatorsmile !


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22365
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22365
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95836/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22365
  
**[Test build #95836 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95836/testReport)**
 for PR 22365 at commit 
[`2845bca`](https://github.com/apache/spark/commit/2845bca09797a34e930e6aca42f198ec5cbd95e3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22337
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95833/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22337
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22337
  
**[Test build #95833 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95833/testReport)**
 for PR 22337 at commit 
[`309e265`](https://github.com/apache/spark/commit/309e265f64a856f46c10d5310a07417e0abd0dab).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22366
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22366
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNu...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22366
  
**[Test build #95839 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95839/testReport)**
 for PR 22366 at commit 
[`f950845`](https://github.com/apache/spark/commit/f9508458d1963e83c7fc23106dc4cb2f1f491524).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22366: [SPARK-25384][SQL] Removing of spark.sql.fromJson...

2018-09-08 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/22366

[SPARK-25384][SQL] Removing of spark.sql.fromJsonForceNullableSchema

## What changes were proposed in this pull request?

In the PR, I propose to remove the `spark.sql.fromJsonForceNullableSchema` 
flag since disabling it can cause corrupted output. The flag was introduced 
only for backward compatibility in minor versions. The PR targets Spark 3.0 in 
which the flag can be removed.

## How was this patch tested?

It was tested by `JsonExpressionsSuite`, `JsonFunctionsSuite` and 
`JsonSuite`


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 
json-remove-non-nullable-schema

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22366.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22366


commit 0de349bc59143eeb26cd422fdfe945037f8353ac
Author: Maxim Gekk 
Date:   2018-09-08T19:33:20Z

Removing the spark.sql.fromJsonForceNullableSchema flag

commit f9508458d1963e83c7fc23106dc4cb2f1f491524
Author: Maxim Gekk 
Date:   2018-09-08T19:46:05Z

Bug fix - missing field must not nullable




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22360: [MINOR][ML] Remove `BisectingKMeansModel.setDistanceMeas...

2018-09-08 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/22360
  
Yes, I think the point here is that the parameter is part of 
`BisectingKMeansParams`, which defines the getter method as final. I think 
`KMeans` has the same issue. We can probably remove this and set the 
distanceMeasure from the parent model at creation time.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...

2018-09-08 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/22364
  
cc @gatorsmile @maropu 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21273: [SPARK-17916][SQL] Fix empty string being parsed ...

2018-09-08 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21273#discussion_r216138533
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
 ---
@@ -164,7 +164,7 @@ class CSVOptions(
 
writerSettings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceFlagInWrite)
 
writerSettings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceFlagInWrite)
 writerSettings.setNullValue(nullValue)
-writerSettings.setEmptyValue(nullValue)
+writerSettings.setEmptyValue("\"\"")
--- End diff --

This needs an update in the migration guide. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22234: [SPARK-25241][SQL] Configurable empty values when readin...

2018-09-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22234
  
@MaxGekk Could you take this PR over? I think we need to merge this to 
Spark 2.4. Users can restore the previous behavior with the new `emptyValue` 
option, if needed. Please also update the migration guide to describe the 
behavior change and explain how to set `emptyValue`. 
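
To make the behavior change concrete, here is a minimal plain-Scala sketch of 
how a writer could render a single string cell. `EmptyValueSketch` and 
`renderCell` are hypothetical names used only for illustration; this is not the 
Spark or univocity code, and the empty-string `nullValue` used below is an 
assumption.

```scala
// Hypothetical model only: it shows how null and empty-string cells could be
// rendered differently once `emptyValue` is separate from `nullValue`.
object EmptyValueSketch {
  def renderCell(value: String, nullValue: String, emptyValue: String): String =
    if (value == null) nullValue        // nulls use the nullValue setting
    else if (value.isEmpty) emptyValue  // empty strings use the emptyValue setting
    else value                          // everything else is written as-is

  def main(args: Array[String]): Unit = {
    val row = Seq("a", "", null)
    // Behavior after the change: empty strings are written as a quoted empty string.
    println(row.map(renderCell(_, nullValue = "", emptyValue = "\"\"")).mkString(","))
    // Previous behavior: empty strings are written the same way as nulls.
    println(row.map(renderCell(_, nullValue = "", emptyValue = "")).mkString(","))
  }
}
```

In Spark itself, the equivalent would presumably be passing 
`.option("emptyValue", "")` to the CSV writer, assuming the option keeps the 
name proposed in this PR.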


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22337
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22337
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95832/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22337
  
**[Test build #95832 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95832/testReport)**
 for PR 22337 at commit 
[`a314776`](https://github.com/apache/spark/commit/a3147760b025e6592dd80d858ae4757bd907a72c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row

2018-09-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22140
  
@BryanCutler What is the reason to backport this PR? This sounds like a 
behavior change. 

@xuanyuanking Could you please update the document?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17899: [SPARK-20636] Add new optimization rule to transp...

2018-09-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17899


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19045
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19045
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95838/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19045
  
**[Test build #95838 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95838/testReport)**
 for PR 19045 at commit 
[`5877c16`](https://github.com/apache/spark/commit/5877c16e20559122847ed5ea21c74214fc024c9d).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19045
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95837/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19045
  
**[Test build #95837 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95837/testReport)**
 for PR 19045 at commit 
[`0ba0ca5`](https://github.com/apache/spark/commit/0ba0ca5551d106cd621097b510fa8fb373f171f9).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19045
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19045
  
**[Test build #95838 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95838/testReport)**
 for PR 19045 at commit 
[`5877c16`](https://github.com/apache/spark/commit/5877c16e20559122847ed5ea21c74214fc024c9d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19045
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2950/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19045
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19045
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19045
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2949/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/19045
  
cc @ifilonenko it's super WIP but since you joined me on the stream where I 
was working on reviving this I thought it would be good to get your early 
comments (especially if you have any suggestions around making effective 
integration tests for this).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19045: [WIP][SPARK-20628][CORE] Keep track of nodes (/ spot ins...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19045
  
**[Test build #95837 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95837/testReport)**
 for PR 19045 at commit 
[`0ba0ca5`](https://github.com/apache/spark/commit/0ba0ca5551d106cd621097b510fa8fb373f171f9).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22010: [SPARK-21436][CORE] Take advantage of known partitioner ...

2018-09-08 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/22010
  
Hey @rxin & @cloud-fan I'd really appreciate your input on the tricks I did 
to keep the partitioner information present -- is this the right approach?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...

2018-09-08 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/21654
  
cc @rgbkrk 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22363: [SPARK-25375][SQL][TEST] Reenable qualified perm....

2018-09-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22363


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22365
  
**[Test build #95836 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95836/testReport)**
 for PR 22365 at commit 
[`2845bca`](https://github.com/apache/spark/commit/2845bca09797a34e930e6aca42f198ec5cbd95e3).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22365
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22365
  
**[Test build #95835 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95835/testReport)**
 for PR 22365 at commit 
[`7e77941`](https://github.com/apache/spark/commit/7e7794153924b824dc5fe5f05375c8b9950ef539).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22365
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95835/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22365
  
**[Test build #95835 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95835/testReport)**
 for PR 22365 at commit 
[`7e77941`](https://github.com/apache/spark/commit/7e7794153924b824dc5fe5f05375c8b9950ef539).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22365
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22365
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95834/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22365
  
**[Test build #95834 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95834/testReport)**
 for PR 22365 at commit 
[`e2e6149`](https://github.com/apache/spark/commit/e2e61498c47da9d7b36d2e0727ce8642d5d71472).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22365
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22365
  
**[Test build #95834 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95834/testReport)**
 for PR 22365 at commit 
[`e2e6149`](https://github.com/apache/spark/commit/e2e61498c47da9d7b36d2e0727ce8642d5d71472).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22365: [SPARK-25381][SQL] Stratified sampling by Column argumen...

2018-09-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22365
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22365: [SPARK-25381][SQL] Stratified sampling by Column ...

2018-09-08 Thread MaxGekk
GitHub user MaxGekk opened a pull request:

https://github.com/apache/spark/pull/22365

[SPARK-25381][SQL] Stratified sampling by Column argument

## What changes were proposed in this pull request?

In the PR, I propose to add an overloaded method for `sampleBy` whose 
first argument is of the `Column` type. This allows sampling by any complex 
column as well as by multiple columns. For example:

```Scala
spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
  ("Alice", 10))).toDF("name", "age")
  .stat
  .sampleBy(struct($"name", $"age"), Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0), 36L)
  .show()

+-----+---+
| name|age|
+-----+---+
| Nico|  8|
|Alice| 10|
+-----+---+
```
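
As a rough illustration of what stratified sampling by a composite key means, 
here is a plain-Scala sketch that runs without Spark. `SampleBySketch` and its 
`sampleBy` helper are hypothetical and are not the implementation in this PR; 
treating a missing stratum as fraction 0.0 is an assumption of the sketch.

```scala
import scala.util.Random

// Hypothetical plain-Scala model of stratified sampling; not Spark's implementation.
object SampleBySketch {
  // Keeps each row with the probability registered for its stratum key.
  // Rows whose key has no entry in `fractions` are treated as fraction 0.0.
  def sampleBy[T, K](rows: Seq[T], key: T => K, fractions: Map[K, Double], seed: Long): Seq[T] = {
    val rng = new Random(seed)
    rows.filter(row => rng.nextDouble() < fractions.getOrElse(key(row), 0.0))
  }

  def main(args: Array[String]): Unit = {
    val people = Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17), ("Alice", 10))
    // The stratum key is the whole (name, age) pair, mirroring struct($"name", $"age").
    val sampled = sampleBy[(String, Int), (String, Int)](
      people, identity, Map(("Alice", 10) -> 0.3, ("Nico", 8) -> 1.0), seed = 36L)
    sampled.foreach(println)
  }
}
```

Because the key is an arbitrary function of the row, the same helper covers 
both a single column and a combination of columns. The rows it keeps depend on 
the random generator, so they will not necessarily match the Spark output 
above.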

## How was this patch tested?

Added a new test for sampling by multiple columns in Scala, and tests in 
Java and Python to check that `sampleBy` is able to sample by a `Column` type 
argument.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MaxGekk/spark-1 sample-by-column

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22365.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22365


commit 3832f2137676a76d6d06a0bb6dbcedcba801910b
Author: Maxim Gekk 
Date:   2018-09-08T13:30:49Z

Adding overloaded sampleBy with Column type

commit 5cd3229ce8bfe894dac8ebc097109da237d95401
Author: Maxim Gekk 
Date:   2018-09-08T13:39:30Z

Adding overloaded sampleBy with Column type for Java

commit e2e61498c47da9d7b36d2e0727ce8642d5d71472
Author: Maxim Gekk 
Date:   2018-09-08T14:56:36Z

Adding overloaded sampleBy with Column type for Python




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22349: [SPARK-25345][ML] Deprecate public APIs from Imag...

2018-09-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22349


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22349: [SPARK-25345][ML] Deprecate public APIs from ImageSchema

2018-09-08 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22349
  
LGTM. Merged into master and branch-2.4. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22353: [SPARK-25357][SQL] Abbreviated simpleString in Da...

2018-09-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22353#discussion_r216134032
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -54,7 +54,7 @@ trait DataSourceScanExec extends LeafExecNode with 
CodegenSupport {
   override def simpleString: String = {
 val metadataEntries = metadata.toSeq.sorted.map {
   case (key, value) =>
-key + ": " + StringUtils.abbreviate(redact(value), 100)
--- End diff --

This seems to cause a regression in the Spark Web UI. Could you check that, 
@LantaoJin ?

In fact, the abbreviation was introduced intentionally for the UI over two 
years ago, in Spark 2.0, via [[SPARK-14476][SQL] Improve the physical plan 
visualization by adding meta info like table name and file path for data 
source](https://github.com/apache/spark/pull/12947). At the least, we should 
update the PR and JIRA descriptions.
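
For context, a minimal plain-Scala sketch of the kind of truncation the 
removed line performs on each metadata value. `AbbreviateSketch` is a 
hypothetical stand-in, not the commons-lang `StringUtils.abbreviate` 
implementation, and the sample `Location` value is made up for illustration.

```scala
// Hypothetical stand-in for the abbreviation step applied to each metadata value.
object AbbreviateSketch {
  // Shortens `s` to at most `maxWidth` characters, ending with "..." when truncated.
  def abbreviate(s: String, maxWidth: Int): String = {
    require(maxWidth >= 4, "maxWidth must leave room for one character plus the ellipsis")
    if (s.length <= maxWidth) s else s.take(maxWidth - 3) + "..."
  }

  def main(args: Array[String]): Unit = {
    // A long, made-up file listing similar to what a scan node might carry.
    val files = Seq.tabulate(50)(i => s"file:/tmp/1/part-$i.parquet").mkString(", ")
    val location = "InMemoryFileIndex[" + files + "]"
    // The key stays intact; only the value is capped at 100 characters.
    println("Location: " + abbreviate(location, 100))
  }
}
```

Without a cap like this, a scan over many files would put the full file list 
into the node label, which is the UI concern raised above.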


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21618: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2018-09-08 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/21618
  
@kiszk @maropu Many thanks for your review and advice! I'll address the 
comments and resolve the conflicts ASAP.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

2018-09-08 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21618#discussion_r216133261
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/SparkHadoopUtilSuite.scala ---
@@ -77,6 +80,51 @@ class SparkHadoopUtilSuite extends SparkFunSuite with 
Matchers {
 })
   }
 
+  test("test expanding glob path") {
--- End diff --

```
IIUC, the new feature is disabled by default since 
spark.sql.sources.parallelGetGlobbedPath.numThreads is 0.
```
Yes that's right.

```
I am afraid these test cases are executed only with the new feature 
disabled.
```
These mainly test the correctness of `sparkHadoopUtil.expandGlobPath`, so 
it may be worth keeping them.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...

2018-09-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22359
  
Since this is related to Parquet behavior only, could we add `in Parquet` at 
the end of the title?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22363: [SPARK-25375][SQL][TEST] Reenable qualified perm. functi...

2018-09-08 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22363
  
cc @cloud-fan and @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


