[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22213
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95820/
Test FAILed.


---




[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22213
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22213
  
**[Test build #95820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95820/testReport)** for PR 22213 at commit [`603fdc9`](https://github.com/apache/spark/commit/603fdc9c148e2a36393140054c559f5a0a380e5d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22311: [WIP][SPARK-25305][SQL] Respect attribute name in Collap...

2018-09-07 Thread gengliangwang
Github user gengliangwang commented on the issue:

https://github.com/apache/spark/pull/22311
  
@maropu Let's keep the change of #22320 for a while. Thanks.


---




[GitHub] spark pull request #21649: [SPARK-23648][R][SQL]Adds more types for hint in ...

2018-09-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21649#discussion_r216122842
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3905,6 +3905,16 @@ setMethod("rollup",
 groupedData(sgd)
   })
 
+isTypeAllowedForSqlHint <- function(x) {
+  if (is.character(x) || is.numeric(x)) {
+    TRUE
+  } else if (is.list(x)) {
+    all(sapply(x, (function(y) is.character(y) || is.numeric(y))))
--- End diff --

also, if it is a `list`, could we clarify whether it is supposed to work with
multiple hints of different types in that list (this might be "unique" to R)?
For example:

```
> x <- list("a", 3)
> all(sapply(x, function(y) { is.character(y) || is.numeric(y) } ))
[1] TRUE

> x <- list("a", NA)
> all(sapply(x, function(y) { is.character(y) || is.numeric(y) } ))
[1] FALSE
```
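
For comparison, a minimal sketch of the same check written with `vapply`, which pins the result type; this is illustrative only, not the PR's code:

```r
# Illustrative variant using vapply (hypothetical, not from the PR):
isTypeAllowedForSqlHint <- function(x) {
  if (is.character(x) || is.numeric(x)) {
    TRUE
  } else if (is.list(x)) {
    # vapply(..., logical(1)) guarantees a logical result, even for empty lists
    all(vapply(x, function(y) is.character(y) || is.numeric(y), logical(1)))
  } else {
    FALSE
  }
}
```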


---




[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22359
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95818/
Test PASSed.


---




[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22359
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22359
  
**[Test build #95818 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95818/testReport)** for PR 22359 at commit [`71f382b`](https://github.com/apache/spark/commit/71f382bd7c9fdfda38b1bc8063b2a55dd56c00b4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #21649: [SPARK-23648][R][SQL]Adds more types for hint in ...

2018-09-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/21649#discussion_r216122804
  
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3905,6 +3905,16 @@ setMethod("rollup",
 groupedData(sgd)
   })
 
+isTypeAllowedForSqlHint <- function(x) {
+  if (is.character(x) || is.numeric(x)) {
+    TRUE
+  } else if (is.list(x)) {
+    all(sapply(x, (function(y) is.character(y) || is.numeric(y))))
--- End diff --

I looked into this more deeply, and this style seems a bit odd. As a nit, I
think this should be

`all(sapply(x, function(y) { is.character(y) || is.numeric(y) } ))` 

I think it's more readable this way. Also see L2458 for an example:
https://github.com/apache/spark/blob/aec391c9dcb6362874736e663d435f9dd8400125/R/pkg/R/DataFrame.R#L2458


---




[GitHub] spark issue #22144: [SPARK-24935][SQL] : Problem with Executing Hive UDF's f...

2018-09-07 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/22144
  
hey, this looks important, could someone review this?


---




[GitHub] spark pull request #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize...

2018-09-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/22362#discussion_r216122659
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -199,8 +199,8 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
     numExecutors = Option(numExecutors)
       .getOrElse(sparkProperties.get("spark.executor.instances").orNull)
     queue = Option(queue).orElse(sparkProperties.get("spark.yarn.queue")).orNull
-    keytab = Option(keytab).orElse(sparkProperties.get("spark.yarn.keytab")).orNull
-    principal = Option(principal).orElse(sparkProperties.get("spark.yarn.principal")).orNull
+    keytab = Option(keytab).orElse(sparkProperties.get("spark.kerberos.keytab")).orNull
--- End diff --

Agreed; shouldn't the "old" configs still work? `spark.yarn.keytab`, etc. One possible shape is sketched below.
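
For illustration, a sketch of a backward-compatible fallback chain (hypothetical; the PR may handle deprecation differently):

```scala
// Prefer the new key, but keep honoring the deprecated YARN-specific one.
keytab = Option(keytab)
  .orElse(sparkProperties.get("spark.kerberos.keytab"))
  .orElse(sparkProperties.get("spark.yarn.keytab")) // deprecated alias
  .orNull
```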


---




[GitHub] spark pull request #22227: [SPARK-25202] [SQL] Implements split with limit s...

2018-09-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/7#discussion_r216122621
  
--- Diff: R/pkg/R/functions.R ---
@@ -3404,19 +3404,24 @@ setMethod("collect_set",
 #' Equivalent to \code{split} SQL function.
 #'
 #' @rdname column_string_functions
+#' @param limit determines the size of the returned array. If `limit` is positive,
+#'    size of the array will be at most `limit`. If `limit` is negative, the
--- End diff --

you can't use backticks in R docs
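
The usual Rd/roxygen markup for inline code is `\code{}`, for example:

```r
#' @param limit determines the size of the returned array. If \code{limit} is positive,
#'              the size of the array will be at most \code{limit}.
```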


---




[GitHub] spark pull request #19773: [SPARK-22546][SQL] Supporting for changing column...

2018-09-07 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/19773#discussion_r216122599
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -350,16 +366,11 @@ case class AlterTableChangeColumnCommand(
         s"${schema.fieldNames.mkString("[`", "`, `", "`]")}"))
   }
 
-  // Add the comment to a column, if comment is empty, return the original column.
-  private def addComment(column: StructField, comment: Option[String]): StructField = {
-    comment.map(column.withComment(_)).getOrElse(column)
-  }
-
--- End diff --

Thanks for your question; that's also what I considered while doing the compatibility check. Hive does this column type change work in [HiveAlterHandler](https://github.com/apache/hive/blob/3287a097e31063cc805ca55c2ca7defffe761b6f/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java#L175) and the detailed compatibility check is in [ColumnType](https://github.com/apache/hive/blob/3287a097e31063cc805ca55c2ca7defffe761b6f/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ColumnType.java#L206). As you can see, the ColumnType check actually uses the `canCast` semantics to judge compatibility.


---




[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-09-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22262
  
The following is the sequence.

```scala
scala> sql("insert overwrite local directory '/tmp/parquet' stored as 
parquet select 1 id, 2 id")
```

```
$ parquet-tools schema /tmp/parquet
message hive_schema {
  optional int32 id;
  optional int32 id;
}
```

```scala
scala> sql("create table parquet(id int) USING parquet LOCATION 
'/tmp/parquet'")
res3: org.apache.spark.sql.DataFrame = []

scala> sql("select * from parquet")
res4: org.apache.spark.sql.DataFrame = [id: int]

scala> sql("select * from parquet").show
18/09/07 23:31:03 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.RuntimeException: [id] INT32 was added twice
```


---




[GitHub] spark issue #22349: [SPARK-25345][ML] Deprecate public APIs from ImageSchema

2018-09-07 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22349
  
@WeichenXu123 Could you address the comments?


---




[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...

2018-09-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22359
  
Oh, it seems to be the Parquet behavior ever since this command was introduced in 2.3.0. I was confused because it's different from ORC.


---




[GitHub] spark pull request #22353: [SPARK-25357][SQL] Abbreviated simpleString in Da...

2018-09-07 Thread LantaoJin
Github user LantaoJin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22353#discussion_r216122293
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -54,7 +54,7 @@ trait DataSourceScanExec extends LeafExecNode with CodegenSupport {
   override def simpleString: String = {
     val metadataEntries = metadata.toSeq.sorted.map {
       case (key, value) =>
-        key + ": " + StringUtils.abbreviate(redact(value), 100)
--- End diff --

It’s not for debugging; it’s for offline analysis based on the event log.


---




[GitHub] spark pull request #22353: [SPARK-25357][SQL] Abbreviated simpleString in Da...

2018-09-07 Thread LantaoJin
Github user LantaoJin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22353#discussion_r216122273
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -54,7 +54,7 @@ trait DataSourceScanExec extends LeafExecNode with CodegenSupport {
   override def simpleString: String = {
     val metadataEntries = metadata.toSeq.sorted.map {
       case (key, value) =>
-        key + ": " + StringUtils.abbreviate(redact(value), 100)
--- End diff --

Spark users don’t care about it, but platform teams do. The purpose is to add back the information that has been missing from the event log since PR #18600.


---




[GitHub] spark issue #21791: [SPARK-24925][SQL] input bytesRead metrics fluctuate fro...

2018-09-07 Thread yucai
Github user yucai commented on the issue:

https://github.com/apache/spark/pull/21791
  
@kiszk I see #22324 has been resolved, which is actually one of this PR's dependencies (see https://issues.apache.org/jira/browse/SPARK-24925?focusedCommentId=16556818&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16556818). I was stuck on that for a long time; I will update this one soon. Thanks!


---




[GitHub] spark issue #21791: [SPARK-24925][SQL] input bytesRead metrics fluctuate fro...

2018-09-07 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/21791
  
gentle ping  @yucai 


---




[GitHub] spark pull request #22353: [SPARK-25357][SQL] Abbreviated simpleString in Da...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22353#discussion_r216122062
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -54,7 +54,7 @@ trait DataSourceScanExec extends LeafExecNode with CodegenSupport {
   override def simpleString: String = {
     val metadataEntries = metadata.toSeq.sorted.map {
       case (key, value) =>
-        key + ": " + StringUtils.abbreviate(redact(value), 100)
--- End diff --

We already have `spark.debug.maxToStringFields` for debugging, so would it be bad to add another `spark.debug.` config? Anyway, if most users don't care about this, I think changing the default number goes too far.


---




[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-09-07 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22262
  
> ... we need this duplication check in case-sensitive mode ...

Do you mean we may define an ORC/Parquet schema with identical field names (even in the same letter case)? Would you please explain a bit more on this?


---




[GitHub] spark issue #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCodec are n...

2018-09-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/22358
  
just FYI, a related discussion: https://github.com/apache/spark/pull/21070#issuecomment-382086510
cc: @rdblue


---




[GitHub] spark issue #22343: [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consi...

2018-09-07 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
@HyukjinKwon @cloud-fan @gatorsmile Could you please kindly help review 
this if you have time?


---




[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-09-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22262
  
BTW, I think we need this duplication check in case-sensitive mode, too. 
I'll ping on previous Parquet PR.


---




[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCode...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22358#discussion_r216121746
  
--- Diff: docs/sql-programming-guide.md ---
@@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
 Sets the compression codec used when writing Parquet files. If either `compression` or
 `parquet.compression` is specified in the table-specific options/properties, the precedence would be
 `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
-none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
--- End diff --

ah, ok.


---




[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...

2018-09-07 Thread ifilonenko
Github user ifilonenko commented on the issue:

https://github.com/apache/spark/pull/22298
  
@holdenk or @mccheah for merge 


---




[GitHub] spark pull request #22262: [SPARK-25175][SQL] Field resolution should fail i...

2018-09-07 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22262#discussion_r216121510
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala ---
@@ -116,6 +116,14 @@ object OrcUtils extends Logging {
         })
       } else {
         val resolver = if (isCaseSensitive) caseSensitiveResolution else caseInsensitiveResolution
+        // Need to fail if there is ambiguity, i.e. more than one field is matched.
+        requiredSchema.fieldNames.foreach { requiredFieldName =>
+          val matchedOrcFields = orcFieldNames.filter(resolver(_, requiredFieldName))
--- End diff --

That's all right :)


---




[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22262
  
**[Test build #95825 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95825/testReport)** for PR 22262 at commit [`26b4710`](https://github.com/apache/spark/commit/26b4710bb47e700d3532da5652fd5ea3f128be20).


---




[GitHub] spark pull request #21654: [SPARK-24671][PySpark] DataFrame length using a d...

2018-09-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21654#discussion_r216121454
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -375,6 +375,9 @@ def _truncate(self):
         return int(self.sql_ctx.getConf(
             "spark.sql.repl.eagerEval.truncate", "20"))
 
+    def __len__(self):
--- End diff --

Would it be better just not to define this? RDD doesn't have it either. IMHO, allowing such things bit by bit wouldn't be ideal. For example, `column.py` ended up with a weird limitation:

```python
>>> iter(spark.range(1).id)
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
>>> isinstance(spark.range(1).id, collections.Iterable)
True
```

It makes general sense, though.

This `__iter__` can't be removed, BTW, because we implement `__getitem__` and `__getattr__` to access columns in DataFrames, IIRC.

`__repr__` was added because it's commonly used and had a strong use case for notebooks, etc. However, for `len()` I wouldn't add it for now. Think about `if len(df): ...` being eagerly evaluated, as sketched below.
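
To illustrate the concern, a minimal sketch assuming `__len__` simply delegated to `count()` (hypothetical; not the PR's exact code):

```python
# Hypothetical DataFrame.__len__ delegating to count():
def __len__(self):
    return self.count()  # launches a full Spark job

# Innocent-looking idioms would then run jobs eagerly:
#   if df: ...     # truth testing falls back to __len__, so this calls count()
#   len(df) > 0    # counts every row when df.take(1) would be enough
```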


---




[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-09-07 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22262
  
@dongjoon-hyun That's all right :). I have reverted to the first commit and 
adjusted the indentation.


---




[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22262
  
**[Test build #95824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95824/testReport)** for PR 22262 at commit [`366bb35`](https://github.com/apache/spark/commit/366bb35ad62edb8e6707e65c681a5b9001cc868e).


---




[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22337
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2943/
Test PASSed.


---




[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22337
  
**[Test build #95822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95822/testReport)** for PR 22337 at commit [`a314776`](https://github.com/apache/spark/commit/a3147760b025e6592dd80d858ae4757bd907a72c).


---




[GitHub] spark issue #22360: [MINOR][ML] Remove `BisectingKMeansModel.setDistanceMeas...

2018-09-07 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22360
  
Oh right, the model, not the implementation. CC @mgaido91 too. OK we can 
take it out because it was only introduced for 2.4.0.

Hm, so the model still has a distanceMeasure, but it's just not something 
that should be settable.

The bisecting k-means model just wraps the older .mllib model, but I don't see how its getDistanceMeasure would actually return the value inside that wrapped model. I am missing where the wrapper gets updated to return the actual distance measure used.

Is that also an issue or did I miss something?


---




[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21654
  
**[Test build #95823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95823/testReport)** for PR 21654 at commit [`4d0afaf`](https://github.com/apache/spark/commit/4d0afaf3cd046b11e8bae43dc00ddf4b1eb97732).


---




[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22337
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21654: [SPARK-24671][PySpark] DataFrame length using a dunder/m...

2018-09-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21654
  
ok to test


---




[GitHub] spark pull request #22353: [SPARK-25357][SQL] Abbreviated simpleString in Da...

2018-09-07 Thread LantaoJin
Github user LantaoJin commented on a diff in the pull request:

https://github.com/apache/spark/pull/22353#discussion_r216121128
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -54,7 +54,7 @@ trait DataSourceScanExec extends LeafExecNode with CodegenSupport {
   override def simpleString: String = {
     val metadataEntries = metadata.toSeq.sorted.map {
       case (key, value) =>
-        key + ": " + StringUtils.abbreviate(redact(value), 100)
--- End diff --

I think it’s overkill to parameterize it, and Spark users don’t care about it; no one will reset it before submitting an app. Besides, simply raising it to 1000 also resolves the problem in most cases, while anything longer than 1000 chars is still meaningless.


---




[GitHub] spark issue #21180: [SPARK-22674][PYTHON] Disabled _hack_namedtuple for pick...

2018-09-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21180
  
@superbobry, sorry for the back and forth. branch-2.4 has been cut. Can we open another PR that removes the hack? Let's make sure to note the potential impact, the workaround, and a strong justification (your blog) in the PR description. I am still hesitant, but it would be good to get rid of it in 3.0.0. From a cursory look so far, I also think we should eventually get rid of it.


---




[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22298
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22298
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95813/
Test PASSed.


---




[GitHub] spark pull request #22351: [MINOR][SQL] Add a debug log when a SQL text is u...

2018-09-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22351


---




[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22192
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22192
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95814/
Test PASSed.


---




[GitHub] spark issue #22298: [SPARK-25021][K8S] Add spark.executor.pyspark.memory lim...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22298
  
**[Test build #95813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95813/testReport)** for PR 22298 at commit [`ea25b8b`](https://github.com/apache/spark/commit/ea25b8b14013d17fbafe5c338cca0eb28a490482).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22192: [SPARK-24918][Core] Executor Plugin API

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22192
  
**[Test build #95814 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95814/testReport)** for PR 22192 at commit [`cf30ed5`](https://github.com/apache/spark/commit/cf30ed52c0aefc1eda8efffc11e73cafc93a032d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...

2018-09-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22359
  
@wangyum @gengliangwang 

Does this PR aim at the correct behavior for `INSERT OVERWRITE LOCAL DIRECTORY`?
In SPARK-25313, it may make sense because the generated files are under the table and respect the table definition. However, in this PR, this accidentally introduces **case-sensitivity** in the SQL SELECT statement.
```
The schema should be StructType(StructField(ID,LongType,true)) as we SELECT ID FROM view1.
```

This case seems to require a new SPARK JIRA issue and more discussion on the goal. Also, this should be allowed only when `spark.sql.caseSensitive=true`.

cc @gatorsmile @cloud-fan 


---




[GitHub] spark issue #22351: [MINOR][SQL] Add a debug log when a SQL text is used for...

2018-09-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22351
  
Merged to master.


---




[GitHub] spark pull request #22349: [SPARK-25345][ML] Deprecate public APIs from Imag...

2018-09-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22349#discussion_r216120943
  
--- Diff: python/pyspark/ml/image.py ---
@@ -30,6 +30,7 @@
 from pyspark import SparkContext
 from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string
 from pyspark.sql import DataFrame, SparkSession
+import warnings
--- End diff --

Technically, the built-in package should be ordered above, per PEP 8. I wonder why this wasn't caught.

```
import sys
import warnings
 
import numpy as np

from pyspark import SparkContext
from pyspark.sql.types import Row, _create_row, _parse_datatype_json_string
from pyspark.sql import DataFrame, SparkSession
```


---




[GitHub] spark pull request #22349: [SPARK-25345][ML] Deprecate public APIs from Imag...

2018-09-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22349#discussion_r216120953
  
--- Diff: python/pyspark/ml/image.py ---
@@ -222,7 +226,8 @@ def readImages(self, path, recursive=False, numPartitions=-1,
 
         .. versionadded:: 2.3.0
         """
-
+        warnings.warn("`ImageSchema.readImage` is deprecated. " +
+                      "Use `spark.read.format(\"image\").load(path)` instead.")
--- End diff --

`warnings.warn(..., DeprecationWarning)`
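
That is, something like this sketch (message taken from the diff; only the warning category is added):

```python
warnings.warn(
    "`ImageSchema.readImage` is deprecated. "
    "Use `spark.read.format(\"image\").load(path)` instead.",
    DeprecationWarning)
```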


---




[GitHub] spark pull request #22337: [SPARK-25338][Test] Ensure to call super.beforeAl...

2018-09-07 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22337#discussion_r216120896
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDDSuite.scala ---
@@ -49,8 +49,11 @@ class StateStoreRDDSuite extends SparkFunSuite with BeforeAndAfter with BeforeAn
   }
 
   override def afterAll(): Unit = {
-    super.afterAll()
-    Utils.deleteRecursively(new File(tempDir))
+    try {
+      Utils.deleteRecursively(new File(tempDir))
+    } finally {
+      super.afterAll()

Hmm, I see. Let me keep the original order as 
```
try {
  super.afterAll()
} finally {
  Utils.deleteRecursively(...)
}
```
Thank you for listing them out.


---




[GitHub] spark issue #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCodec are n...

2018-09-07 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22358
  
If the codecs are found, then we support them. One thing we could do is document how to explicitly provide the codec, but I am not sure how many users are confused about it.
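
For reference, a hedged example of providing the codec explicitly via the standard writer option and session conf (paths are illustrative):

```scala
// Per-write: the DataFrameWriter "compression" option takes precedence.
df.write.option("compression", "zstd").parquet("/tmp/out-zstd")

// Session-wide default for Parquet writes:
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
```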


---




[GitHub] spark issue #22357: [SPARK-25363][SQL] Fix schema pruning in where clause by...

2018-09-07 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/22357
  
I have reconstructed my original patch for this issue, but I've discovered 
it will require more work to complete. However, as part of that reconstruction 
I've discovered a couple of cases where our patches create different physical 
plans. The query results are the same, but I'm not sure which—if 
either—plan is correct. I want to go into detail on that, but it's 
complicated and I have to call it quits tonight. I have a flight in the 
morning, and I'll be on break next week.

In the meantime, I'll just copy and paste two queries—based on the data 
in `ParquetSchemaPruningSuite.scala`—with two query plans each.

First query:

select employer.id from contacts where employer is not null

This PR (as of d68f808) produces:

```
== Physical Plan ==
*(1) Project [employer#4442.id AS id#4452]
+- *(1) Filter isnotnull(employer#4442)
   +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct>
```

My WIP patch produces:

```
== Physical Plan ==
*(1) Project [employer#4442.id AS id#4452]
+- *(1) Filter isnotnull(employer#4442)
   +- *(1) FileScan parquet [employer#4442,p#4443] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct>>
```

Second query:

select employer.id from contacts where employer.id = 0

This PR produces:

```
== Physical Plan ==
*(1) Project [employer#4297.id AS id#4308]
+- *(1) Filter (isnotnull(employer#4297) && (employer#4297.id = 0))
   +- *(1) FileScan parquet [employer#4297,p#4298] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct>
```

My WIP patch produces:

```
== Physical Plan ==
*(1) Project [employer#4445.id AS id#4456]
+- *(1) Filter (isnotnull(employer#4445.id) && (employer#4445.id = 0))
   +- *(1) FileScan parquet [employer#4445,p#4446] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct>
```

I wanted to give my thoughts on the differences of these in detail, but I 
have to wrap up my work for the night. I'll be visiting family next week. I 
don't know how responsive I'll be in that time, but I'll at least try to check 
back.

Cheers.


---




[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCode...

2018-09-07 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22358#discussion_r216120854
  
--- Diff: docs/sql-programming-guide.md ---
@@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
 Sets the compression codec used when writing Parquet files. If either `compression` or
 `parquet.compression` is specified in the table-specific options/properties, the precedence would be
 `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
-none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
--- End diff --

I think so given the download page.
![screen shot 2018-09-08 at 12 41 41 pm](https://user-images.githubusercontent.com/6477701/45250388-94d6b280-b364-11e8-85ee-1a67daa3a123.png)



---




[GitHub] spark pull request #22337: [SPARK-25338][Test] Ensure to call super.beforeAl...

2018-09-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22337#discussion_r216120459
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDDSuite.scala ---
@@ -49,8 +49,11 @@ class StateStoreRDDSuite extends SparkFunSuite with BeforeAndAfter with BeforeAn
   }
 
   override def afterAll(): Unit = {
-    super.afterAll()
-    Utils.deleteRecursively(new File(tempDir))
+    try {
+      Utils.deleteRecursively(new File(tempDir))
+    } finally {
+      super.afterAll()
--- End diff --

Ur, are you sure? `StreamingAggregationSuite`, `StreamTest`, and `StreamingDeduplicationSuite` also have the reverse order for `StateStore.stop()`.


---




[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22337
  
**[Test build #95821 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95821/testReport)** for PR 22337 at commit [`6a4cbcf`](https://github.com/apache/spark/commit/6a4cbcf7e7c78c9cd932910f9431b5ea1dd223be).


---




[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22337
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2942/
Test PASSed.


---




[GitHub] spark issue #22337: [SPARK-25338][Test] Ensure to call super.beforeAll() and...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22337
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22337: [SPARK-25338][Test] Ensure to call super.beforeAl...

2018-09-07 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22337#discussion_r216120240
  
--- Diff: external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala ---
@@ -44,20 +44,25 @@ class KafkaRDDSuite extends SparkFunSuite with BeforeAndAfterAll {
   private var sc: SparkContext = _
 
   override def beforeAll {
+    super.beforeAll()
     sc = new SparkContext(sparkConf)
     kafkaTestUtils = new KafkaTestUtils
     kafkaTestUtils.setup()
   }
 
   override def afterAll {
-    if (sc != null) {
-      sc.stop
-      sc = null
-    }
-
-    if (kafkaTestUtils != null) {
-      kafkaTestUtils.teardown()
-      kafkaTestUtils = null
+    try {
+      if (sc != null) {
+        sc.stop
+        sc = null
+      }
+
+      if (kafkaTestUtils != null) {
+        kafkaTestUtils.teardown()
+        kafkaTestUtils = null
+      }
+    } finally {
+      super.afterAll()
--- End diff --

`sc.stop` can throw an exception, so I updated this to use a nested `try-finally`, sketched below.
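
For reference, a sketch of the nested `try-finally` shape described here (assumed; the actual commit may differ):

```scala
override def afterAll(): Unit = {
  try {
    try {
      if (sc != null) {
        sc.stop()  // may throw; must not skip the teardown below
        sc = null
      }
    } finally {
      if (kafkaTestUtils != null) {
        kafkaTestUtils.teardown()
        kafkaTestUtils = null
      }
    }
  } finally {
    super.afterAll()  // always runs last
  }
}
```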


---




[GitHub] spark pull request #22337: [SPARK-25338][Test] Ensure to call super.beforeAl...

2018-09-07 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22337#discussion_r216120228
  
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala ---
@@ -51,8 +51,8 @@ class KafkaRelationSuite extends QueryTest with SharedSQLContext with KafkaTest
     if (testUtils != null) {
       testUtils.teardown()
       testUtils = null
-      super.afterAll()
     }
+    super.afterAll()
--- End diff --

Good catch, added `try-finally`


---




[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-09-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22262
  
@seancxmao. I made a mistake yesterday. Could you restore your first commit? In that commit, please adjust the indentation at [line 140](https://github.com/apache/spark/pull/22262/commits/366bb35ad62edb8e6707e65c681a5b9001cc868e#diff-3fb8426b690ab771c4f67f9cad336498R140). Sorry for the back and forth!




---




[GitHub] spark pull request #22337: [SPARK-25338][Test] Ensure to call super.beforeAl...

2018-09-07 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/22337#discussion_r216119902
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreRDDSuite.scala ---
@@ -49,8 +49,11 @@ class StateStoreRDDSuite extends SparkFunSuite with BeforeAndAfter with BeforeAn
   }
 
   override def afterAll(): Unit = {
-    super.afterAll()
-    Utils.deleteRecursively(new File(tempDir))
+    try {
+      Utils.deleteRecursively(new File(tempDir))
+    } finally {
+      super.afterAll()
--- End diff --

Sure, let me change the order in `FlatMapGroupsWithStateSuite`, too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22363: [SPARK-25375][SQL][TEST] Reenable qualified perm. functi...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22363
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22363: [SPARK-25375][SQL][TEST] Reenable qualified perm. functi...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22363
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95816/
Test PASSed.


---




[GitHub] spark issue #22363: [SPARK-25375][SQL][TEST] Reenable qualified perm. functi...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22363
  
**[Test build #95816 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95816/testReport)** for PR 22363 at commit [`e71920e`](https://github.com/apache/spark/commit/e71920ec6e9c2560e39d01943dda0f49716ea1b7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22162: [spark-24442][SQL] Added parameters to control the defau...

2018-09-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/22162
  
@AndrewKL Thanks for your work! I think we can add enough tests for this 
patch without using a spy library.


---




[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22162#discussion_r216119774
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -969,6 +969,20 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
     checkShowString(ds, expected)
   }
 
+  test("SPARK-24442 Show should follow spark.show.default.number.of.rows") {
+    withSQLConf("spark.sql.show.defaultNumRows" -> "100") {
+      (1 to 1000).toDS().as[Int].show
+    }
+  }
+
+  test("SPARK-24442 Show should follow spark.show.default.truncate.characters.per.column") {
+    withSQLConf("spark.sql.show.truncateMaxCharsPerColumn" -> "30") {
+      (1 to 1000).map(x => "123456789_123456789_123456789_")
+        .toDS().as[String]
+        .show

Could you compare the show result, as in this example?

https://github.com/apache/spark/blob/9241e1e7e66574cfafa68791771959dfc39c9684/sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala#L326


---




[GitHub] spark issue #22162: [spark-24442][SQL] Added parameters to control the defau...

2018-09-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/22162
  
Can you fix the style errors?


---




[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22162#discussion_r216119706
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -969,6 +969,20 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
     checkShowString(ds, expected)
   }
 
+  test("SPARK-24442 Show should follow spark.show.default.number.of.rows") {
+    withSQLConf("spark.sql.show.defaultNumRows" -> "100") {
+      (1 to 1000).toDS().as[Int].show
+    }
+  }
+
+  test("SPARK-24442 Show should follow spark.show.default.truncate.characters.per.column") {
+    withSQLConf("spark.sql.show.truncateMaxCharsPerColumn" -> "30") {
--- End diff --

ditto


---




[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22162#discussion_r216119703
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -969,6 +969,20 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
     checkShowString(ds, expected)
   }
 
+  test("SPARK-24442 Show should follow spark.show.default.number.of.rows") {
+    withSQLConf("spark.sql.show.defaultNumRows" -> "100") {
--- End diff --

`SQLConf.SQL_SHOW_DEFAULT_MAX_ROWS.key -> "100"`


---




[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22162#discussion_r216119684
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1950,6 +1962,11 @@ class SQLConf extends Serializable with Logging {
   def parallelFileListingInStatsComputation: Boolean =
     getConf(SQLConf.PARALLEL_FILE_LISTING_IN_STATS_COMPUTATION)
 
+  def sqlShowDefaultMaxRecordsToShow: Int = getConf(SQLConf.SQL_SHOW_DEFAULT_MAX_ROWS)
+
+  def sqlShowDefaultMaxCharsPerTruncatedRow: Int =
--- End diff --

`sqlShowTruncateMaxCharsPerColumn`?


---




[GitHub] spark pull request #22162: [spark-24442][SQL] Added parameters to control th...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22162#discussion_r216119676
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1950,6 +1962,11 @@ class SQLConf extends Serializable with Logging {
   def parallelFileListingInStatsComputation: Boolean =
     getConf(SQLConf.PARALLEL_FILE_LISTING_IN_STATS_COMPUTATION)
 
+  def sqlShowDefaultMaxRecordsToShow: Int = getConf(SQLConf.SQL_SHOW_DEFAULT_MAX_ROWS)
--- End diff --

`sqlShowDefaultMaxRows`?


---




[GitHub] spark issue #22311: [WIP][SPARK-25305][SQL] Respect attribute name in Collap...

2018-09-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/22311
  
Can this PR include the revert of the #22320 change? Or should we keep that change as a safeguard for a while? (I think the second approach is better, so this is just a question.)


---




[GitHub] spark issue #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize keytab...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22362
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize keytab...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22362
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95812/
Test PASSed.


---




[GitHub] spark issue #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize keytab...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22362
  
**[Test build #95812 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95812/testReport)** for PR 22362 at commit [`c5a7fec`](https://github.com/apache/spark/commit/c5a7fec7d8f6784e50b14fafb9575eb4d23208bc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotil CompressionCode...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22358#discussion_r216119384
  
--- Diff: docs/sql-programming-guide.md ---
@@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
 Sets the compression codec used when writing Parquet files. If either `compression` or
 `parquet.compression` is specified in the table-specific options/properties, the precedence would be
 `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
-none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
--- End diff --

Is hadoop-2.9.x officially supported in Spark?


---




[GitHub] spark issue #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize keytab...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22362
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize keytab...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22362
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95811/
Test PASSed.


---




[GitHub] spark issue #22362: [SPARK-25372][YARN][K8S] Deprecate and generalize keytab...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22362
  
**[Test build #95811 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95811/testReport)** for PR 22362 at commit [`30286f8`](https://github.com/apache/spark/commit/30286f85ac315cad2578fe2e11c8718b8883630a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #22262: [SPARK-25175][SQL] Field resolution should fail i...

2018-09-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22262#discussion_r216119226
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
 ---
@@ -116,6 +116,14 @@ object OrcUtils extends Logging {
 })
   } else {
 val resolver = if (isCaseSensitive) caseSensitiveResolution else 
caseInsensitiveResolution
+// Need to fail if there is ambiguity, i.e. more than one field is 
matched.
+requiredSchema.fieldNames.foreach { requiredFieldName =>
+  val matchedOrcFields = orcFieldNames.filter(resolver(_, 
requiredFieldName))
--- End diff --

Sorry, @seancxmao. We need to update this once more. I'll open a PR against your branch.
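
For reference, a minimal sketch of the ambiguity check being discussed; the exception type and message wording are assumptions, not necessarily what the follow-up PR will use:

```scala
// Fail when case-insensitive resolution matches more than one ORC field,
// instead of silently picking the first match.
requiredSchema.fieldNames.foreach { requiredFieldName =>
  val matchedOrcFields = orcFieldNames.filter(resolver(_, requiredFieldName))
  if (matchedOrcFields.size > 1) {
    val matchedFields = matchedOrcFields.mkString("[", ", ", "]")
    throw new RuntimeException(
      s"Found duplicate field(s) $requiredFieldName: $matchedFields in case-insensitive mode")
  }
}
```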


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22353: [SPARK-25357][SQL] Abbreviated simpleString in Da...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22353#discussion_r216119211
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala 
---
@@ -54,7 +54,7 @@ trait DataSourceScanExec extends LeafExecNode with 
CodegenSupport {
   override def simpleString: String = {
 val metadataEntries = metadata.toSeq.sorted.map {
   case (key, value) =>
-key + ": " + StringUtils.abbreviate(redact(value), 100)
--- End diff --

How about parameterizing the second value (the abbreviation length) as a new `SQLConf`? Or defining `verboseString`?
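
A sketch of what such a conf could look like, following the existing `SQLConf` pattern; the key name and default below are illustrative assumptions only:

```scala
val MAX_METADATA_STRING_LENGTH =
  buildConf("spark.sql.maxMetadataStringLength")
    .doc("Maximum length of metadata strings shown in the simpleString of a " +
      "scan node; longer values are abbreviated.")
    .intConf
    // commons-lang3 StringUtils.abbreviate requires a width of at least 4.
    .checkValue(_ > 3, "This value must be greater than 3.")
    .createWithDefault(100)

// The call site would then read the conf instead of the literal 100:
// key + ": " + StringUtils.abbreviate(redact(value), conf.maxMetadataStringLength)
```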


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22349: [SPARK-25345][ML] Deprecate public APIs from ImageSchema

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22349
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95817/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22349: [SPARK-25345][ML] Deprecate public APIs from ImageSchema

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22349
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22349: [SPARK-25345][ML] Deprecate public APIs from ImageSchema

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22349
  
**[Test build #95817 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95817/testReport)**
 for PR 22349 at commit 
[`5cefd23`](https://github.com/apache/spark/commit/5cefd23b3842736632c29c40ece1a2251d56efaf).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22347: [SPARK-25353][SQL] executeTake in SparkPlan is modified ...

2018-09-07 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/22347
  
BTW, do we see any actual performance benefit (wall time) from this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21618#discussion_r216118997
  
--- Diff: core/src/main/java/org/apache/hadoop/fs/SparkGlobber.java ---
@@ -0,0 +1,293 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.fs;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.commons.lang3.StringUtils;
+import org.apache.commons.logging.LogFactory;
+import org.apache.commons.logging.Log;
+
+/**
+ * This is based on hadoop-common-2.7.2
+ * {@link org.apache.hadoop.fs.Globber}.
+ * This class exposes globWithThreshold, which can be used to glob paths in
+ * parallel.
+ */
+public class SparkGlobber {
+  public static final Log LOG = 
LogFactory.getLog(SparkGlobber.class.getName());
+
+  private final FileSystem fs;
+  private final FileContext fc;
+  private final Path pathPattern;
+
+  public SparkGlobber(FileSystem fs, Path pathPattern) {
+this.fs = fs;
+this.fc = null;
+this.pathPattern = pathPattern;
+  }
+
+  public SparkGlobber(FileContext fc, Path pathPattern) {
+this.fs = null;
+this.fc = fc;
+this.pathPattern = pathPattern;
+  }
+
+  private FileStatus getFileStatus(Path path) throws IOException {
+try {
+  if (fs != null) {
+return fs.getFileStatus(path);
+  } else {
+return fc.getFileStatus(path);
+  }
+} catch (FileNotFoundException e) {
+  return null;
+}
+  }
+
+  private FileStatus[] listStatus(Path path) throws IOException {
+try {
+  if (fs != null) {
+return fs.listStatus(path);
+  } else {
+return fc.util().listStatus(path);
+  }
+} catch (FileNotFoundException e) {
+  return new FileStatus[0];
+}
+  }
+
+  private Path fixRelativePart(Path path) {
+if (fs != null) {
+  return fs.fixRelativePart(path);
+} else {
+  return fc.fixRelativePart(path);
+}
+  }
+
+  /**
+   * Convert a path component that contains backslash escape sequences to a
+   * literal string.  This is necessary when you want to explicitly refer 
to a
+   * path that contains globber metacharacters.
+   */
+  private static String unescapePathComponent(String name) {
+return name.replaceAll("\\\\(.)", "$1");
+  }
+
+  /**
+   * Translate an absolute path into a list of path components.
+   * We merge double slashes into a single slash here.
+   * POSIX root path, i.e. '/', does not get an entry in the list.
+   */
+  private static List<String> getPathComponents(String path)
+  throws IOException {
+ArrayList<String> ret = new ArrayList<String>();
+for (String component : path.split(Path.SEPARATOR)) {
+  if (!component.isEmpty()) {
+ret.add(component);
+  }
+}
+return ret;
+  }
+
+  private String schemeFromPath(Path path) throws IOException {
+String scheme = path.toUri().getScheme();
+if (scheme == null) {
+  if (fs != null) {
+scheme = fs.getUri().getScheme();
+  } else {
+scheme = 
fc.getFSofPath(fc.fixRelativePart(path)).getUri().getScheme();
+  }
+}
+return scheme;
+  }
+
+  private String authorityFromPath(Path path) throws IOException {
+String authority = path.toUri().getAuthority();
+if (authority == null) {
+  if (fs != null) {
+authority = fs.getUri().getAuthority();
+  } else {
+authority = 
fc.getFSofPath(fc.fixRelativePart(path)).getUri().getAuthority();
+  }
+}
+return authority;
+  }
+
+  public FileStatus[] globWithThreshold(int threshold) throws IOException {
+// First we get the scheme and authority of the pattern.

[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22213
  
**[Test build #95820 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95820/testReport)**
 for PR 22213 at commit 
[`603fdc9`](https://github.com/apache/spark/commit/603fdc9c148e2a36393140054c559f5a0a380e5d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22213: [SPARK-25221][DEPLOY] Consistent trailing whitespace tre...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22213
  
**[Test build #95819 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95819/testReport)**
 for PR 22213 at commit 
[`d341dae`](https://github.com/apache/spark/commit/d341dae0511fd9456ab3e0bd686e11749d43aeb7).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21618#discussion_r216118822
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -724,4 +726,37 @@ object DataSource extends Logging {
  """.stripMargin)
 }
   }
+
+  /**
+   * Return all paths represented by the wildcard string.
+   * This is done in the main thread by default; when the value of the config
+   * `spark.sql.sources.parallelGetGlobbedPath.numThreads` is > 0, a local
+   * thread pool will expand the globbed paths.
--- End diff --

Can you add a note describing how this differs from `InMemoryFileIndex.listLeafFiles`?
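
For illustration, a minimal sketch (under assumed names) of what "expand the globbed paths in a local thread pool" means. Unlike `InMemoryFileIndex.listLeafFiles`, which lists the leaf files under already-resolved root paths (possibly as a distributed Spark job), this only expands the glob patterns themselves on the driver:

```scala
import java.util.concurrent.Executors

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

import org.apache.hadoop.fs.{FileSystem, Path}

// With numThreads <= 0 the expansion stays in the calling thread, matching
// the docstring above; otherwise each pattern is globbed on a local pool.
def expandGlobbedPaths(fs: FileSystem, patterns: Seq[Path], numThreads: Int): Seq[Path] = {
  def globOne(p: Path): Seq[Path] =
    Option(fs.globStatus(p)).toSeq.flatten.map(_.getPath)

  if (numThreads <= 0) {
    patterns.flatMap(globOne)
  } else {
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val futures = patterns.map(p => Future(globOne(p)))
      futures.flatMap(f => Await.result(f, Duration.Inf))
    } finally {
      pool.shutdown()
    }
  }
}
```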


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21618#discussion_r216118793
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -724,4 +726,37 @@ object DataSource extends Logging {
  """.stripMargin)
 }
   }
+
+  /**
+   * Return all paths represented by the wildcard string.
+   * This is done in the main thread by default; when the value of the config
+   * `spark.sql.sources.parallelGetGlobbedPath.numThreads` is > 0, a local
+   * thread pool will expand the globbed paths.
+   */
+  private def getGlobbedPaths(
--- End diff --

Move this function into `SparkHadoopUtil`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21618#discussion_r216118763
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -656,6 +656,25 @@ object SQLConf {
   .intConf
   .createWithDefault(1)
 
+  val PARALLEL_GET_GLOBBED_PATH_THRESHOLD =
+buildConf("spark.sql.sources.parallelGetGlobbedPath.threshold")
+  .doc("The maximum number of subfiles or directories allowed after a 
globbed path " +
+"expansion.")
+  .intConf
+  .checkValue(threshold => threshold >= 0, "The maximum number of 
subfiles or directories " +
--- End diff --

Should this conf be marked `internal()`?
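
If so, a sketch would just add `internal()` to the builder chain (the default value below is an illustrative assumption, not from the PR):

```scala
val PARALLEL_GET_GLOBBED_PATH_THRESHOLD =
  buildConf("spark.sql.sources.parallelGetGlobbedPath.threshold")
    .internal()
    .doc("The maximum number of subfiles or directories allowed after a " +
      "globbed path expansion.")
    .intConf
    .checkValue(_ >= 0, "The maximum number of subfiles or directories " +
      "must not be negative.")
    .createWithDefault(32) // illustrative default only
```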


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...

2018-09-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22359
  
**[Test build #95818 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95818/testReport)**
 for PR 22359 at commit 
[`71f382b`](https://github.com/apache/spark/commit/71f382bd7c9fdfda38b1bc8063b2a55dd56c00b4).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22359
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...

2018-09-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22359
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2941/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21618#discussion_r216118629
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -724,4 +726,37 @@ object DataSource extends Logging {
  """.stripMargin)
 }
   }
+
+  /**
+   * Return all paths represented by the wildcard string.
+   * This is done in the main thread by default; when the value of the config
+   * `spark.sql.sources.parallelGetGlobbedPath.numThreads` is > 0, a local
+   * thread pool will expand the globbed paths.
+   */
+  private def getGlobbedPaths(
+  sparkSession: SparkSession,
--- End diff --

Can you drop `sparkSession`, then add `parallelGetGlobbedPathThreshold`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21618: [SPARK-20408][SQL] Get the glob path in parallel ...

2018-09-07 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21618#discussion_r216118583
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1557,6 +1576,15 @@ class SQLConf extends Serializable with Logging {
   def parallelPartitionDiscoveryParallelism: Int =
 getConf(SQLConf.PARALLEL_PARTITION_DISCOVERY_PARALLELISM)
 
+  def parallelGetGlobbedPathThreshold: Int =
+getConf(SQLConf.PARALLEL_GET_GLOBBED_PATH_THRESHOLD)
+
+  def parallelGetGlobbedPathNumThreads: Int =
+getConf(SQLConf.PARALLEL_GET_GLOBBED_PATH_NUM_THREADS)
+
+  def parallelGetGlobbedPathEnabled: Boolean =
--- End diff --

Please move this into `DataSource`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22361: Revert [SPARK-10399] [SPARK-23879] [SPARK-23762] [SPARK-...

2018-09-07 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22361
  
@gatorsmile Overall, I agree with the revert, since the performance degradation is 
confirmed.

When I ran TPC-DS in #19222, I did not see a performance regression like the 
one reported 
[here](https://github.com/apache/spark/pull/19222#issuecomment-378154854). For 
future analysis, I am very interested in:
* execution time of each query
* environment (machine, OS, JVM, cluster or local, ...)
* scaling factor
* number of repeats per query
* others

@npoggi and @winglungngai Thanks for your investigations. Would it be 
possible to share the above information?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveD...

2018-09-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22359#discussion_r216118310
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -803,6 +803,25 @@ class HiveDDLSuite
 }
   }
 
+  test("Insert overwrite directory should output correct schema") {
--- End diff --

In this PR, let's handle this test case only.
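
For readers following along, a hedged sketch of the kind of regression test under discussion; the exact SQL, storage format, and assertions in the PR may differ:

```scala
test("Insert overwrite directory should output correct schema") {
  withTable("tbl") {
    sql("CREATE TABLE tbl(id LONG)")
    sql("INSERT INTO tbl VALUES (4)")
    withTempPath { dir =>
      val path = dir.getCanonicalPath
      // The bug under SPARK-25313: files written below could carry the wrong
      // column name rather than the attribute name from the SELECT clause.
      sql(s"INSERT OVERWRITE LOCAL DIRECTORY '$path' STORED AS PARQUET " +
        "SELECT id AS ID FROM tbl")
      assert(spark.read.parquet(path).schema.fieldNames.toSeq === Seq("ID"))
    }
  }
}
```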


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


