[GitHub] spark pull request #21107: [DO-NOT-MERGE][WIP] Explicitly print out skipped ...

2018-04-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21107#discussion_r183226283
  
--- Diff: python/run-tests.py ---
@@ -152,65 +172,17 @@ def parse_opts():
 return opts
 
 
-def _check_dependencies(python_exec, modules_to_test):
-if "COVERAGE_PROCESS_START" in os.environ:
-# Make sure if coverage is installed.
-try:
-subprocess_check_output(
-[python_exec, "-c", "import coverage"],
-stderr=open(os.devnull, 'w'))
-except:
-print_red("Coverage is not installed in Python executable '%s' 
"
-  "but 'COVERAGE_PROCESS_START' environment variable 
is set, "
-  "exiting." % python_exec)
-sys.exit(-1)
-
-# If we should test 'pyspark-sql', it checks if PyArrow and Pandas are 
installed and
-# explicitly prints out. See SPARK-23300.
-if pyspark_sql in modules_to_test:
-# TODO(HyukjinKwon): Relocate and deduplicate these version 
specifications.
-minimum_pyarrow_version = '0.8.0'
--- End diff --

We are now relying on the existing checks in the tests. For example:


https://github.com/apache/spark/blob/ab7b961a4fe96ca02b8352d16b0fa80c972b67fc/python/pyspark/sql/tests.py#L63-L69


https://github.com/apache/spark/blob/ab7b961a4fe96ca02b8352d16b0fa80c972b67fc/python/pyspark/sql/tests.py#L3121-L3123

which prints out a skip message like:

```
test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'
```

which I am capturing here with a regex pattern.
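
A minimal sketch of how such skip lines could be picked out of verbose test output. The pattern and the helper name `find_skipped_tests` are assumptions made for illustration; this is not the regex used in the pull request.

```python
import re

# Hypothetical pattern (not the one from the pull request). It matches a
# verbose unittest line of the form:
#   test_name (module.Class) ... skipped 'reason'
SKIPPED_TEST = re.compile(
    r"^(?P<test>\w+) \((?P<suite>[\w.]+)\) \.\.\. skipped "
    r"(?P<quote>['\"])(?P<reason>.*)(?P=quote)$"
)


def find_skipped_tests(output):
    """Return (test, suite, reason) tuples for every skipped test in output."""
    skipped = []
    for line in output.splitlines():
        match = SKIPPED_TEST.match(line.strip())
        if match:
            skipped.append(
                (match.group("test"), match.group("suite"), match.group("reason")))
    return skipped


sample_output = "\n".join([
    "test_createDataFrame_does_not_modify_input (pyspark.sql.tests.ArrowTests) ... "
    "skipped 'PyArrow >= 0.8.0 must be installed; however, it was not found.'",
    "test_filtered_frame (pyspark.sql.tests.ArrowTests) ... ok",
])

for test, suite, reason in find_skipped_tests(sample_output):
    print("%s (%s): %s" % (test, suite, reason))
```

The second sample line is an invented passing test, included only to show that non-skipped lines are ignored.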



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21107: [DO-NOT-MERGE][WIP] Explicitly print out skipped tests f...

2018-04-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21107
  
@BryanCutler, will check and update after testing out.


---




[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

2018-04-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20280
  
@BryanCutler, would you mind clarifying in the PR description what happens in 
end-to-end cases (e.g. before and after, with the reasons)? The change looks 
small, but it is possibly a breaking change for end-to-end cases, although I 
think for now we are restoring the correct, expected behaviour.


---




[GitHub] spark issue #21120: [SPARK-22448][ML] Added sum function to Summerizer and M...

2018-04-21 Thread dedunumax
Github user dedunumax commented on the issue:

https://github.com/apache/spark/pull/21120
  
cc @rxin @cloud-fan @gatorsmile 


---




[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

2018-04-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20280
  
BTW, from a very quick look, I believe it's not so easy to pass a configuration, 
because the exception would usually be thrown in a Python worker process.


---




[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

2018-04-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20280
  
If the renaming scenario worked as expected in most cases, I think it'd 
be worthwhile to have a configuration; however, the previous behaviour actually 
looks odd because it only works under certain weird conditions, when the 
fields in `Row` and the fields in the given schema are in the same alphabetical 
order (https://github.com/apache/spark/pull/20280#discussion_r182569705). 
Otherwise this case already fails as well.

The test case modified in 
https://github.com/apache/spark/pull/20280#discussion_r182569705 actually works 
only because `key` and `value` in `Row` and `a` and `b` in the schema are in 
the same order. I think the test case should be considered invalid.

I thought about this for a while and failed to describe what the 
configuration would do. It sounded like describing a bug as if it were proper 
behaviour that can be controlled by a configuration.

So far this sounds more like a bug fix to me. A workaround should 
be relatively easy. Would it be good enough to describe the workaround in 
the guide instead? I think it should be fine if we just use a map to convert 
`Row` to something like a tuple.
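
A rough sketch of that workaround in plain Python, so it runs without Spark. `make_row` is a hypothetical stand-in for a `Row` built from keyword arguments (it sorts the field names, as discussed above), and `row_to_tuple` is an illustrative helper, not an existing PySpark function.

```python
from collections import namedtuple


def make_row(**kwargs):
    """Hypothetical stand-in for a Row built from kwargs; fields are sorted by name."""
    names = sorted(kwargs)
    row_type = namedtuple("Row", names)
    return row_type(*[kwargs[name] for name in names])


def row_to_tuple(row, field_order):
    """Pick values by field name so the result follows field_order, not the row's own order."""
    return tuple(getattr(row, name) for name in field_order)


rows = [make_row(value="a", key=1), make_row(value="b", key=2)]

# Positional access follows the sorted field names: ('key', 'value').
print(rows[0])

# Convert explicitly to the order the target schema expects, here (value, key).
converted = list(map(lambda r: row_to_tuple(r, ["value", "key"]), rows))
print(converted)
```

With an actual RDD of `Row` objects, the same idea would presumably be a `map` that builds the tuple by field name before the schema is applied, so the result no longer depends on the alphabetical ordering.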


---




[GitHub] spark issue #20930: [SPARK-23811][Core] FetchFailed comes before Success of ...

2018-04-21 Thread Ngone51
Github user Ngone51 commented on the issue:

https://github.com/apache/spark/pull/20930
  
> because we can get the MapStatus, but get a 'null'. If I'm not mistaken, 
this also because the ExecutorLost trigger removeOutputsOnExecutor

If there's a `null` MapStatus for stage 2, how can it retry 4 times without 
any tasks? IIUC, `null` MapStatus leads to missing partition, which means there 
will be some tasks to submit.

As for stage 3's shuffle Id, that's really weird. Hope you can fix it! 
@xuanyuanking 


---




[GitHub] spark pull request #21116: [SPARK-24038][SS] Refactor continuous writing to ...

2018-04-21 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21116#discussion_r183224838
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/WriteToContinuousDataSourceExec.scala
 ---
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.streaming.continuous
+
+import scala.util.control.NonFatal
+
+import org.apache.spark.{SparkEnv, SparkException, TaskContext}
+import org.apache.spark.internal.Logging
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.execution.SparkPlan
+import 
org.apache.spark.sql.execution.datasources.v2.{DataWritingSparkTask, 
InternalRowDataWriterFactory}
+import 
org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask.{logError, 
logInfo}
+import org.apache.spark.sql.execution.streaming.StreamExecution
+import org.apache.spark.sql.sources.v2.writer._
+import org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter
+import org.apache.spark.util.Utils
+
+/**
+ * The physical plan for writing data into a continuous processing 
[[StreamWriter]].
+ */
+case class WriteToContinuousDataSourceExec(writer: StreamWriter, query: 
SparkPlan)
+extends SparkPlan with Logging {
+  override def children: Seq[SparkPlan] = Seq(query)
+  override def output: Seq[Attribute] = Nil
+
+  override protected def doExecute(): RDD[InternalRow] = {
+val writerFactory = writer match {
+  case w: SupportsWriteInternalRow => 
w.createInternalRowWriterFactory()
+  case _ => new 
InternalRowDataWriterFactory(writer.createWriterFactory(), query.schema)
+}
+
+val rdd = query.execute()
+val messages = new Array[WriterCommitMessage](rdd.partitions.length)
+
+logInfo(s"Start processing data source writer: $writer. " +
+  s"The input RDD has ${messages.length} partitions.")
+// Let the epoch coordinator know how many partitions the write RDD 
has.
+EpochCoordinatorRef.get(
+
sparkContext.getLocalProperty(ContinuousExecution.EPOCH_COORDINATOR_ID_KEY),
+sparkContext.env)
+  .askSync[Unit](SetWriterPartitions(rdd.getNumPartitions))
+
+try {
+  // Force the RDD to run so continuous processing starts; no data is 
actually being collected
+  // to the driver, as ContinuousWriteRDD outputs nothing.
+  sparkContext.runJob(
+rdd,
+(context: TaskContext, iter: Iterator[InternalRow]) =>
+  WriteToContinuousDataSourceExec.run(writerFactory, context, 
iter),
+rdd.partitions.indices)
+} catch {
+  case _: InterruptedException =>
+// Interruption is how continuous queries are ended, so accept and 
ignore the exception.
+  case cause: Throwable =>
+cause match {
+  // Do not wrap interruption exceptions that will be handled by 
streaming specially.
+  case _ if StreamExecution.isInterruptionException(cause) => 
throw cause
+  // Only wrap non fatal exceptions.
+  case NonFatal(e) => throw new SparkException("Writing job 
aborted.", e)
+  case _ => throw cause
+}
+}
+
+sparkContext.emptyRDD
+  }
+}
+
+object WriteToContinuousDataSourceExec extends Logging {
+  def run(
+  writeTask: DataWriterFactory[InternalRow],
+  context: TaskContext,
+  iter: Iterator[InternalRow]): Unit = {
+val epochCoordinator = EpochCoordinatorRef.get(
+  
context.getLocalProperty(ContinuousExecution.EPOCH_COORDINATOR_ID_KEY),
+  SparkEnv.get)
+val currentMsg: WriterCommitMessage = null
--- End diff --

currentMsg is no longer needed?


---


[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

2018-04-21 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/20280
  
I'm kinda worried the example you gave above is actually fairly common: 
construct with kwargs, and then (re-)name the columns.

perhaps worthwhile to consider a config switch?



---




[GitHub] spark issue #21071: [SPARK-21962][CORE] Distributed Tracing in Spark

2018-04-21 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/21071
  
yap... HTrace is 
[retired](http://mail-archives.apache.org/mod_mbox/htrace-dev/201804.mbox/%3Cpony-b7497055821405926d63668ab1112e0f108e2346-2561e81afc434e2d237bbeb5b5921941503445e4%40dev.htrace.apache.org%3E).


---




[GitHub] spark issue #20940: [SPARK-23429][CORE] Add executor memory metrics to heart...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20940
  
**[Test build #89685 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89685/testReport)**
 for PR 20940 at commit 
[`ae8a388`](https://github.com/apache/spark/commit/ae8a388405d8d3402b5b6a45a7c7855d90538edb).


---




[GitHub] spark issue #20930: [SPARK-23811][Core] FetchFailed comes before Success of ...

2018-04-21 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/20930
  

![image](https://user-images.githubusercontent.com/4833765/39091106-ff11d0a6-461f-11e8-968f-7fcbe6652bb3.png)

Stages 0/1/2/3 are the same as 20/21/22/23 in this screenshot. Stage 2's 
shuffleId being 1 while stage 3's is 0 shouldn't be able to happen.

Good description of the scenario: we can't get a FetchFailed because we can 
get the MapStatus, but we get a 'null'. If I'm not mistaken, this is also 
because the ExecutorLost triggers `removeOutputsOnExecutor`.

Happy to discuss with everyone, and sorry that I can't give a more detailed log 
after checking the root cause; this happened in Baidu's online environment, 
which can't keep all logs for one month. I'll keep working on a fix for this 
case and capture as many detailed logs as possible.




---




[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21082
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89682/
Test FAILed.


---




[GitHub] spark issue #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet produc...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21052
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89684/
Test PASSed.


---




[GitHub] spark issue #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet produc...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21052
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21082
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21082
  
**[Test build #89682 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89682/testReport)**
 for PR 21082 at commit 
[`657a6a5`](https://github.com/apache/spark/commit/657a6a5ababbf816db8bbd19475b8e3e5f4aa2ae).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet produc...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21052
  
**[Test build #89684 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89684/testReport)**
 for PR 21052 at commit 
[`8369cbc`](https://github.com/apache/spark/commit/8369cbcd5eab3686c78365e1b1f906a3e8136731).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21122
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89681/
Test PASSed.


---




[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21122
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21122
  
**[Test build #89681 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89681/testReport)**
 for PR 21122 at commit 
[`c62bba1`](https://github.com/apache/spark/commit/c62bba1ed024c7d1d91da8f3d8035de8dc169302).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait ExternalCatalog `
  * `  // Returns the underlying catalog class (e.g., HiveExternalCatalog).`
  * `class ExternalCatalogWithListener(delegate: ExternalCatalog)`


---




[GitHub] spark pull request #20787: [MINOR][DOCS] Documenting months_between directio...

2018-04-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/20787#discussion_r183221673
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1117,11 +1117,21 @@ case class AddMonths(startDate: Expression, 
numMonths: Expression)
 }
 
 /**
- * Returns number of months between dates date1 and date2.
+ * Returns number of months between dates `timestamp1` and `timestamp2`.
+ * If `timestamp` is later than `timestamp2`, then the result is positive.
--- End diff --

Nit: timestamp -> timestamp1. Same below.
Nit: These are called date1 and date2 in Python, and also here in the Scala 
code. Worth being consistent?


---




[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21121
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21121
  
**[Test build #89683 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89683/testReport)**
 for PR 21121 at commit 
[`a599544`](https://github.com/apache/spark/commit/a599544b134d5c14936d76d607466adf1529370e).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21121
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89683/
Test FAILed.


---




[GitHub] spark issue #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet produc...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21052
  
**[Test build #89684 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89684/testReport)**
 for PR 21052 at commit 
[`8369cbc`](https://github.com/apache/spark/commit/8369cbcd5eab3686c78365e1b1f906a3e8136731).


---




[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21121
  
**[Test build #89683 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89683/testReport)**
 for PR 21121 at commit 
[`a599544`](https://github.com/apache/spark/commit/a599544b134d5c14936d76d607466adf1529370e).


---




[GitHub] spark pull request #21121: [SPARK-24042][SQL] Collection function: zip_with_...

2018-04-21 Thread mn-mikke
Github user mn-mikke commented on a diff in the pull request:

https://github.com/apache/spark/pull/21121#discussion_r183220685
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -883,3 +884,139 @@ case class Concat(children: Seq[Expression]) extends 
Expression {
 
   override def sql: String = s"concat(${children.map(_.sql).mkString(", 
")})"
 }
+
+/**
+ * Returns the maximum value in the array.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(array[, indexFirst]) - Transforms the input array by 
encapsulating elements into pairs with indexes indicating the order.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(array("d", "a", null, "b"));
+   [("d",0),("a",1),(null,2),("b",3)]
+  > SELECT _FUNC_(array("d", "a", null, "b"), true);
+   [(0,"d"),(1,"a"),(2,null),(3,"b")]
+  """,
+  since = "2.4.0")
--- End diff --

Done.


---




[GitHub] spark issue #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet produc...

2018-04-21 Thread mshtelma
Github user mshtelma commented on the issue:

https://github.com/apache/spark/pull/21052
  
@gatorsmile I have removed explain() and changed formatting


---




[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...

2018-04-21 Thread mshtelma
Github user mshtelma commented on a diff in the pull request:

https://github.com/apache/spark/pull/21052#discussion_r183220650
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -382,4 +382,32 @@ class StatisticsCollectionSuite extends 
StatisticsCollectionTestBase with Shared
   }
 }
   }
+
+  test("Simple queries must be working, if CBO is turned on") {
+withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+  withTable("TBL1", "TBL") {
+import org.apache.spark.sql.functions._
+val df = spark.range(1000L).select('id,
+  'id * 2 as "FLD1",
+  'id * 12 as "FLD2",
+  lit("aaa") + 'id as "fld3")
+df.write
+  .mode(SaveMode.Overwrite)
+  .bucketBy(10, "id", "FLD1", "FLD2")
+  .sortBy("id", "FLD1", "FLD2")
+  .saveAsTable("TBL")
+sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
+sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, 
FLD2, FLD3")
+val df2 = spark.sql(
+  """
+ SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
+ FROM tbl t1
+ JOIN tbl t2 on t1.id=t2.id
+ WHERE  t1.fld3 IN (-123.23,321.23)
+  """.stripMargin)
+df2.createTempView("TBL2")
+sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe')  ").explain()
--- End diff --

done


---




[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...

2018-04-21 Thread mshtelma
Github user mshtelma commented on a diff in the pull request:

https://github.com/apache/spark/pull/21052#discussion_r183220647
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -382,4 +382,32 @@ class StatisticsCollectionSuite extends 
StatisticsCollectionTestBase with Shared
   }
 }
   }
+
+  test("Simple queries must be working, if CBO is turned on") {
+withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+  withTable("TBL1", "TBL") {
+import org.apache.spark.sql.functions._
+val df = spark.range(1000L).select('id,
+  'id * 2 as "FLD1",
+  'id * 12 as "FLD2",
+  lit("aaa") + 'id as "fld3")
+df.write
+  .mode(SaveMode.Overwrite)
+  .bucketBy(10, "id", "FLD1", "FLD2")
+  .sortBy("id", "FLD1", "FLD2")
+  .saveAsTable("TBL")
+sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
+sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, 
FLD2, FLD3")
+val df2 = spark.sql(
+  """
+ SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
+ FROM tbl t1
+ JOIN tbl t2 on t1.id=t2.id
+ WHERE  t1.fld3 IN (-123.23,321.23)
+  """.stripMargin)
--- End diff --

done


---




[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21082
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21082
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2563/
Test PASSed.


---




[GitHub] spark issue #21082: [SPARK-22239][SQL][Python] Enable grouped aggregate pand...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21082
  
**[Test build #89682 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89682/testReport)**
 for PR 21082 at commit 
[`657a6a5`](https://github.com/apache/spark/commit/657a6a5ababbf816db8bbd19475b8e3e5f4aa2ae).


---




[GitHub] spark pull request #21082: [SPARK-22239][SQL][Python] Enable grouped aggrega...

2018-04-21 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/21082#discussion_r183220435
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -5156,6 +5156,15 @@ def test_retain_group_columns(self):
 expected1 = df.groupby(df.id).agg(sum(df.v))
 self.assertPandasEqual(expected1.toPandas(), 
result1.toPandas())
 
+def test_array_type(self):
--- End diff --

This is related, but I figured it shouldn't hurt to add an array test in 
GroupedAggPandasUDFTests.


---




[GitHub] spark pull request #21082: [SPARK-22239][SQL][Python] Enable grouped aggrega...

2018-04-21 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/21082#discussion_r183220392
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala
 ---
@@ -149,7 +149,7 @@ class AnalysisErrorSuite extends AnalysisTest {
   UnresolvedAttribute("a") :: Nil,
   SortOrder(UnresolvedAttribute("b"), Ascending) :: Nil,
   UnspecifiedFrame)).as('window)),
-"not supported within a window function" :: Nil)
+"does not have any window functions" :: Nil)
--- End diff --

This is because an early analysis exception is thrown by rule 
ExtractWindowExpressions


---




[GitHub] spark issue #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet produc...

2018-04-21 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21052
  
LGTM except for two minor comments.


---




[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...

2018-04-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21052#discussion_r183219812
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -382,4 +382,32 @@ class StatisticsCollectionSuite extends 
StatisticsCollectionTestBase with Shared
   }
 }
   }
+
+  test("Simple queries must be working, if CBO is turned on") {
+withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+  withTable("TBL1", "TBL") {
+import org.apache.spark.sql.functions._
+val df = spark.range(1000L).select('id,
+  'id * 2 as "FLD1",
+  'id * 12 as "FLD2",
+  lit("aaa") + 'id as "fld3")
+df.write
+  .mode(SaveMode.Overwrite)
+  .bucketBy(10, "id", "FLD1", "FLD2")
+  .sortBy("id", "FLD1", "FLD2")
+  .saveAsTable("TBL")
+sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
+sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, 
FLD2, FLD3")
+val df2 = spark.sql(
+  """
+ SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
+ FROM tbl t1
+ JOIN tbl t2 on t1.id=t2.id
+ WHERE  t1.fld3 IN (-123.23,321.23)
+  """.stripMargin)
--- End diff --

Nit:
```Scala
  """
|SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
|FROM tbl t1
|JOIN tbl t2 on t1.id=t2.id
|WHERE  t1.fld3 IN (-123.23,321.23)
  """.stripMargin)
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...

2018-04-21 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21052#discussion_r183219803
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -382,4 +382,32 @@ class StatisticsCollectionSuite extends 
StatisticsCollectionTestBase with Shared
   }
 }
   }
+
+  test("Simple queries must be working, if CBO is turned on") {
+withSQLConf(SQLConf.CBO_ENABLED.key -> "true") {
+  withTable("TBL1", "TBL") {
+import org.apache.spark.sql.functions._
+val df = spark.range(1000L).select('id,
+  'id * 2 as "FLD1",
+  'id * 12 as "FLD2",
+  lit("aaa") + 'id as "fld3")
+df.write
+  .mode(SaveMode.Overwrite)
+  .bucketBy(10, "id", "FLD1", "FLD2")
+  .sortBy("id", "FLD1", "FLD2")
+  .saveAsTable("TBL")
+sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
+sql("ANALYZE TABLE TBL COMPUTE STATISTICS FOR COLUMNS ID, FLD1, 
FLD2, FLD3")
+val df2 = spark.sql(
+  """
+ SELECT t1.id, t1.fld1, t1.fld2, t1.fld3
+ FROM tbl t1
+ JOIN tbl t2 on t1.id=t2.id
+ WHERE  t1.fld3 IN (-123.23,321.23)
+  """.stripMargin)
+df2.createTempView("TBL2")
+sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe')  ").explain()
--- End diff --

Please do not use `explain()`. It will output the strings to the console. 
You can just do this:
```
sql("SELECT * FROM tbl2 WHERE fld3 IN ('qqq', 'qwe')").queryExecution.executedPlan
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21122
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2562/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21122
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21122
  
**[Test build #89681 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89681/testReport)**
 for PR 21122 at commit 
[`c62bba1`](https://github.com/apache/spark/commit/c62bba1ed024c7d1d91da8f3d8035de8dc169302).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21122
  
retest this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21122
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21122
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89680/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21122
  
**[Test build #89680 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89680/testReport)**
 for PR 21122 at commit 
[`c62bba1`](https://github.com/apache/spark/commit/c62bba1ed024c7d1d91da8f3d8035de8dc169302).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait ExternalCatalog `
  * `  // Returns the underlying catalog class (e.g., HiveExternalCatalog).`
  * `class ExternalCatalogWithListener(delegate: ExternalCatalog)`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #12154: [SPARK-12133][STREAMING] Streaming dynamic allocation

2018-04-21 Thread sugix
Github user sugix commented on the issue:

https://github.com/apache/spark/pull/12154
  
@tdas - Why can't we see this in the documentation? Also, I am not sure 
whether AWS EMR supports this feature.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21121
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21121
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89679/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21121
  
**[Test build #89679 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89679/testReport)**
 for PR 21121 at commit 
[`551d04d`](https://github.com/apache/spark/commit/551d04d672686339af3dc5a26b6669a3e996d763).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread mn-mikke
Github user mn-mikke commented on the issue:

https://github.com/apache/spark/pull/21121
  
@gatorsmile I'm not aware of any. From user experience, I strongly feel 
that such a function is missing, especially once the 
[transform](https://issues.apache.org/jira/browse/SPARK-23908) function is 
introduced.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21056
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21056
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89678/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21056
  
**[Test build #89678 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89678/testReport)**
 for PR 21056 at commit 
[`fdeac84`](https://github.com/apache/spark/commit/fdeac84f5b6fe2e25b32cbed4d1771e7c85887cc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21122
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2561/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21122
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21115: [SPARK-24033] [SQL] Fix Mismatched of Window Fram...

2018-04-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21115


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21122
  
**[Test build #89680 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89680/testReport)**
 for PR 21122 at commit 
[`c62bba1`](https://github.com/apache/spark/commit/c62bba1ed024c7d1d91da8f3d8035de8dc169302).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21115: [SPARK-24033] [SQL] Fix Mismatched of Window Frame speci...

2018-04-21 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21115
  
Thanks! Merged to master/2.3


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to b...

2018-04-21 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/21122

[SPARK-24017] [SQL] Refactor ExternalCatalog to be an interface

## What changes were proposed in this pull request?
This refactors the external catalog to be an interface, which should make 
future work on catalog federation easier. After the refactoring, 
`ExternalCatalog` is much cleaner because it no longer mixes in the listener 
event generation logic.

## How was this patch tested?
The existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark refactorExternalCatalog

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21122.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21122


commit c62bba1ed024c7d1d91da8f3d8035de8dc169302
Author: gatorsmile 
Date:   2018-04-21T17:36:20Z

fix




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-21 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21122
  
cc @rxin @cloud-fan 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21121: [SPARK-24042][SQL] Collection function: zip_with_...

2018-04-21 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21121#discussion_r183214860
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -883,3 +884,139 @@ case class Concat(children: Seq[Expression]) extends 
Expression {
 
   override def sql: String = s"concat(${children.map(_.sql).mkString(", 
")})"
 }
+
+/**
+ * Returns the maximum value in the array.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(array[, indexFirst]) - Transforms the input array by 
encapsulating elements into pairs with indexes indicating the order.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(array("d", "a", null, "b"));
+   [("d",0),("a",1),(null,2),("b",3)]
+  > SELECT _FUNC_(array("d", "a", null, "b"), true);
+   [(0,"d"),(1,"a"),(2,null),(3,"b")]
+  """,
+  since = "2.4.0")
+case class ZipWithIndex(child: Expression, indexFirst: Expression)
+  extends UnaryExpression with ExpectsInputTypes {
+
+  def this(e: Expression) = this(e, Literal.FalseLiteral)
+
+  val indexFirstValue: Boolean = indexFirst match {
+case Literal(v: Boolean, BooleanType) => v
+case _ => throw new AnalysisException("The second argument has to be a 
boolean constant.")
+  }
+
+  private val MAX_ARRAY_LENGTH: Int = 
ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(ArrayType)
+
+  lazy val childArrayType: ArrayType = 
child.dataType.asInstanceOf[ArrayType]
+
+  override def dataType: DataType = {
+val elementField = StructField("value", childArrayType.elementType, 
childArrayType.containsNull)
+val indexField = StructField("index", IntegerType, false)
+
+val fields = if (indexFirstValue) Seq(indexField, elementField) else 
Seq(elementField, indexField)
+
+ArrayType(StructType(fields), false)
+  }
+
+  override protected def nullSafeEval(input: Any): Any = {
+val array = 
input.asInstanceOf[ArrayData].toObjectArray(childArrayType.elementType)
+
+val makeStruct = (v: Any, i: Int) => if (indexFirstValue) 
InternalRow(i, v) else InternalRow(v, i)
+val resultData = array.zipWithIndex.map{case (v, i) => makeStruct(v, 
i)}
+
+new GenericArrayData(resultData)
+  }
+
+  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): 
ExprCode = {
+nullSafeCodeGen(ctx, ev, c => {
+  if (CodeGenerator.isPrimitiveType(childArrayType.elementType)) {
+genCodeForPrimitiveElements(ctx, c, ev.value)
+  } else {
+genCodeForNonPrimitiveElements(ctx, c, ev.value)
+  }
+})
+  }
+
+  private def genCodeForPrimitiveElements(
+  ctx: CodegenContext,
+  childVariableName: String,
+  arrayData: String): String = {
+val numElements = ctx.freshName("numElements")
+val byteArraySize = ctx.freshName("byteArraySize")
+val data = ctx.freshName("byteArray")
+val unsafeRow = ctx.freshName("unsafeRow")
+val structSize = ctx.freshName("structSize")
+val unsafeArrayData = ctx.freshName("unsafeArrayData")
+val structsOffset = ctx.freshName("structsOffset")
+val calculateArraySize = 
"UnsafeArrayData.calculateSizeOfUnderlyingByteArray"
+val calculateHeader = "UnsafeArrayData.calculateHeaderPortionInBytes"
+
+val baseOffset = Platform.BYTE_ARRAY_OFFSET
+val longSize = LongType.defaultSize
+val primitiveValueTypeName = 
CodeGenerator.primitiveTypeName(childArrayType.elementType)
+val valuePosition = if (indexFirstValue) "1" else "0"
+val indexPosition = if (indexFirstValue) "0" else "1"
--- End diff --

nit: How about `val (valuePosition, indexPosition) = if (indexFirstValue) ("1", "0") else ("0", "1")`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/21121
  
Which database has this function?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20959
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20959
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89677/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20959
  
**[Test build #89677 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89677/testReport)**
 for PR 20959 at commit 
[`0737bf7`](https://github.com/apache/spark/commit/0737bf7717f6b1f253c9d78013065e7147279607).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21121: [SPARK-24042][SQL] Collection function: zip_with_...

2018-04-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21121#discussion_r183214185
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 ---
@@ -883,3 +884,139 @@ case class Concat(children: Seq[Expression]) extends 
Expression {
 
   override def sql: String = s"concat(${children.map(_.sql).mkString(", 
")})"
 }
+
+/**
+ * Returns the maximum value in the array.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(array[, indexFirst]) - Transforms the input array by 
encapsulating elements into pairs with indexes indicating the order.",
+  examples = """
+Examples:
+  > SELECT _FUNC_(array("d", "a", null, "b"));
+   [("d",0),("a",1),(null,2),("b",3)]
+  > SELECT _FUNC_(array("d", "a", null, "b"), true);
+   [(0,"d"),(1,"a"),(2,null),(3,"b")]
+  """,
+  since = "2.4.0")
--- End diff --

nit: `// scalastyle:on line.size.limit`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21121: [SPARK-24042][SQL] Collection function: zip_with_...

2018-04-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21121#discussion_r183214315
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -3340,6 +3340,17 @@ object functions {
*/
   def reverse(e: Column): Column = withExpr { Reverse(e.expr) }
 
+  /**
+   * Transforms the input array by encapsulating elements into pairs
+   * with indexes indicating the order.
+   *
+   * @group collection_funcs
+   * @since 2.4.0
+   */
+  def zip_with_index(e: Column, indexFirst: Boolean = false): Column = 
withExpr {
--- End diff --

Let's avoid using a default value in APIs. It doesn't work in Java.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21121: [SPARK-24042][SQL] Collection function: zip_with_...

2018-04-21 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21121#discussion_r183214167
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -2191,6 +2191,24 @@ def reverse(col):
 return Column(sc._jvm.functions.reverse(_to_java_column(col)))
 
 
+@since(2.4)
+def zip_with_index(col, indexFirst=False):
+"""
+Collection function: transforms the input array by encapsulating 
elements into pairs
+with indexes indicating the order.
+
+:param col: name of column or expression
+
+>>> df = spark.createDataFrame([([2, 5, 3],), ([],)], ['data'])
+>>> df.select(zip_with_index(df.data).alias('r')).collect()
+[Row(r=[[value=2, index=0], [value=5, index=1], [value=3, index=2]]), 
Row(r=[])]
+>>> df.select(zip_with_index(df.data, 
indexFirst=True).alias('r')).collect()
+[Row(r=[[index=0, value=2], [index=1, value=5], [index=2, value=3]]), 
Row(r=[])]
+ """
--- End diff --

nit: there's one more leading space here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

2018-04-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20280
  
Right. Will triple check for sure but I am with you for now. Yup, something 
in the migration guide makes much more sense to me too.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21121
  
**[Test build #89679 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89679/testReport)**
 for PR 21121 at commit 
[`551d04d`](https://github.com/apache/spark/commit/551d04d672686339af3dc5a26b6669a3e996d763).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21121
  
ok to test


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21110: [SPARK-24029][core] Set SO_REUSEADDR on listen so...

2018-04-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21110


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet produc...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21052
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet produc...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21052
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89675/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21110: [SPARK-24029][core] Set SO_REUSEADDR on listen sockets.

2018-04-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/21110
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet produc...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21052
  
**[Test build #89675 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89675/testReport)**
 for PR 21052 at commit 
[`8d21488`](https://github.com/apache/spark/commit/8d2148814e52a2db1e14592c91467013565c310a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21056
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89674/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21056
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21056
  
**[Test build #89674 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89674/testReport)**
 for PR 21056 at commit 
[`f96134c`](https://github.com/apache/spark/commit/f96134c39adf643148c87f9bf7f0d5340b0219a3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21056
  
**[Test build #89678 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89678/testReport)**
 for PR 21056 at commit 
[`fdeac84`](https://github.com/apache/spark/commit/fdeac84f5b6fe2e25b32cbed4d1771e7c85887cc).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20959: [SPARK-23846][SQL] The samplingRatio option for CSV data...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20959
  
**[Test build #89677 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89677/testReport)**
 for PR 20959 at commit 
[`0737bf7`](https://github.com/apache/spark/commit/0737bf7717f6b1f253c9d78013065e7147279607).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20280: [SPARK-22232][PYTHON][SQL] Fixed Row pickling to include...

2018-04-21 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20280
  
oops, I missed this. will take a look shortly.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread mn-mikke
Github user mn-mikke commented on the issue:

https://github.com/apache/spark/pull/21121
  
cc @gatorsmile @ueshin @kiszk


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20930: [SPARK-23811][Core] FetchFailed comes before Success of ...

2018-04-21 Thread Ngone51
Github user Ngone51 commented on the issue:

https://github.com/apache/spark/pull/20930
  
Hi @xuanyuanking, sincere thanks for your patient explanation.

With regard to your latest explanation:
 
> stage 2's shuffleID is 1, but stage 3 failed by missing an output for 
shuffle '0'! So here the stage 2's skip cause stage 3 got an error shuffleId.

However, I don't think skipping stage 2 would cause stage 3 to get a wrong 
shuffleId, because we have already created all `ShuffleDependencies` (constructed 
with fixed ids) for the `ShuffleMapStages` before any stage of the job is submitted. 

After struggling to understand this issue for a while, I finally came up with 
my own inference:

(Assume the two ShuffleMapTasks below belong to stage 2, and stage 2 has 
two partitions on the map side. Stage 2 has a parent stage named stage 1 and a 
child stage named stage 3.)

1. ShuffleMapTask 0.0 runs on ExecutorB, writes its map output on ExecutorB, 
and succeeds normally.
Now there is only 1 available map output registered on 
`MapOutputTrackerMaster`.

2. ShuffleMapTask 1.0 is running on ExecutorA; it fetches data from 
ExecutorA and writes its map output on ExecutorA, too.

3. ExecutorA is lost for an unknown reason after sending a `StatusUpdate` 
message to the driver that reports ShuffleMapTask 1.0's success. All map outputs 
on ExecutorA are lost, including ShuffleMapTask 1.0's map output.

4. The driver launches a speculative ShuffleMapTask 1.1 before it receives 
the `StatusUpdate` message, and ShuffleMapTask 1.1 gets a FetchFailed immediately.

5. `DAGScheduler` handles the FetchFailed ShuffleMapTask 1.1 first and marks 
stage 2 and its parent stage 1 as failed. Stage 1 and stage 2 are now waiting 
to be resubmitted.

6. `DAGScheduler` handles the successful ShuffleMapTask 1.0 before stage 1 and 
stage 2 are resubmitted, which triggers `MapOutputTrackerMaster.registerMapOutput`. 
Now there are 2 available map outputs registered on `MapOutputTrackerMaster` 
(even though ShuffleMapTask 1.0's map output on ExecutorA has been lost).

7. Stage 1 is resubmitted and succeeds normally.

8. Stage 2 is resubmitted. Because stage 2 has 2 available map outputs 
registered on `MapOutputTrackerMaster`, it has no missing partitions, and 
therefore no missing tasks to submit. 

9. Then stage 3 is submitted. Because stage 2's map output file on 
ExecutorA is lost, stage 3 must eventually get a FetchFailed. We then resubmit 
stage 2 and stage 3, and get into a loop until stage 3 aborts.
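
The sequence in steps 1 to 9 can be sketched with a toy model in plain Python. This is only an illustration of the inference above, not Spark code: `ToyMapOutputTracker` and `readable_partitions` are hypothetical names, and the real `MapOutputTrackerMaster` and `DAGScheduler` are far more involved.

```python
class ToyMapOutputTracker:
    """Hypothetical, heavily simplified stand-in for MapOutputTrackerMaster."""

    def __init__(self, num_partitions):
        # partition index -> executor holding the registered map output
        # (None means no output is registered for that partition)
        self.outputs = [None] * num_partitions

    def register_map_output(self, partition, executor):
        self.outputs[partition] = executor

    def missing_partitions(self):
        return [p for p, e in enumerate(self.outputs) if e is None]


def readable_partitions(tracker, alive_executors):
    """Partitions whose registered output sits on an executor that is still alive."""
    return [p for p, e in enumerate(tracker.outputs) if e in alive_executors]


tracker = ToyMapOutputTracker(num_partitions=2)
alive = {"ExecutorA", "ExecutorB"}

# Step 1: ShuffleMapTask 0.0 succeeds on ExecutorB.
tracker.register_map_output(0, "ExecutorB")

# Step 3: ExecutorA is lost after reporting success for task 1.0.
alive.discard("ExecutorA")

# Step 6: the late success of task 1.0 is still registered.
tracker.register_map_output(1, "ExecutorA")

# Step 8: nothing looks missing, so the resubmitted stage 2 runs no tasks.
print(tracker.missing_partitions())         # []

# Step 9: only partition 0 can actually be fetched by stage 3.
print(readable_partitions(tracker, alive))  # [0]
```

In this model the tracker's bookkeeping and the set of outputs that can really be fetched disagree, which is the mismatch that would keep stage 3 failing.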

But if the issue were what I described above, we should get a 
`FetchFailedException` instead of the `MetadataFetchFailedException` shown in 
the screenshot. So at this point my inference does not fully make sense. 

Please feel free to point out where I am wrong.

Anyway, thanks again.



---




[GitHub] spark issue #21121: [SPARK-24042][SQL] Collection function: zip_with_index

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21121
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #21121: [SPARK-24042][SQL] Collection function: zip_with_...

2018-04-21 Thread mn-mikke
GitHub user mn-mikke opened a pull request:

https://github.com/apache/spark/pull/21121

[SPARK-24042][SQL] Collection function: zip_with_index

## What changes were proposed in this pull request?

Implement function zip_with_index(array[, indexFirst]) that transforms the 
input array by encapsulating elements into pairs with indexes indicating the 
order.

```
zip_with_index(array("d", "a", null, "b")) => 
[("d",0),("a",1),(null,2),("b",3)]
zip_with_index(array("d", "a", null, "b"), true) => 
[(0,"d"),(1,"a"),(2,null),(3,"b")]
```
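
As a rough illustration of the semantics shown in the examples above, here is a minimal plain-Python sketch. It is not the Catalyst expression from the patch; the function below and its `index_first` parameter name are assumptions made for illustration, with Python's `None` standing in for SQL `null`.

```python
def zip_with_index(array, index_first=False):
    """Pair each element with its zero-based position.

    By default the element comes first; with index_first=True the index
    comes first, mirroring the two examples in the PR description.
    """
    if index_first:
        return [(i, x) for i, x in enumerate(array)]
    return [(x, i) for i, x in enumerate(array)]


print(zip_with_index(["d", "a", None, "b"]))
# [('d', 0), ('a', 1), (None, 2), ('b', 3)]
print(zip_with_index(["d", "a", None, "b"], True))
# [(0, 'd'), (1, 'a'), (2, None), (3, 'b')]
```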

## How was this patch tested?

New tests added into:
- CollectionExpressionSuite
- DataFrameFunctionsSuite

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/AbsaOSS/spark 
feature/array-api-zip_with_index-to-master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21121.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21121


commit 9f090309b8d13e37efaf7824b6d960a6f61ca79f
Author: mn-mikke 
Date:   2018-04-18T08:00:27Z

[SPARK-24042][SQL] Collection function: zip_with_index




---




[GitHub] spark issue #20350: [SPARK-23179][SQL] Support option to throw exception if ...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20350
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20350: [SPARK-23179][SQL] Support option to throw exception if ...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20350
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89673/
Test PASSed.


---




[GitHub] spark issue #20350: [SPARK-23179][SQL] Support option to throw exception if ...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20350
  
**[Test build #89673 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89673/testReport)**
 for PR 20350 at commit 
[`aa84034`](https://github.com/apache/spark/commit/aa84034bd60413057738500564a9714dfa4b4192).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of cached ...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20997
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of cached ...

2018-04-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20997
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89676/
Test PASSed.


---




[GitHub] spark issue #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of cached ...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20997
  
**[Test build #89676 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89676/testReport)**
 for PR 20997 at commit 
[`2c45388`](https://github.com/apache/spark/commit/2c453883869921c99024c02f0a29aac395c82341).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of cached ...

2018-04-21 Thread gaborgsomogyi
Github user gaborgsomogyi commented on the issue:

https://github.com/apache/spark/pull/20997
  
In the meantime I found a small glitch in the SQL part. Namely, if a 
reattempt happens, this line

https://github.com/apache/spark/blob/1d758dc73b54e802fdc92be204185fe7414e6553/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala#L445
removes the consumer from the cache, which ends up in this log message:

```
13:27:07.556 INFO org.apache.spark.sql.kafka010.KafkaDataConsumer: Released 
a supposedly cached consumer that was not found in the cache
```

I've solved this here by removing only the closed consumer. A consumer that 
is only marked for close will be removed in `release`.
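
The difference can be sketched with a toy cache. This is only a model of the behaviour described in this comment, not the `KafkaDataConsumer` code: the dictionary-based cache, the function names `on_reattempt`, `release` and `run`, and the `remove_in_use` flag are all hypothetical.

```python
NOT_FOUND = "Released a supposedly cached consumer that was not found in the cache"


def on_reattempt(cache, key, remove_in_use):
    """Handle a task reattempt for the consumer cached under `key`."""
    consumer = cache.get(key)
    if consumer is None:
        return
    if consumer["in_use"]:
        # Still held by another task: it can only be marked for close.
        consumer["marked_for_close"] = True
        if remove_in_use:
            # Behaviour described as the glitch: evicted although still in use.
            del cache[key]
    else:
        # Idle consumer: close it and drop it from the cache right away.
        consumer["closed"] = True
        del cache[key]


def release(cache, key, consumer, log):
    """Give a consumer back; marked consumers are closed and evicted here."""
    if cache.get(key) is not consumer:
        log.append(NOT_FOUND)
        consumer["closed"] = True
        return
    if consumer["marked_for_close"]:
        consumer["closed"] = True
        del cache[key]
    else:
        consumer["in_use"] = False


def run(remove_in_use):
    cache, log = {}, []
    consumer = {"in_use": True, "marked_for_close": False, "closed": False}
    cache["topic-0"] = consumer
    on_reattempt(cache, "topic-0", remove_in_use)
    release(cache, "topic-0", consumer, log)
    return log, cache


print(run(remove_in_use=True)[0])   # one "not found in the cache" message
print(run(remove_in_use=False)[0])  # []
```

In both runs the consumer ends up closed and evicted; only the variant that evicts an in-use consumer on reattempt produces the log message.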



---




[GitHub] spark issue #20997: [SPARK-19185] [DSTREAMS] Avoid concurrent use of cached ...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20997
  
**[Test build #89676 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89676/testReport)**
 for PR 20997 at commit 
[`2c45388`](https://github.com/apache/spark/commit/2c453883869921c99024c02f0a29aac395c82341).


---




[GitHub] spark issue #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet produc...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21052
  
**[Test build #89675 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89675/testReport)**
 for PR 21052 at commit 
[`8d21488`](https://github.com/apache/spark/commit/8d2148814e52a2db1e14592c91467013565c310a).


---




[GitHub] spark issue #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet produc...

2018-04-21 Thread mshtelma
Github user mshtelma commented on the issue:

https://github.com/apache/spark/pull/21052
  
@maropu thank you for the suggestions! I have implemented them and pushed 
the changes.


---




[GitHub] spark issue #21056: [SPARK-23849][SQL] Tests for samplingRatio of json datas...

2018-04-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21056
  
**[Test build #89674 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89674/testReport)**
 for PR 21056 at commit 
[`f96134c`](https://github.com/apache/spark/commit/f96134c39adf643148c87f9bf7f0d5340b0219a3).


---




[GitHub] spark pull request #21052: [SPARK-23799][SQL] FilterEstimation.evaluateInSet...

2018-04-21 Thread mshtelma
Github user mshtelma commented on a diff in the pull request:

https://github.com/apache/spark/pull/21052#discussion_r183206908
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala ---
@@ -382,4 +382,34 @@ class StatisticsCollectionSuite extends 
StatisticsCollectionTestBase with Shared
   }
 }
   }
+
+  test("Simple queries must be working, if CBO is turned on") {
+withSQLConf(("spark.sql.cbo.enabled", "true")) {
+  withTable("TBL1", "TBL") {
+import org.apache.spark.sql.functions._
+val df = spark.range(1000L).select('id,
+  'id * 2 as "FLD1",
+  'id * 12 as "FLD2",
+  lit("aaa") + 'id as "fld3")
+df.write
+  .mode(SaveMode.Overwrite)
+  .bucketBy(10, "id", "FLD1", "FLD2")
+  .sortBy("id", "FLD1", "FLD2")
+  .saveAsTable("TBL")
+spark.sql("ANALYZE TABLE TBL COMPUTE STATISTICS ")
--- End diff --

done


---



