[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19338
  
**[Test build #82220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82220/testReport)** for PR 19338 at commit [`9a2d7ba`](https://github.com/apache/spark/commit/9a2d7ba342481824081951087aaf5281858898c0).


---




[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...

2017-09-26 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19338
  
There's one related test failure; can you please check?


---




[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19337
  
**[Test build #82219 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82219/testReport)** for PR 19337 at commit [`29f05c7`](https://github.com/apache/spark/commit/29f05c7e37fe7d6f5dcf62bce49a4820137dcc1b).


---




[GitHub] spark issue #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format f...

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19349
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82214/
Test PASSed.


---




[GitHub] spark issue #19359: [SPARK-22129][SPARK-22138] Release script improvements

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19359
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82216/
Test FAILed.


---




[GitHub] spark issue #19359: [SPARK-22129][SPARK-22138] Release script improvements

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19359
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format f...

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19349
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format f...

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19349
  
**[Test build #82214 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82214/testReport)** for PR 19349 at commit [`7f6e43f`](https://github.com/apache/spark/commit/7f6e43f7b291b0e5f60914bcbf6a1a7dee92820f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19359: [SPARK-22129][SPARK-22138] Release script improvements

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19359
  
**[Test build #82216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82216/testReport)** for PR 19359 at commit [`5537787`](https://github.com/apache/spark/commit/553778785891cc973a28dc05e57414eb759d8f66).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19338
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19338
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82217/
Test FAILed.


---




[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19338
  
**[Test build #82217 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82217/testReport)** for PR 19338 at commit [`05adc2a`](https://github.com/apache/spark/commit/05adc2adf1d87a1a3044903bdd5d54dd1e39adc7).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...

2017-09-26 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19349#discussion_r141246604
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import java.io._
+import java.net._
+import java.util.concurrent.atomic.AtomicBoolean
+
+import scala.collection.JavaConverters._
+
+import org.apache.arrow.vector.VectorSchemaRoot
+import org.apache.arrow.vector.stream.{ArrowStreamReader, ArrowStreamWriter}
+
+import org.apache.spark._
+import org.apache.spark.api.python._
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.arrow.{ArrowUtils, ArrowWriter}
+import org.apache.spark.sql.execution.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+/**
+ * Similar to `PythonUDFRunner`, but exchange data with Python worker via Arrow stream.
+ */
+class ArrowPythonRunner(
+    funcs: Seq[ChainedPythonFunctions],
+    batchSize: Int,
+    bufferSize: Int,
+    reuseWorker: Boolean,
+    evalType: Int,
+    argOffsets: Array[Array[Int]],
+    schema: StructType)
+  extends BasePythonRunner[InternalRow, ColumnarBatch](
+    funcs, bufferSize, reuseWorker, evalType, argOffsets) {
+
+  protected override def newWriterThread(
+      env: SparkEnv,
+      worker: Socket,
+      inputIterator: Iterator[InternalRow],
+      partitionIndex: Int,
+      context: TaskContext): WriterThread = {
+    new WriterThread(env, worker, inputIterator, partitionIndex, context) {
+
+      override def writeCommand(dataOut: DataOutputStream): Unit = {
+        dataOut.writeInt(funcs.length)
+        funcs.zip(argOffsets).foreach { case (chained, offsets) =>
+          dataOut.writeInt(offsets.length)
+          offsets.foreach { offset =>
+            dataOut.writeInt(offset)
+          }
+          dataOut.writeInt(chained.funcs.length)
+          chained.funcs.foreach { f =>
+            dataOut.writeInt(f.command.length)
+            dataOut.write(f.command)
+          }
+        }
+      }
+
+      override def writeIteratorToStream(dataOut: DataOutputStream): Unit = {
+        val arrowSchema = ArrowUtils.toArrowSchema(schema)
+        val allocator = ArrowUtils.rootAllocator.newChildAllocator(
+          s"stdout writer for $pythonExec", 0, Long.MaxValue)
+
+        val root = VectorSchemaRoot.create(arrowSchema, allocator)
+        val arrowWriter = ArrowWriter.create(root)
+
+        var closed = false
+
+        context.addTaskCompletionListener { _ =>
+          if (!closed) {
+            root.close()
+            allocator.close()
+          }
+        }
+
+        val writer = new ArrowStreamWriter(root, null, dataOut)
+        writer.start()
+
+        Utils.tryWithSafeFinally {
+          while (inputIterator.hasNext) {
+            var rowCount = 0
+            while (inputIterator.hasNext && (batchSize <= 0 || rowCount < batchSize)) {
+              val row = inputIterator.next()
+              arrowWriter.write(row)
+              rowCount += 1
+            }
+            arrowWriter.finish()
+            writer.writeBatch()
+            arrowWriter.reset()
+          }
+        } {
+          writer.end()
+          root.close()
+          allocator.close()
+          closed = true
+        }
+      }
+    }
+  }
+
+  protected override def newReaderIterator(
+      stream: DataInputStream,
+      writerThread: WriterThread,
+      startTime: Long,
+      env: SparkEnv,
+      worker: Socket,
+      released: AtomicBoolean,
+      context:

[GitHub] spark pull request #19346: [SAPRK-20785][WEB-UI][SQL] Spark should provide j...

2017-09-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19346#discussion_r141246595
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---
@@ -61,7 +61,36 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
       details.parentNode.querySelector('.stage-details').classList.toggle('collapsed')
     }}
-    UIUtils.headerSparkPage("SQL", content, parent, Some(5000))
+    val summary: NodeSeq =
+      {
+        if (listener.getRunningExecutions.nonEmpty) {
+          Running Queries:
+          {listener.getRunningExecutions.size}
+        }
+      }
+      {
+        if (listener.getCompletedExecutions.nonEmpty) {
--- End diff --

I think `completed-summary` is for the UI tests on the jobs and stages pages. We may not need to use it here if there is no test for it.


---




[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...

2017-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19349#discussion_r141244651
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.api.python
+
+import java.io._
+import java.net._
+import java.nio.charset.StandardCharsets
+import java.util.concurrent.atomic.AtomicBoolean
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark._
+import org.apache.spark.internal.Logging
+import org.apache.spark.util._
+
+
+/**
+ * Enumerate the type of command that will be sent to the Python worker
+ */
+private[spark] object PythonEvalType {
+  val NON_UDF = 0
+  val SQL_BATCHED_UDF = 1
+  val SQL_PANDAS_UDF = 2
+}
+
+/**
+ * A helper class to run Python mapPartition/UDFs in Spark.
+ *
+ * funcs is a list of independent Python functions, each one of them is a list of chained Python
+ * functions (from bottom to top).
+ */
+private[spark] abstract class BasePythonRunner[IN, OUT](
+    funcs: Seq[ChainedPythonFunctions],
+    bufferSize: Int,
+    reuseWorker: Boolean,
+    evalType: Int,
+    argOffsets: Array[Array[Int]])
+  extends Logging {
+
+  require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")
+
+  // All the Python functions should have the same exec, version and envvars.
+  protected val envVars = funcs.head.funcs.head.envVars
+  protected val pythonExec = funcs.head.funcs.head.pythonExec
+  protected val pythonVer = funcs.head.funcs.head.pythonVer
+
+  // TODO: support accumulator in multiple UDF
+  protected val accumulator = funcs.head.funcs.head.accumulator
+
+  def compute(
+      inputIterator: Iterator[IN],
+      partitionIndex: Int,
+      context: TaskContext): Iterator[OUT] = {
+    val startTime = System.currentTimeMillis
+    val env = SparkEnv.get
+    val localdir = env.blockManager.diskBlockManager.localDirs.map(f => f.getPath()).mkString(",")
+    envVars.put("SPARK_LOCAL_DIRS", localdir) // it's also used in monitor thread
+    if (reuseWorker) {
+      envVars.put("SPARK_REUSE_WORKER", "1")
+    }
+    val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap)
+    // Whether is the worker released into idle pool
+    val released = new AtomicBoolean(false)
+
+    // Start a thread to feed the process input from our parent's iterator
+    val writerThread = newWriterThread(env, worker, inputIterator, partitionIndex, context)
+
+    context.addTaskCompletionListener { context =>
+      writerThread.shutdownOnTaskCompletion()
+      if (!reuseWorker || !released.get) {
+        try {
+          worker.close()
+        } catch {
+          case e: Exception =>
+            logWarning("Failed to close worker socket", e)
+        }
+      }
+    }
+
+    writerThread.start()
+    new MonitorThread(env, worker, context).start()
+
+    // Return an iterator that read lines from the process's stdout
+    val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize))
+
+    val stdoutIterator = newReaderIterator(
+      stream, writerThread, startTime, env, worker, released, context)
+    new InterruptibleIterator(context, stdoutIterator)
+  }
+
+  protected def newWriterThread(
+      env: SparkEnv,
+      worker: Socket,
+      inputIterator: Iterator[IN],
+      partitionIndex: Int,
+      context: TaskContext): WriterThread
+
+  protected def newReaderIterator(
+      stream: DataInputStream,
+      writerThread: WriterThread,
+      startTime: Long,
+      env: SparkEnv,
+      worker: Socket,
+      released: AtomicBoolean,
+      context: TaskContext): Iterator[OUT]
+
 

[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19338
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82215/
Test FAILed.


---




[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19338
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19338
  
**[Test build #82215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82215/testReport)** for PR 19338 at commit [`4d906b7`](https://github.com/apache/spark/commit/4d906b772ba1d893455d5b5676a898d81f035ad5).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...

2017-09-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19349#discussion_r141243843
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import java.io._
+import java.net._
+import java.util.concurrent.atomic.AtomicBoolean
+
+import scala.collection.JavaConverters._
+
+import org.apache.arrow.vector.VectorSchemaRoot
+import org.apache.arrow.vector.stream.{ArrowStreamReader, ArrowStreamWriter}
+
+import org.apache.spark._
+import org.apache.spark.api.python._
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.arrow.{ArrowUtils, ArrowWriter}
+import org.apache.spark.sql.execution.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+/**
+ * Similar to `PythonUDFRunner`, but exchange data with Python worker via Arrow stream.
+ */
+class ArrowPythonRunner(
+    funcs: Seq[ChainedPythonFunctions],
+    batchSize: Int,
+    bufferSize: Int,
+    reuseWorker: Boolean,
+    evalType: Int,
+    argOffsets: Array[Array[Int]],
+    schema: StructType)
+  extends BasePythonRunner[InternalRow, ColumnarBatch](
+    funcs, bufferSize, reuseWorker, evalType, argOffsets) {
+
+  protected override def newWriterThread(
+      env: SparkEnv,
+      worker: Socket,
+      inputIterator: Iterator[InternalRow],
+      partitionIndex: Int,
+      context: TaskContext): WriterThread = {
+    new WriterThread(env, worker, inputIterator, partitionIndex, context) {
+
+      override def writeCommand(dataOut: DataOutputStream): Unit = {
+        dataOut.writeInt(funcs.length)
+        funcs.zip(argOffsets).foreach { case (chained, offsets) =>
+          dataOut.writeInt(offsets.length)
+          offsets.foreach { offset =>
+            dataOut.writeInt(offset)
+          }
+          dataOut.writeInt(chained.funcs.length)
+          chained.funcs.foreach { f =>
+            dataOut.writeInt(f.command.length)
+            dataOut.write(f.command)
+          }
+        }
+      }
+
+      override def writeIteratorToStream(dataOut: DataOutputStream): Unit = {
+        val arrowSchema = ArrowUtils.toArrowSchema(schema)
+        val allocator = ArrowUtils.rootAllocator.newChildAllocator(
+          s"stdout writer for $pythonExec", 0, Long.MaxValue)
+
+        val root = VectorSchemaRoot.create(arrowSchema, allocator)
+        val arrowWriter = ArrowWriter.create(root)
+
+        var closed = false
+
+        context.addTaskCompletionListener { _ =>
+          if (!closed) {
+            root.close()
+            allocator.close()
+          }
+        }
+
+        val writer = new ArrowStreamWriter(root, null, dataOut)
+        writer.start()
+
+        Utils.tryWithSafeFinally {
+          while (inputIterator.hasNext) {
+            var rowCount = 0
+            while (inputIterator.hasNext && (batchSize <= 0 || rowCount < batchSize)) {
+              val row = inputIterator.next()
+              arrowWriter.write(row)
+              rowCount += 1
+            }
+            arrowWriter.finish()
+            writer.writeBatch()
+            arrowWriter.reset()
+          }
+        } {
+          writer.end()
+          root.close()
+          allocator.close()
+          closed = true
+        }
--- End diff --

nvm. `ArrowStreamPandasSerializer` is not a `FramedSerializer`.


---




[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...

2017-09-26 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19184
  
@jerryshao Actually the second half of your comment is not valid in this case.
The PR is not targeting the merge sort in this case, but is relevant when iterating over all tuples.

`UnsafeExternalSorter` has two methods to iterate over the tuples.
You are referring to `getSortedIterator` - which uses a PriorityQueue and requires all files to be opened at the same time (so that it can return a sorted iterator).

The primary use case of this PR is `getIterator` - where we are simply iterating over all tuples, with no need to sort: it is used in `ExternalAppendOnlyUnsafeRowArray`, for example, which backs various `WindowFunctionFrame` implementations.
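
For illustration, here is a rough, self-contained sketch of the unsorted access pattern (hypothetical names, not Spark's actual `UnsafeExternalSorter` internals): spills are consumed one after another, so at most one file handle is open at any moment, and each handle is released as soon as its file is drained.

```scala
import java.io.{BufferedReader, File, FileReader}

// Hypothetical stand-in for a reader over one spill file, one record per line.
final class SpillReader(file: File) extends Iterator[String] {
  private val in = new BufferedReader(new FileReader(file))
  private var line: String = advance()
  private def advance(): String = {
    val l = in.readLine()
    if (l == null) in.close() // release the handle once this spill is drained
    l
  }
  override def hasNext: Boolean = line != null
  override def next(): String = {
    val current = line
    line = advance()
    current
  }
}

// Analogous to `getIterator`: chain the spills lazily. flatMap on an iterator
// only constructs the next SpillReader when the previous one is exhausted.
def chainedScan(spills: Seq[File]): Iterator[String] =
  spills.iterator.flatMap(f => new SpillReader(f))
```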



---




[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...

2017-09-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19349#discussion_r141242888
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import java.io._
+import java.net._
+import java.util.concurrent.atomic.AtomicBoolean
+
+import scala.collection.JavaConverters._
+
+import org.apache.arrow.vector.VectorSchemaRoot
+import org.apache.arrow.vector.stream.{ArrowStreamReader, ArrowStreamWriter}
+
+import org.apache.spark._
+import org.apache.spark.api.python._
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.arrow.{ArrowUtils, ArrowWriter}
+import org.apache.spark.sql.execution.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+/**
+ * Similar to `PythonUDFRunner`, but exchange data with Python worker via Arrow stream.
+ */
+class ArrowPythonRunner(
+    funcs: Seq[ChainedPythonFunctions],
+    batchSize: Int,
+    bufferSize: Int,
+    reuseWorker: Boolean,
+    evalType: Int,
+    argOffsets: Array[Array[Int]],
+    schema: StructType)
+  extends BasePythonRunner[InternalRow, ColumnarBatch](
+    funcs, bufferSize, reuseWorker, evalType, argOffsets) {
+
+  protected override def newWriterThread(
+      env: SparkEnv,
+      worker: Socket,
+      inputIterator: Iterator[InternalRow],
+      partitionIndex: Int,
+      context: TaskContext): WriterThread = {
+    new WriterThread(env, worker, inputIterator, partitionIndex, context) {
+
+      override def writeCommand(dataOut: DataOutputStream): Unit = {
+        dataOut.writeInt(funcs.length)
+        funcs.zip(argOffsets).foreach { case (chained, offsets) =>
+          dataOut.writeInt(offsets.length)
+          offsets.foreach { offset =>
+            dataOut.writeInt(offset)
+          }
+          dataOut.writeInt(chained.funcs.length)
+          chained.funcs.foreach { f =>
+            dataOut.writeInt(f.command.length)
+            dataOut.write(f.command)
+          }
+        }
+      }
+
+      override def writeIteratorToStream(dataOut: DataOutputStream): Unit = {
+        val arrowSchema = ArrowUtils.toArrowSchema(schema)
+        val allocator = ArrowUtils.rootAllocator.newChildAllocator(
+          s"stdout writer for $pythonExec", 0, Long.MaxValue)
+
+        val root = VectorSchemaRoot.create(arrowSchema, allocator)
+        val arrowWriter = ArrowWriter.create(root)
+
+        var closed = false
+
+        context.addTaskCompletionListener { _ =>
+          if (!closed) {
+            root.close()
+            allocator.close()
+          }
+        }
+
+        val writer = new ArrowStreamWriter(root, null, dataOut)
+        writer.start()
+
+        Utils.tryWithSafeFinally {
+          while (inputIterator.hasNext) {
+            var rowCount = 0
+            while (inputIterator.hasNext && (batchSize <= 0 || rowCount < batchSize)) {
+              val row = inputIterator.next()
+              arrowWriter.write(row)
+              rowCount += 1
+            }
+            arrowWriter.finish()
+            writer.writeBatch()
+            arrowWriter.reset()
+          }
+        } {
+          writer.end()
+          root.close()
+          allocator.close()
+          closed = true
+        }
--- End diff --

I think we need to write out `END_OF_DATA_SECTION` after all data are written out?
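
For reference, the non-Arrow runners end their data section with a framing marker; a minimal sketch of what that could look like in the finally-block above, assuming the existing `SpecialLengths` constants from `org.apache.spark.api.python` (where `END_OF_DATA_SECTION` is -1); the exact placement is illustrative:

```scala
} {
  writer.end()
  // Hedged sketch: signal end-of-data the way the other runners do, so the
  // Python side knows when to stop reading batches.
  dataOut.writeInt(SpecialLengths.END_OF_DATA_SECTION)
  dataOut.flush()
  root.close()
  allocator.close()
  closed = true
}
```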


---


[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...

2017-09-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19349#discussion_r141242553
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDFRunner.scala ---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import java.io._
+import java.net._
+import java.util.concurrent.atomic.AtomicBoolean
+
+import org.apache.spark._
+import org.apache.spark.api.python._
+
+/**
+ * A helper class to run Python UDFs in Spark.
+ */
+class PythonUDFRunner(
+    funcs: Seq[ChainedPythonFunctions],
+    bufferSize: Int,
+    reuseWorker: Boolean,
+    evalType: Int,
+    argOffsets: Array[Array[Int]])
+  extends BasePythonRunner[Array[Byte], Array[Byte]](
+    funcs, bufferSize, reuseWorker, evalType, argOffsets) {
+
+  protected override def newWriterThread(
+      env: SparkEnv,
+      worker: Socket,
+      inputIterator: Iterator[Array[Byte]],
+      partitionIndex: Int,
+      context: TaskContext): WriterThread = {
+    new WriterThread(env, worker, inputIterator, partitionIndex, context) {
+
+      override def writeCommand(dataOut: DataOutputStream): Unit = {
--- End diff --

Looks like this implementation is no different from the `writeCommand` in `ArrowPythonRunner`? If so, I think we don't need to duplicate it.
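
One way to avoid the duplication would be a shared helper that both runners call from their `writeCommand` overrides; a hypothetical sketch (illustrative names, not the PR's actual refactoring):

```scala
import java.io.DataOutputStream
import org.apache.spark.api.python.ChainedPythonFunctions

// Hypothetical shared helper serializing the chained UDFs and arg offsets.
object UDFCommandWriter {
  def writeUDFs(
      dataOut: DataOutputStream,
      funcs: Seq[ChainedPythonFunctions],
      argOffsets: Array[Array[Int]]): Unit = {
    dataOut.writeInt(funcs.length)
    funcs.zip(argOffsets).foreach { case (chained, offsets) =>
      dataOut.writeInt(offsets.length)
      offsets.foreach { offset => dataOut.writeInt(offset) }
      dataOut.writeInt(chained.funcs.length)
      chained.funcs.foreach { f =>
        dataOut.writeInt(f.command.length)
        dataOut.write(f.command)
      }
    }
  }
}
```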


---




[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...

2017-09-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19349#discussion_r141242240
  
--- Diff: python/pyspark/serializers.py ---
@@ -251,6 +256,36 @@ def __repr__(self):
         return "ArrowPandasSerializer"
 
 
+class ArrowStreamPandasSerializer(Serializer):
+    """
+    Serializes Pandas.Series as Arrow data with Arrow streaming format.
+    """
+
+    def load_stream(self, stream):
+        import pyarrow as pa
+        reader = pa.open_stream(stream)
+        for batch in reader:
+            table = pa.Table.from_batches([batch])
+            yield [c.to_pandas() for c in table.itercolumns()]
+
+    def dump_stream(self, iterator, stream):
--- End diff --

Maybe add a few comments for `dump_stream` and `load_stream`, like `ArrowPandasSerializer` has.


---




[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...

2017-09-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19349#discussion_r141240246
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala ---
@@ -0,0 +1,197 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.python
+
+import java.io._
+import java.net._
+import java.util.concurrent.atomic.AtomicBoolean
+
+import scala.collection.JavaConverters._
+
+import org.apache.arrow.vector.VectorSchemaRoot
+import org.apache.arrow.vector.stream.{ArrowStreamReader, ArrowStreamWriter}
+
+import org.apache.spark._
+import org.apache.spark.api.python._
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.execution.arrow.{ArrowUtils, ArrowWriter}
+import org.apache.spark.sql.execution.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}
+import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
+
+/**
+ * Similar to `PythonUDFRunner`, but exchange data with Python worker via Arrow stream.
+ */
+class ArrowPythonRunner(
+    funcs: Seq[ChainedPythonFunctions],
+    batchSize: Int,
+    bufferSize: Int,
+    reuseWorker: Boolean,
+    evalType: Int,
+    argOffsets: Array[Array[Int]],
+    schema: StructType)
+  extends BasePythonRunner[InternalRow, ColumnarBatch](
+    funcs, bufferSize, reuseWorker, evalType, argOffsets) {
+
+  protected override def newWriterThread(
+      env: SparkEnv,
+      worker: Socket,
+      inputIterator: Iterator[InternalRow],
+      partitionIndex: Int,
+      context: TaskContext): WriterThread = {
+    new WriterThread(env, worker, inputIterator, partitionIndex, context) {
+
+      override def writeCommand(dataOut: DataOutputStream): Unit = {
+        dataOut.writeInt(funcs.length)
+        funcs.zip(argOffsets).foreach { case (chained, offsets) =>
+          dataOut.writeInt(offsets.length)
+          offsets.foreach { offset =>
+            dataOut.writeInt(offset)
+          }
+          dataOut.writeInt(chained.funcs.length)
+          chained.funcs.foreach { f =>
+            dataOut.writeInt(f.command.length)
+            dataOut.write(f.command)
+          }
+        }
+      }
+
+      override def writeIteratorToStream(dataOut: DataOutputStream): Unit = {
+        val arrowSchema = ArrowUtils.toArrowSchema(schema)
+        val allocator = ArrowUtils.rootAllocator.newChildAllocator(
+          s"stdout writer for $pythonExec", 0, Long.MaxValue)
+
+        val root = VectorSchemaRoot.create(arrowSchema, allocator)
+        val arrowWriter = ArrowWriter.create(root)
+
+        var closed = false
+
+        context.addTaskCompletionListener { _ =>
+          if (!closed) {
+            root.close()
+            allocator.close()
+          }
+        }
+
+        val writer = new ArrowStreamWriter(root, null, dataOut)
+        writer.start()
+
+        Utils.tryWithSafeFinally {
+          while (inputIterator.hasNext) {
+            var rowCount = 0
+            while (inputIterator.hasNext && (batchSize <= 0 || rowCount < batchSize)) {
+              val row = inputIterator.next()
+              arrowWriter.write(row)
+              rowCount += 1
+            }
+            arrowWriter.finish()
+            writer.writeBatch()
+            arrowWriter.reset()
+          }
+        } {
+          writer.end()
+          root.close()
+          allocator.close()
+          closed = true
+        }
+      }
+    }
+  }
+
+  protected override def newReaderIterator(
+      stream: DataInputStream,
+      writerThread: WriterThread,
+      startTime: Long,
+      env: SparkEnv,
+      worker: Socket,
+      released: AtomicBoolean,
+      context:

[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...

2017-09-26 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19184
  
Hi @mridulm, sorry for the late response. I agree with you that the scenario is different between here and shuffle, but the underlying structure and the solution for spilling data are the same, so the problem is the same. On the shuffle side we can control the memory size to hold more data before spilling, to avoid too many spills; as you mentioned, here we cannot do that.

Yes, it is not necessary to open all the files beforehand. But since we're using a priority queue to do the merge sort, it is very likely that all the file handles get opened. So this fix only reduces the chances of hitting the too-many-open-files issue. Maybe we can call it an interim fix - what do you think?
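
To make the file-handle pressure concrete, a k-way merge over sorted spills has roughly this shape (a hypothetical sketch reusing the `SpillReader` stand-in from the earlier message, not Spark's actual merge code): every spill must keep its head record available for comparison, so every reader, and hence every file handle, stays open until its spill is fully drained.

```scala
import java.io.File
import scala.collection.mutable

// Analogous to `getSortedIterator`: all spills are opened up front and stay
// open, because the priority queue always holds one head record per spill.
def sortedMerge(spills: Seq[File]): Iterator[String] = {
  // Min-heap on the head record (PriorityQueue is a max-heap, so reverse).
  val heads = mutable.PriorityQueue.empty[(String, Iterator[String])](
    Ordering.by[(String, Iterator[String]), String](_._1).reverse)
  spills.foreach { f =>
    val it = new SpillReader(f) // every file handle is opened here
    if (it.hasNext) heads.enqueue((it.next(), it))
  }
  new Iterator[String] {
    override def hasNext: Boolean = heads.nonEmpty
    override def next(): String = {
      val (record, it) = heads.dequeue()
      if (it.hasNext) heads.enqueue((it.next(), it))
      record
    }
  }
}
```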





---




[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...

2017-09-26 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19349#discussion_r141237789
  
--- Diff: python/pyspark/serializers.py ---
@@ -211,33 +212,37 @@ def __repr__(self):
         return "ArrowSerializer"
 
 
+def _create_batch(series):
+    import pyarrow as pa
+    # Make input conform to [(series1, type1), (series2, type2), ...]
+    if not isinstance(series, (list, tuple)) or \
+            (len(series) == 2 and isinstance(series[1], pa.DataType)):
+        series = [series]
+    series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)
+
+    # If a nullable integer series has been promoted to floating point with NaNs, need to cast
+    # NOTE: this is not necessary with Arrow >= 0.7
+    def cast_series(s, t):
+        if t is None or s.dtype == t.to_pandas_dtype():
+            return s
+        else:
+            return s.fillna(0).astype(t.to_pandas_dtype(), copy=False)
+
+    arrs = [pa.Array.from_pandas(cast_series(s, t), mask=s.isnull(), type=t) for s, t in series]
+    return pa.RecordBatch.from_arrays(arrs, ["_%d" % i for i in xrange(len(arrs))])
+
+
 class ArrowPandasSerializer(ArrowSerializer):
--- End diff --

Do we need to keep `ArrowPandasSerializer`? I don't see it used anywhere other than in the pandas UDF.


---




[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19358
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19358
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82218/
Test PASSed.


---




[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19358
  
**[Test build #82218 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82218/testReport)** for PR 19358 at commit [`a327419`](https://github.com/apache/spark/commit/a3274191d0dd886dfe3044262543ff7f56bf851e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...

2017-09-26 Thread caneGuy
Github user caneGuy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19338#discussion_r141236393
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -670,9 +670,12 @@ private[spark] class TaskSetManager(
       }
       if (blacklistedEverywhere) {
         val partition = tasks(indexInTaskSet).partitionId
-        abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
-          s"cannot run anywhere due to node and executor blacklist.  Blacklisting behavior " +
-          s"can be configured via spark.blacklist.*.")
+        abort(s"""
+          |Aborting $taskSet because task $indexInTaskSet (partition $partition)
+          |cannot run anywhere due to node and executor blacklist.
+          |Most recent failure:
+          |${taskSetBlacklist.getLatestFailureReason}\n
--- End diff --

It is my fault, the Scala version affected it. My default Scala is 2.10.4, and below is the result I tested with Scala 2.11.2:
```
scala> val s="""ss\nss
 | sss\n"""
s: String =
ss\nss
sss\n

scala> print(s)
ss\nss
sss\n
```


---




[GitHub] spark pull request #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in ...

2017-09-26 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/17383


---




[GitHub] spark issue #19327: [SPARK-22136][SS] Implement stream-stream outer joins.

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19327
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82212/
Test PASSed.


---




[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data

2017-09-26 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17383
  
Hi, since the work has been sitting for a long time, I took a review of it myself.

After careful review: as SparseVector uses a compressed sparse row format, the only benefit of this PR would be data storage, at the cost of performance. And for tree methods it is uncommon to handle super-high-dimensional features. Hence, it does not satisfy me.

I prefer [SPARK-3717: DecisionTree, RandomForest: Partition by feature](https://issues.apache.org/jira/browse/SPARK-3717) as an alternative, which will benefit both performance and storage if I understand correctly. So this PR is closed. Thanks everyone for the reviews and comments.


---




[GitHub] spark issue #19327: [SPARK-22136][SS] Implement stream-stream outer joins.

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19327
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19327: [SPARK-22136][SS] Implement stream-stream outer joins.

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19327
  
**[Test build #82212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82212/testReport)** for PR 19327 at commit [`680c51a`](https://github.com/apache/spark/commit/680c51a9cedd1370dec9d249c2eba8cdfc61fbd8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...

2017-09-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19338#discussion_r141235801
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -670,9 +670,12 @@ private[spark] class TaskSetManager(
       }
       if (blacklistedEverywhere) {
         val partition = tasks(indexInTaskSet).partitionId
-        abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
-          s"cannot run anywhere due to node and executor blacklist.  Blacklisting behavior " +
-          s"can be configured via spark.blacklist.*.")
+        abort(s"""
+          |Aborting $taskSet because task $indexInTaskSet (partition $partition)
+          |cannot run anywhere due to node and executor blacklist.
+          |Most recent failure:
+          |${taskSetBlacklist.getLatestFailureReason}\n
--- End diff --

NVM, it looks like the result is different depending on whether string interpolation is used.


---




[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...

2017-09-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19338#discussion_r141235459
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -670,9 +670,12 @@ private[spark] class TaskSetManager(
       }
       if (blacklistedEverywhere) {
         val partition = tasks(indexInTaskSet).partitionId
-        abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
-          s"cannot run anywhere due to node and executor blacklist.  Blacklisting behavior " +
-          s"can be configured via spark.blacklist.*.")
+        abort(s"""
+          |Aborting $taskSet because task $indexInTaskSet (partition $partition)
+          |cannot run anywhere due to node and executor blacklist.
+          |Most recent failure:
+          |${taskSetBlacklist.getLatestFailureReason}\n
--- End diff --

Are you sure?

```scala
scala> val s = """fdsafdsa\n\nfdsafdsa"""
s: String = fdsafdsa\n\nfdsafdsa
```



---




[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19358
  
**[Test build #82218 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82218/testReport)** for PR 19358 at commit [`a327419`](https://github.com/apache/spark/commit/a3274191d0dd886dfe3044262543ff7f56bf851e).


---




[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...

2017-09-26 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19358
  
Would you please add a [MESOS] tag to your PR title, like other PRs do? Thanks.


---




[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...

2017-09-26 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19358
  
ok to test.


---




[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...

2017-09-26 Thread caneGuy
Github user caneGuy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19338#discussion_r141235079
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -670,9 +670,12 @@ private[spark] class TaskSetManager(
       }
       if (blacklistedEverywhere) {
         val partition = tasks(indexInTaskSet).partitionId
-        abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
-          s"cannot run anywhere due to node and executor blacklist.  Blacklisting behavior " +
-          s"can be configured via spark.blacklist.*.")
+        abort(s"""
+          |Aborting $taskSet because task $indexInTaskSet (partition $partition)
+          |cannot run anywhere due to node and executor blacklist.
+          |Most recent failure:
+          |${taskSetBlacklist.getLatestFailureReason}\n
--- End diff --

Actually I tested locally with something like below:
```
scala> val s = s"""
     | sss\n
     | sss"""
scala> print(s)

sss

sss
```
And I found that there is no need to add `\n`, since the triple-quoted form is designed to avoid such characters. Sorry for that; I had updated before your newest comment. Thanks @jerryshao
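
For reference, the behavior this thread converged on fits in one illustrative snippet (not part of the PR): the `s` interpolator processes `\n` escapes even inside triple quotes, while a plain triple-quoted string keeps them literal, and the `|` margins used in the new `abort` message only disappear after `.stripMargin`.

```scala
object StripMarginDemo extends App {
  val plain = """a\nb"""          // plain triple quotes: literal backslash-n
  val interpolated = s"""a\nb"""  // s-interpolator: \n becomes a real newline
  val message = s"""
    |line one
    |line two""".stripMargin      // | margins are stripped only by stripMargin
  println(plain)        // prints: a\nb
  println(interpolated) // prints: a, then b on the next line
  println(message)      // prints a blank line, then line one / line two
}
```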


---




[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19338
  
**[Test build #82217 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82217/testReport)** for PR 19338 at commit [`05adc2a`](https://github.com/apache/spark/commit/05adc2adf1d87a1a3044903bdd5d54dd1e39adc7).


---




[GitHub] spark issue #19359: [SPARK-22129][SPARK-22138] Release script improvements

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19359
  
**[Test build #82216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82216/testReport)** for PR 19359 at commit [`5537787`](https://github.com/apache/spark/commit/553778785891cc973a28dc05e57414eb759d8f66).


---




[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType

2017-09-26 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19330
  
It seems the failed SparkR unit test is not related.
In the current change, I added a `trait OrderSpecified`; expressions that use the ordering of a data type (`BinaryComparison`, `Max`, `Min`, `SortArray`, `SortOrder`) extend `OrderSpecified`. During analysis, expressions with a map-type output are wrapped in `OrderMaps`, and thus changed to an ordered map.
Expressions with map-type output in `Aggregate`, `Distinct` and `SetOperation` are also wrapped in `OrderMaps`.
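
A self-contained toy model of that mechanism (illustrative names only; the real `OrderSpecified` and `OrderMaps` live in Catalyst and this PR):

```scala
// Minimal expression tree with a marker for order-dependent expressions.
sealed trait Expr { def children: Seq[Expr]; def withChildren(cs: Seq[Expr]): Expr }
trait OrderSpecified extends Expr // relies on an ordering of its inputs

case class MapLiteral(entries: Map[String, Int]) extends Expr {
  def children: Seq[Expr] = Nil
  def withChildren(cs: Seq[Expr]): Expr = this
}
// Wrapper fixing a canonical (e.g. key-sorted) entry order for a map value.
case class OrderMaps(child: Expr) extends Expr {
  def children: Seq[Expr] = Seq(child)
  def withChildren(cs: Seq[Expr]): Expr = OrderMaps(cs.head)
}
case class GreaterThan(left: Expr, right: Expr) extends Expr with OrderSpecified {
  def children: Seq[Expr] = Seq(left, right)
  def withChildren(cs: Seq[Expr]): Expr = GreaterThan(cs(0), cs(1))
}

// Analysis step: wrap map-typed children of order-dependent expressions.
def orderMaps(e: Expr): Expr = {
  val rewritten = e.withChildren(e.children.map(orderMaps))
  rewritten match {
    case o: OrderSpecified => o.withChildren(o.children.map {
      case m: MapLiteral => OrderMaps(m)
      case c => c
    })
    case other => other
  }
}
```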


---




[GitHub] spark pull request #19359: [SPARK-22129][SPARK-22138] Release script improve...

2017-09-26 Thread holdenk
GitHub user holdenk opened a pull request:

https://github.com/apache/spark/pull/19359

[SPARK-22129][SPARK-22138] Release script improvements

## What changes were proposed in this pull request?

Use the GPG_KEY param, fix lsof to non-hardcoded path, remove version swap 
since it wasn't really needed. Use EXPORT on JAVA_HOME for downstream scripts 
as well.

## How was this patch tested?

Rolled 2.1.2 RC2

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/holdenk/spark SPARK-22129-fix-signing

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19359.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19359


commit 553778785891cc973a28dc05e57414eb759d8f66
Author: Holden Karau 
Date:   2017-09-27T02:41:00Z

use GPG_KEY to control the GPG key. Add retry. Fix lsof to not be hard coded and just use path version. Remove version swapping on publish-release since we only need it some of the time and it was confusing.




---




[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...

2017-09-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19338#discussion_r141234485
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -670,9 +670,12 @@ private[spark] class TaskSetManager(
       }
       if (blacklistedEverywhere) {
         val partition = tasks(indexInTaskSet).partitionId
-        abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
-          s"cannot run anywhere due to node and executor blacklist.  Blacklisting behavior " +
-          s"can be configured via spark.blacklist.*.")
+        abort(s"""
+          |Aborting $taskSet because task $indexInTaskSet (partition $partition)
+          |cannot run anywhere due to node and executor blacklist.
+          |Most recent failure:
+          |${taskSetBlacklist.getLatestFailureReason}\n
--- End diff --

You don't have to add "\n", since you already use the triple-quoted format. Please try to understand the basics of this API before modifying it.

Also verify locally before pushing a new commit, and update the PR description to reflect your new format accordingly.


---




[GitHub] spark issue #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as input ...

2017-09-26 Thread goldmedal
Github user goldmedal commented on the issue:

https://github.com/apache/spark/pull/19339
  
@HyukjinKwon @viirya Thanks for your reviews.


---




[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree

2017-09-26 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/17503
  
Hi, @WeichenXu123.

As @srowen said, the benefit of this would be speed at predict time or model storage. Hence I'm not sure whether a benchmark is really needed for this PR.


---




[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19338
  
**[Test build #82215 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82215/testReport)** for PR 19338 at commit [`4d906b7`](https://github.com/apache/spark/commit/4d906b772ba1d893455d5b5676a898d81f035ad5).


---




[GitHub] spark pull request #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as...

2017-09-26 Thread goldmedal
Github user goldmedal commented on a diff in the pull request:

https://github.com/apache/spark/pull/19339#discussion_r141232322
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -420,7 +425,29 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             columnNameOfCorruptRecord=columnNameOfCorruptRecord, multiLine=multiLine)
         if isinstance(path, basestring):
             path = [path]
-        return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
+        if type(path) == list:
+            return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
+        elif isinstance(path, RDD):
+            def func(iterator):
+                for x in iterator:
+                    if not isinstance(x, basestring):
+                        x = unicode(x)
+                    if isinstance(x, unicode):
+                        x = x.encode("utf-8")
+                    yield x
+            keyed = path.mapPartitions(func)
+            keyed._bypass_serializer = True
+            jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
+            # see SPARK-22112
+            # There aren't any jvm api for creating a dataframe from rdd storing csv.
--- End diff --

Ok thanks


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as...

2017-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19339#discussion_r141231147
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -420,7 +425,29 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             columnNameOfCorruptRecord=columnNameOfCorruptRecord, multiLine=multiLine)
         if isinstance(path, basestring):
             path = [path]
-        return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
+        if type(path) == list:
+            return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
+        elif isinstance(path, RDD):
+            def func(iterator):
+                for x in iterator:
+                    if not isinstance(x, basestring):
+                        x = unicode(x)
+                    if isinstance(x, unicode):
+                        x = x.encode("utf-8")
+                    yield x
+            keyed = path.mapPartitions(func)
+            keyed._bypass_serializer = True
+            jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
+            # see SPARK-22112
+            # There aren't any jvm api for creating a dataframe from rdd storing csv.
--- End diff --

Let's fix these comments like,

```
SPARK-22112: There aren't any jvm api for creating a dataframe from rdd storing csv.
...
```

or

```
There aren't any jvm api ...
...
for creating a dataframe from dataset storing csv. See SPARK-22112.
```

when we happen to fix some code around here, or when reviewing other PRs that touch this code in the future.


---




[GitHub] spark issue #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format f...

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19349
  
**[Test build #82214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82214/testReport)** for PR 19349 at commit [`7f6e43f`](https://github.com/apache/spark/commit/7f6e43f7b291b0e5f60914bcbf6a1a7dee92820f).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as input ...

2017-09-26 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19339
  
I've tested a few times locally and can't reproduce the same failure.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19339


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as input ...

2017-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19339
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as input ...

2017-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19339
  
Thanks @goldmedal.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...

2017-09-26 Thread caneGuy
Github user caneGuy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19338#discussion_r141229038
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -671,8 +671,10 @@ private[spark] class TaskSetManager(
           if (blacklistedEverywhere) {
             val partition = tasks(indexInTaskSet).partitionId
             abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
-              s"cannot run anywhere due to node and executor blacklist.  Blacklisting behavior " +
-              s"can be configured via spark.blacklist.*.")
+              s"cannot run anywhere due to node and executor blacklist.\n" +
--- End diff --

Nice, @jerryshao. I will update as soon as possible.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...

2017-09-26 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19229
  
ping @gatorsmile for the SQL part.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19168: [SPARK-21956][CORE] Fetch up to max bytes when bu...

2017-09-26 Thread caneGuy
Github user caneGuy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19168#discussion_r141228856
  
--- Diff: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala ---
@@ -439,6 +443,8 @@ final class ShuffleBlockFetcherIterator(
               fetchRequests += FetchRequest(address, Array((blockId, size)))
               result = null
             }
+            bytesInFlight -= size
+            fetchUpToMaxBytes()
--- End diff --

Thanks @cloud-fan. IIUC, when a `FetchFailedException` is thrown there is no need to consider this, which is also the original design.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as input ...

2017-09-26 Thread goldmedal
Github user goldmedal commented on the issue:

https://github.com/apache/spark/pull/19339
  
@HyukjinKwon I have updated the title. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...

2017-09-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19338#discussion_r141226978
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -671,8 +671,10 @@ private[spark] class TaskSetManager(
           if (blacklistedEverywhere) {
             val partition = tasks(indexInTaskSet).partitionId
             abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
-              s"cannot run anywhere due to node and executor blacklist.  Blacklisting behavior " +
-              s"can be configured via spark.blacklist.*.")
+              s"cannot run anywhere due to node and executor blacklist.\n" +
--- End diff --

Can we change this to Scala's triple-quoted string interpolation? You can refer to [here](https://github.com/apache/spark/blob/ceaec93839d18a20e0cd78b70f3ea71872dce0a4/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala#L211).
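For reference, a minimal runnable sketch of what the suggested triple-quoted interpolation could look like; the variable values and exact message wording here are illustrative placeholders, not the PR's actual code:

```scala
object TripleQuoteDemo {
  def main(args: Array[String]): Unit = {
    // Illustrative stand-ins for the values available inside TaskSetManager.
    val taskSet = "TaskSet 1.0"
    val indexInTaskSet = 3
    val partition = 3
    val latestFailureReason = "ExecutorLostFailure (executor 1 lost)"

    // stripMargin drops everything up to and including the leading '|',
    // so the message keeps its line breaks without chaining "\n" + s"..." parts.
    val message =
      s"""|Aborting $taskSet because task $indexInTaskSet (partition $partition)
          |cannot run anywhere due to node and executor blacklist.
          |Most recent failure:
          |$latestFailureReason
          |Blacklisting behavior can be configured via spark.blacklist.*.""".stripMargin
    println(message)
  }
}
```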


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...

2017-09-26 Thread caneGuy
Github user caneGuy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19338#discussion_r141226765
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetBlacklist.scala ---
@@ -61,6 +61,16 @@ private[scheduler] class TaskSetBlacklist(val conf: SparkConf, val stageId: Int,
   private val blacklistedExecs = new HashSet[String]()
   private val blacklistedNodes = new HashSet[String]()
 
+  private var latestFailureReason: String = null
+
+  /**
+   * Get the most recent failure reason of this TaskSet.
+   * @return
+   */
+  def getLatestFailureReason: String = {
--- End diff --

Thanks @jerryshao @squito. Could you help trigger another Jenkins test? The last one had a PySpark failure.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...

2017-09-26 Thread caneGuy
Github user caneGuy commented on the issue:

https://github.com/apache/spark/pull/19338
  
Thanks @squito, I have updated the description.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...

2017-09-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/19338#discussion_r141226456
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetBlacklist.scala ---
@@ -61,6 +61,16 @@ private[scheduler] class TaskSetBlacklist(val conf: SparkConf, val stageId: Int,
   private val blacklistedExecs = new HashSet[String]()
   private val blacklistedNodes = new HashSet[String]()
 
+  private var latestFailureReason: String = null
+
+  /**
+   * Get the most recent failure reason of this TaskSet.
+   * @return
+   */
+  def getLatestFailureReason: String = {
--- End diff --

@squito yes, from a scope level it is fine. My thought is that this exposes the class member to other classes unnecessarily. Yeah, it is not a big deal, just my personal preference.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-26 Thread fjh100456
Github user fjh100456 commented on the issue:

https://github.com/apache/spark/pull/19218
  
@dongjoon-hyun 
Thank you very much, I'll fix them as soon as possible.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19351: [SPARK-22127][CORE]The Master Register Applicatio...

2017-09-26 Thread guoxiaolongzte
Github user guoxiaolongzte commented on a diff in the pull request:

https://github.com/apache/spark/pull/19351#discussion_r141224687
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -265,6 +265,9 @@ private[deploy] class Master(
         val app = createApplication(description, driver)
         registerApplication(app)
         logInfo("Registered app " + description.name + " with ID " + app.id)
+        if(app.state == ApplicationState.WAITING) {
--- End diff --

It is a warning. Because no resources have been allocated, the Spark application may keep waiting indefinitely, which would be a problem.

Now I have fixed it as shown below. Do you think this is correct?

    if (app.state == ApplicationState.WAITING) {
      logWarning("App needs to wait for a worker to launch executors; " +
        "maybe no worker has extra resources to allocate")
    }


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19346: [SAPRK-20785][WEB-UI][SQL] Spark should provide jump lin...

2017-09-26 Thread guoxiaolongzte
Github user guoxiaolongzte commented on the issue:

https://github.com/apache/spark/pull/19346
  
@gatorsmile Could you help review the code?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19358
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19327: [SPARK-22136][SS] Implement stream-stream outer joins.

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19327
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82211/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19358: [SPARK-22135] metrics in spark-dispatcher not bei...

2017-09-26 Thread pmackles
GitHub user pmackles opened a pull request:

https://github.com/apache/spark/pull/19358

[SPARK-22135] metrics in spark-dispatcher not being registered properly

## What changes were proposed in this pull request?

Fix a trivial bug in how metrics are registered in the Mesos dispatcher. The bug resulted in creating a new registry each time the metricRegistry() method was called.
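For illustration, a minimal sketch of the failure mode described above and the shape of the fix; the names below are hypothetical, not the actual dispatcher code:

```scala
import com.codahale.metrics.MetricRegistry

object RegistryBugSketch {
  // Buggy shape: every call builds a fresh registry, so anything
  // registered through one call is invisible to later callers/reporters.
  def metricRegistryBuggy(): MetricRegistry = new MetricRegistry()

  // Fixed shape: a single registry shared for the component's lifetime.
  val metricRegistry = new MetricRegistry()

  def main(args: Array[String]): Unit = {
    metricRegistryBuggy().counter("requests").inc()
    println(metricRegistryBuggy().counter("requests").getCount) // 0 -- the increment was lost

    metricRegistry.counter("requests").inc()
    println(metricRegistry.counter("requests").getCount) // 1 -- retained
  }
}
```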

## How was this patch tested?

Verified manually on a local Mesos setup.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pmackles/spark SPARK-22135

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19358.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19358


commit a3274191d0dd886dfe3044262543ff7f56bf851e
Author: Paul Mackles 
Date:   2017-09-27T00:47:21Z

[SPARK-22135] metrics in spark-dispatcher not being registered properly




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19327: [SPARK-22136][SS] Implement stream-stream outer joins.

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19327
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19327: [SPARK-22136][SS] Implement stream-stream outer joins.

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19327
  
**[Test build #82211 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82211/testReport)**
 for PR 19327 at commit 
[`a8aa356`](https://github.com/apache/spark/commit/a8aa35691d7603c0ca283dc2aff295031bcc3b9e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19357
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82213/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19357
  
**[Test build #82213 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82213/testReport)**
 for PR 19357 at commit 
[`46af54d`](https://github.com/apache/spark/commit/46af54d9c86fa1e5322fdd92ed47fe3d419dd966).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class NumericEquiHeightHgm(`
  * `case class StringEquiHeightHgm(`
  * `case class NumericEquiHeightBin(lowerBound: Double, upperBound: 
Double, binNdv: Long)`
  * `case class StringEquiHeightBin(upperBound: String, binNdv: Long)`
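
For context, a rough sketch of how equi-height bins like the ones listed above can drive filter selectivity estimates; the estimation logic is an illustration of the general technique, not the patch's actual implementation:

```scala
// Equi-height histogram: each bin covers roughly the same number of rows,
// and binNdv is the number of distinct values falling into the bin.
case class NumericEquiHeightBin(lowerBound: Double, upperBound: Double, binNdv: Long)

object HistogramSketch {
  // Selectivity of `col <= value`: count fully covered bins plus a linear
  // interpolation inside the partially covered bin, divided by the bin count.
  def lessThanOrEqualSelectivity(bins: Seq[NumericEquiHeightBin], value: Double): Double = {
    val covered = bins.map { b =>
      if (value >= b.upperBound) 1.0
      else if (value < b.lowerBound) 0.0
      else (value - b.lowerBound) / (b.upperBound - b.lowerBound)
    }.sum
    covered / bins.length
  }

  def main(args: Array[String]): Unit = {
    val bins = Seq(
      NumericEquiHeightBin(0, 10, 8),
      NumericEquiHeightBin(10, 50, 5),
      NumericEquiHeightBin(50, 100, 20))
    println(lessThanOrEqualSelectivity(bins, 30)) // ~0.5
  }
}
```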


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19357
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19357
  
**[Test build #82213 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82213/testReport)**
 for PR 19357 at commit 
[`46af54d`](https://github.com/apache/spark/commit/46af54d9c86fa1e5322fdd92ed47fe3d419dd966).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19339: [SPARK-22112][PYSPARK] Add an API to create a DataFrame ...

2017-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19339
  
@goldmedal, are you online now? How about fixing the PR title to say something like "Supports RDD of strings as input in spark.read.csv in PySpark"?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...

2017-09-26 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19357
  
Jenkins, add to whitelist


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...

2017-09-26 Thread ron8hu
Github user ron8hu commented on the issue:

https://github.com/apache/spark/pull/19357
  
cc @wzhfy Please review the code first before I request the community to review it. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19335: mapPartitions Api

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19335


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19348: [BUILD] Close stale PRs

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19348


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19347: Branch 2.2 sparkmlib's output of many algorithms ...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19347


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19244: SPARK-22021

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19244


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18474: [SPARK-21235][TESTS] UTest should clear temp resu...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18474


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18253: [SPARK-18838][CORE] Introduce multiple queues in ...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18253


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18978: [SPARK-21737][YARN]Create communication channel b...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18978


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18897: [SPARK-21655][YARN] Support Kill CLI for Yarn mod...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18897


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19334: Branch 1.6

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19334


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15009: [SPARK-17443][SPARK-11035] Stop Spark Application...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15009


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19315: [MINOR][ML]Updated english.txt word ordering

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19315


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19295: [SPARK-22080][SQL] Adds support for allowing user...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19295


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19356: Merge pull request #1 from apache/master

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19356


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19300: [SPARK-22082][SparkR]Spelling mistake: "choosen" ...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19300


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19152: [SPARK-21915][ML][PySpark] Model 1 and Model 2 Pa...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19152


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19238: [SPARK-22016][SQL] Add HiveDialect for JDBC conne...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19238


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13794: [SPARK-15574][ML][PySpark] Python meta-algorithms...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13794


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19236: SPARK-22015: Remove usage of non-used private fie...

2017-09-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19236


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19357: support histogram in filter cardinality estimation

2017-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19357
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19327: [WIP] Implement stream-stream outer joins.

2017-09-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19327
  
**[Test build #82212 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82212/testReport)**
 for PR 19327 at commit 
[`680c51a`](https://github.com/apache/spark/commit/680c51a9cedd1370dec9d249c2eba8cdfc61fbd8).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19348: [BUILD] Close stale PRs

2017-09-26 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19348
  
Merged to master.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


