[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19338 **[Test build #82220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82220/testReport)** for PR 19338 at commit [`9a2d7ba`](https://github.com/apache/spark/commit/9a2d7ba342481824081951087aaf5281858898c0).
[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19338 There's one related test failure; can you please check it?
[GitHub] spark issue #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19337 **[Test build #82219 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82219/testReport)** for PR 19337 at commit [`29f05c7`](https://github.com/apache/spark/commit/29f05c7e37fe7d6f5dcf62bce49a4820137dcc1b).
[GitHub] spark issue #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19349 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82214/
[GitHub] spark issue #19359: [SPARK-22129][SPARK-22138] Release script improvements
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19359 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82216/
[GitHub] spark issue #19359: [SPARK-22129][SPARK-22138] Release script improvements
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19359 Merged build finished. Test FAILed.
[GitHub] spark issue #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19349 Merged build finished. Test PASSed.
[GitHub] spark issue #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19349 **[Test build #82214 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82214/testReport)** for PR 19349 at commit [`7f6e43f`](https://github.com/apache/spark/commit/7f6e43f7b291b0e5f60914bcbf6a1a7dee92820f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19359: [SPARK-22129][SPARK-22138] Release script improvements
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19359 **[Test build #82216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82216/testReport)** for PR 19359 at commit [`5537787`](https://github.com/apache/spark/commit/553778785891cc973a28dc05e57414eb759d8f66).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19338 Merged build finished. Test FAILed.
[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19338 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82217/
[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19338 **[Test build #82217 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82217/testReport)** for PR 19338 at commit [`05adc2a`](https://github.com/apache/spark/commit/05adc2adf1d87a1a3044903bdd5d54dd1e39adc7).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19349#discussion_r141246604 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala --- @@ -0,0 +1,197 @@ (new file; the Apache license header is omitted here)

```scala
package org.apache.spark.sql.execution.python

import java.io._
import java.net._
import java.util.concurrent.atomic.AtomicBoolean

import scala.collection.JavaConverters._

import org.apache.arrow.vector.VectorSchemaRoot
import org.apache.arrow.vector.stream.{ArrowStreamReader, ArrowStreamWriter}

import org.apache.spark._
import org.apache.spark.api.python._
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.arrow.{ArrowUtils, ArrowWriter}
import org.apache.spark.sql.execution.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}
import org.apache.spark.sql.types._
import org.apache.spark.util.Utils

/**
 * Similar to `PythonUDFRunner`, but exchange data with Python worker via Arrow stream.
 */
class ArrowPythonRunner(
    funcs: Seq[ChainedPythonFunctions],
    batchSize: Int,
    bufferSize: Int,
    reuseWorker: Boolean,
    evalType: Int,
    argOffsets: Array[Array[Int]],
    schema: StructType)
  extends BasePythonRunner[InternalRow, ColumnarBatch](
    funcs, bufferSize, reuseWorker, evalType, argOffsets) {

  protected override def newWriterThread(
      env: SparkEnv,
      worker: Socket,
      inputIterator: Iterator[InternalRow],
      partitionIndex: Int,
      context: TaskContext): WriterThread = {
    new WriterThread(env, worker, inputIterator, partitionIndex, context) {

      override def writeCommand(dataOut: DataOutputStream): Unit = {
        dataOut.writeInt(funcs.length)
        funcs.zip(argOffsets).foreach { case (chained, offsets) =>
          dataOut.writeInt(offsets.length)
          offsets.foreach { offset =>
            dataOut.writeInt(offset)
          }
          dataOut.writeInt(chained.funcs.length)
          chained.funcs.foreach { f =>
            dataOut.writeInt(f.command.length)
            dataOut.write(f.command)
          }
        }
      }

      override def writeIteratorToStream(dataOut: DataOutputStream): Unit = {
        val arrowSchema = ArrowUtils.toArrowSchema(schema)
        val allocator = ArrowUtils.rootAllocator.newChildAllocator(
          s"stdout writer for $pythonExec", 0, Long.MaxValue)

        val root = VectorSchemaRoot.create(arrowSchema, allocator)
        val arrowWriter = ArrowWriter.create(root)

        var closed = false

        context.addTaskCompletionListener { _ =>
          if (!closed) {
            root.close()
            allocator.close()
          }
        }

        val writer = new ArrowStreamWriter(root, null, dataOut)
        writer.start()

        Utils.tryWithSafeFinally {
          while (inputIterator.hasNext) {
            var rowCount = 0
            while (inputIterator.hasNext && (batchSize <= 0 || rowCount < batchSize)) {
              val row = inputIterator.next()
              arrowWriter.write(row)
              rowCount += 1
            }
            arrowWriter.finish()
            writer.writeBatch()
            arrowWriter.reset()
          }
        } {
          writer.end()
          root.close()
          allocator.close()
          closed = true
        }
      }
    }
  }

  protected override def newReaderIterator(
      stream: DataInputStream,
      writerThread: WriterThread,
      startTime: Long,
      env: SparkEnv,
      worker: Socket,
      released: AtomicBoolean,
      context: ...
```

(the quoted diff and the review comment are cut off in the archive)
[GitHub] spark pull request #19346: [SPARK-20785][WEB-UI][SQL] Spark should provide j...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19346#discussion_r141246595 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ui/AllExecutionsPage.scala ---

```scala
@@ -61,7 +61,36 @@ private[ui] class AllExecutionsPage(parent: SQLTab) extends WebUIPage("") with L
         details.parentNode.querySelector('.stage-details').classList.toggle('collapsed')
       }}
-    UIUtils.headerSparkPage("SQL", content, parent, Some(5000))
+    val summary: NodeSeq =
+      {
+        if (listener.getRunningExecutions.nonEmpty) {
+          Running Queries:
+          {listener.getRunningExecutions.size}
+        }
+      }
+      {
+        if (listener.getCompletedExecutions.nonEmpty) {
```

(the XML literal tags in the quoted diff were stripped by the mail archive) --- End diff -- I think `completed-summary` is used by the UI tests on the jobs and stages pages. We may not need it here if there is no test for it.
[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19349#discussion_r141244651 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala --- @@ -0,0 +1,429 @@ (new file; the Apache license header is omitted here)

```scala
package org.apache.spark.api.python

import java.io._
import java.net._
import java.nio.charset.StandardCharsets
import java.util.concurrent.atomic.AtomicBoolean

import scala.collection.JavaConverters._

import org.apache.spark._
import org.apache.spark.internal.Logging
import org.apache.spark.util._


/**
 * Enumerate the type of command that will be sent to the Python worker
 */
private[spark] object PythonEvalType {
  val NON_UDF = 0
  val SQL_BATCHED_UDF = 1
  val SQL_PANDAS_UDF = 2
}

/**
 * A helper class to run Python mapPartition/UDFs in Spark.
 *
 * funcs is a list of independent Python functions, each one of them is a list of chained Python
 * functions (from bottom to top).
 */
private[spark] abstract class BasePythonRunner[IN, OUT](
    funcs: Seq[ChainedPythonFunctions],
    bufferSize: Int,
    reuseWorker: Boolean,
    evalType: Int,
    argOffsets: Array[Array[Int]])
  extends Logging {

  require(funcs.length == argOffsets.length, "argOffsets should have the same length as funcs")

  // All the Python functions should have the same exec, version and envvars.
  protected val envVars = funcs.head.funcs.head.envVars
  protected val pythonExec = funcs.head.funcs.head.pythonExec
  protected val pythonVer = funcs.head.funcs.head.pythonVer

  // TODO: support accumulator in multiple UDF
  protected val accumulator = funcs.head.funcs.head.accumulator

  def compute(
      inputIterator: Iterator[IN],
      partitionIndex: Int,
      context: TaskContext): Iterator[OUT] = {
    val startTime = System.currentTimeMillis
    val env = SparkEnv.get
    val localdir = env.blockManager.diskBlockManager.localDirs.map(f => f.getPath()).mkString(",")
    envVars.put("SPARK_LOCAL_DIRS", localdir) // it's also used in monitor thread
    if (reuseWorker) {
      envVars.put("SPARK_REUSE_WORKER", "1")
    }
    val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap)
    // Whether is the worker released into idle pool
    val released = new AtomicBoolean(false)

    // Start a thread to feed the process input from our parent's iterator
    val writerThread = newWriterThread(env, worker, inputIterator, partitionIndex, context)

    context.addTaskCompletionListener { context =>
      writerThread.shutdownOnTaskCompletion()
      if (!reuseWorker || !released.get) {
        try {
          worker.close()
        } catch {
          case e: Exception =>
            logWarning("Failed to close worker socket", e)
        }
      }
    }

    writerThread.start()
    new MonitorThread(env, worker, context).start()

    // Return an iterator that read lines from the process's stdout
    val stream = new DataInputStream(new BufferedInputStream(worker.getInputStream, bufferSize))

    val stdoutIterator = newReaderIterator(
      stream, writerThread, startTime, env, worker, released, context)
    new InterruptibleIterator(context, stdoutIterator)
  }

  protected def newWriterThread(
      env: SparkEnv,
      worker: Socket,
      inputIterator: Iterator[IN],
      partitionIndex: Int,
      context: TaskContext): WriterThread

  protected def newReaderIterator(
      stream: DataInputStream,
      writerThread: WriterThread,
      startTime: Long,
      env: SparkEnv,
      worker: Socket,
      released: AtomicBoolean,
      context: TaskContext): Iterator[OUT]
  ...
```

(the quoted diff and the review comment are cut off in the archive)
[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19338 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82215/
[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19338 Merged build finished. Test FAILed.
[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19338 **[Test build #82215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82215/testReport)** for PR 19338 at commit [`4d906b7`](https://github.com/apache/spark/commit/4d906b772ba1d893455d5b5676a898d81f035ad5).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19349#discussion_r141243843 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala --- (quotes the same new-file diff reproduced above, on the finally block of `writeIteratorToStream`) --- End diff -- Never mind; `ArrowStreamPandasSerializer` is not a `FramedSerializer`.
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19184 @jerryshao Actually the second half of your comment is not valid in this case. The PR is not targeting the merge sort here; it is relevant when iterating over all tuples. `UnsafeExternalSorter` has two methods to iterate over the tuples. You are referring to `getSortedIterator`, which uses a PriorityQueue and requires all files to be opened at the same time (so that it can return a sorted iterator). The primary use case of this PR is `getIterator`, where we simply iterate over all tuples and there is no need to sort; it is used in `ExternalAppendOnlyUnsafeRowArray`, for example, which in turn backs various `WindowFunctionFrame` implementations.
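To make the distinction concrete, here is a minimal sketch (illustrative only; names and structure are simplified stand-ins, not Spark's actual `UnsafeExternalSorter` internals) of why an unsorted scan can open spill files one at a time:

```scala
import java.io.{BufferedReader, File, FileReader}

object SpillScan {
  // Sketch: a plain scan over spilled runs. Iterator.flatMap is lazy, so each
  // file is opened only when the previous one is exhausted; at most one file
  // handle needs to be open at a time. (Closing readers on exhaustion is
  // elided to keep the sketch short.)
  def scanAll(spillFiles: Seq[File]): Iterator[String] =
    spillFiles.iterator.flatMap { f =>
      val reader = new BufferedReader(new FileReader(f))
      Iterator.continually(reader.readLine()).takeWhile(_ != null)
    }
}
```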
[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19349#discussion_r141242888 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala --- (quotes the same new-file diff reproduced above; the comment is on the finally block of `writeIteratorToStream`)

```scala
        } {
          writer.end()
          root.close()
          allocator.close()
          closed = true
        }
```

--- End diff -- I think we need to write out `END_OF_DATA_SECTION` after all the data has been written out?
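For context, the non-Arrow write path frames its output with length prefixes and terminates the data section with a negative marker; a rough sketch of that convention follows, assuming the `SpecialLengths` constants from `org.apache.spark.api.python` (where `END_OF_DATA_SECTION` is `-1`):

```scala
import java.io.DataOutputStream

object DataSectionFraming {
  // Sketch: length-prefixed payloads terminated by a negative marker, so the
  // reader on the Python side knows when the data section is over.
  def writeDataSection(dataOut: DataOutputStream, payloads: Iterator[Array[Byte]]): Unit = {
    payloads.foreach { bytes =>
      dataOut.writeInt(bytes.length) // positive length prefix
      dataOut.write(bytes)
    }
    dataOut.writeInt(-1) // SpecialLengths.END_OF_DATA_SECTION
  }
}
```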
[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19349#discussion_r141242553 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonUDFRunner.scala --- @@ -0,0 +1,103 @@ (new file; the Apache license header is omitted here)

```scala
package org.apache.spark.sql.execution.python

import java.io._
import java.net._
import java.util.concurrent.atomic.AtomicBoolean

import org.apache.spark._
import org.apache.spark.api.python._

/**
 * A helper class to run Python UDFs in Spark.
 */
class PythonUDFRunner(
    funcs: Seq[ChainedPythonFunctions],
    bufferSize: Int,
    reuseWorker: Boolean,
    evalType: Int,
    argOffsets: Array[Array[Int]])
  extends BasePythonRunner[Array[Byte], Array[Byte]](
    funcs, bufferSize, reuseWorker, evalType, argOffsets) {

  protected override def newWriterThread(
      env: SparkEnv,
      worker: Socket,
      inputIterator: Iterator[Array[Byte]],
      partitionIndex: Int,
      context: TaskContext): WriterThread = {
    new WriterThread(env, worker, inputIterator, partitionIndex, context) {

      override def writeCommand(dataOut: DataOutputStream): Unit = {
```

--- End diff -- This implementation looks no different from `writeCommand` in `ArrowPythonRunner`. If so, I think we don't need to duplicate it.
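One way to remove the duplication, sketched below under the assumption that both runners keep the same wire format (the helper name `writeUDFs` is hypothetical, not necessarily what the PR adopts):

```scala
import java.io.DataOutputStream

import org.apache.spark.api.python.ChainedPythonFunctions

object PythonUDFSerialization {
  // Hypothetical shared helper: serializes the chained UDFs and their
  // argument offsets once, so both runners can call it from writeCommand.
  def writeUDFs(
      dataOut: DataOutputStream,
      funcs: Seq[ChainedPythonFunctions],
      argOffsets: Array[Array[Int]]): Unit = {
    dataOut.writeInt(funcs.length)
    funcs.zip(argOffsets).foreach { case (chained, offsets) =>
      dataOut.writeInt(offsets.length)
      offsets.foreach(offset => dataOut.writeInt(offset))
      dataOut.writeInt(chained.funcs.length)
      chained.funcs.foreach { f =>
        dataOut.writeInt(f.command.length)
        dataOut.write(f.command)
      }
    }
  }
}
```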
[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19349#discussion_r141242240 --- Diff: python/pyspark/serializers.py ---

```python
@@ -251,6 +256,36 @@ def __repr__(self):
         return "ArrowPandasSerializer"


+class ArrowStreamPandasSerializer(Serializer):
+    """
+    Serializes Pandas.Series as Arrow data with Arrow streaming format.
+    """
+
+    def load_stream(self, stream):
+        import pyarrow as pa
+        reader = pa.open_stream(stream)
+        for batch in reader:
+            table = pa.Table.from_batches([batch])
+            yield [c.to_pandas() for c in table.itercolumns()]
+
+    def dump_stream(self, iterator, stream):
```

--- End diff -- Maybe add a few comments for `dump_stream` and `load_stream`, like `ArrowPandasSerializer` has.
[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19349#discussion_r141240246 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala --- (quotes the same new-file diff reproduced above; the message is cut off in the archive before the review comment)
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19184 Hi @mridulm, sorry for the late response. I agree with you that the scenario here differs from shuffle, but the underlying structure and the way data is spilled are the same, so the problem is the same. On the shuffle side we can control the memory size to hold more data before spilling and so avoid too many spill files; as you mentioned, here we cannot. Yes, it is not necessary to open all the files beforehand. But since we're using a priority queue to do the merge sort, it is very likely that all the file handles end up open at once, so this fix only reduces the chances of hitting the too-many-open-files issue. Maybe we can call this an interim fix; what do you think?
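The file-handle pressure of the sorted path comes from the k-way merge itself; a minimal sketch (plain Scala, not Spark's implementation) shows why every run must stay open while the merge is in progress:

```scala
import scala.collection.mutable

object KWayMerge {
  // Sketch: a k-way merge over already-sorted runs. The priority queue always
  // holds the current head of every run, so all k underlying inputs (spill
  // files, in Spark's case) must remain open until the merge finishes.
  def mergeSorted(runs: Seq[Iterator[Int]]): Iterator[Int] = new Iterator[Int] {
    private case class Head(value: Int, run: Iterator[Int])
    // PriorityQueue dequeues the maximum, so reverse the ordering for a min-heap.
    private val heap =
      mutable.PriorityQueue.empty[Head](Ordering.by[Head, Int](_.value).reverse)
    runs.foreach(r => if (r.hasNext) heap.enqueue(Head(r.next(), r)))

    def hasNext: Boolean = heap.nonEmpty
    def next(): Int = {
      val h = heap.dequeue()
      if (h.run.hasNext) heap.enqueue(Head(h.run.next(), h.run))
      h.value
    }
  }
}
```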
[GitHub] spark pull request #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19349#discussion_r141237789 --- Diff: python/pyspark/serializers.py ---

```python
@@ -211,33 +212,37 @@ def __repr__(self):
         return "ArrowSerializer"


+def _create_batch(series):
+    import pyarrow as pa
+    # Make input conform to [(series1, type1), (series2, type2), ...]
+    if not isinstance(series, (list, tuple)) or \
+            (len(series) == 2 and isinstance(series[1], pa.DataType)):
+        series = [series]
+    series = ((s, None) if not isinstance(s, (list, tuple)) else s for s in series)
+
+    # If a nullable integer series has been promoted to floating point with NaNs, need to cast
+    # NOTE: this is not necessary with Arrow >= 0.7
+    def cast_series(s, t):
+        if t is None or s.dtype == t.to_pandas_dtype():
+            return s
+        else:
+            return s.fillna(0).astype(t.to_pandas_dtype(), copy=False)
+
+    arrs = [pa.Array.from_pandas(cast_series(s, t), mask=s.isnull(), type=t) for s, t in series]
+    return pa.RecordBatch.from_arrays(arrs, ["_%d" % i for i in xrange(len(arrs))])
+
+
 class ArrowPandasSerializer(ArrowSerializer):
```

--- End diff -- Do we need to keep `ArrowPandasSerializer`? I don't see it used anywhere other than in the pandas UDF path.
[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19358 Merged build finished. Test PASSed.
[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19358 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82218/
[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19358 **[Test build #82218 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82218/testReport)** for PR 19358 at commit [`a327419`](https://github.com/apache/spark/commit/a3274191d0dd886dfe3044262543ff7f56bf851e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19338#discussion_r141236393 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---

```scala
@@ -670,9 +670,12 @@ private[spark] class TaskSetManager(
       }
       if (blacklistedEverywhere) {
         val partition = tasks(indexInTaskSet).partitionId
-        abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
-          s"cannot run anywhere due to node and executor blacklist. Blacklisting behavior " +
-          s"can be configured via spark.blacklist.*.")
+        abort(s"""
+          |Aborting $taskSet because task $indexInTaskSet (partition $partition)
+          |cannot run anywhere due to node and executor blacklist.
+          |Most recent failure:
+          |${taskSetBlacklist.getLatestFailureReason}\n
```

--- End diff -- My fault; the Scala version affected my earlier test. My default Scala is 2.10.4, and below is the result tested with Scala 2.11.2:

```
scala> val s = """ss\nss
     | sss\n"""
s: String = ss\nss
sss\n

scala> print(s)
ss\nss
sss\n
```
[GitHub] spark pull request #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in ...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/17383
[GitHub] spark issue #19327: [SPARK-22136][SS] Implement stream-stream outer joins.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19327 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82212/
[GitHub] spark issue #17383: [SPARK-3165][MLlib] DecisionTree use sparsity in data
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17383 Hi, since this work has been open for a long time, I reviewed it myself. After a careful review: because `SparseVector` is a compressed sparse row format, the only benefit of the PR would be data storage, at the cost of performance. And for tree methods it is uncommon to handle extremely high-dimensional features, so the trade-off does not satisfy me. I prefer [SPARK-3717: DecisionTree, RandomForest: Partition by feature](https://issues.apache.org/jira/browse/SPARK-3717) as an alternative, which, if I understand correctly, would benefit both performance and storage. So the PR is closed. Thanks everyone for the reviews and comments.
[GitHub] spark issue #19327: [SPARK-22136][SS] Implement stream-stream outer joins.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19327 Merged build finished. Test PASSed.
[GitHub] spark issue #19327: [SPARK-22136][SS] Implement stream-stream outer joins.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19327 **[Test build #82212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82212/testReport)** for PR 19327 at commit [`680c51a`](https://github.com/apache/spark/commit/680c51a9cedd1370dec9d249c2eba8cdfc61fbd8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19338#discussion_r141235801 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- (quotes the same hunk shown above) --- End diff -- Never mind; it looks like the result differs depending on whether string interpolation is used.
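The difference the thread converged on: the `s` interpolator processes standard escape sequences, while a plain (raw) triple-quoted string keeps them literal. A quick check:

```scala
// With the s-interpolator, \n is unescaped into a real newline.
println(s"""a\nb""")   // prints two lines: a, then b
// Without an interpolator, a triple-quoted string is raw.
println("""a\nb""")    // prints the four characters a\nb on one line
```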
[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19338#discussion_r141235459 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- (quotes the same hunk shown above) --- End diff -- Are you sure?

```scala
scala> val s = """fdsafdsa\n\nfdsafdsa"""
s: String = fdsafdsa\n\nfdsafdsa
```
[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19358 **[Test build #82218 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82218/testReport)** for PR 19358 at commit [`a327419`](https://github.com/apache/spark/commit/a3274191d0dd886dfe3044262543ff7f56bf851e).
[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19358 Would you please add a [MESOS] tag to your PR title, like other PRs do? Thanks.
[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19358 ok to test.
[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19338#discussion_r141235079 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- (quotes the same hunk shown above) --- End diff -- Actually I tested locally with something like below:

```
scala> val s = s"""
     | sss\n
     | sss"""

scala> print(s)
sss
sss
```

And I found that there is no need to add `\n`, since the triple-quoted string already carries the line breaks. Sorry for that; I had updated before your newest comment. Thanks @jerryshao
[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19338 **[Test build #82217 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82217/testReport)** for PR 19338 at commit [`05adc2a`](https://github.com/apache/spark/commit/05adc2adf1d87a1a3044903bdd5d54dd1e39adc7).
[GitHub] spark issue #19359: [SPARK-22129][SPARK-22138] Release script improvements
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19359 **[Test build #82216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82216/testReport)** for PR 19359 at commit [`5537787`](https://github.com/apache/spark/commit/553778785891cc973a28dc05e57414eb759d8f66).
[GitHub] spark issue #19330: [SPARK-18134][SQL] Orderable MapType
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19330 It seems the failed SparkR unit test is unrelated. In the current change, I added a trait `OrderSpecified`; expressions that rely on the ordering of a data type (`BinaryComparison`, `Max`, `Min`, `SortArray`, `SortOrder`) extend `OrderSpecified`. During analysis, expressions with map-type output are wrapped in `OrderMaps` and thus converted to ordered maps. Expressions with map-type output in `Aggregate`, `Distinct` and `SetOperation` are also wrapped in `OrderMaps`.
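The underlying idea, in plain Scala (illustrative only; `OrderMaps` and `OrderSpecified` are the PR's own constructs and are not reproduced here): once a map's entries are put into a canonical key order, comparisons between maps become well-defined.

```scala
object OrderedMapSketch {
  // Sketch: canonicalize map entries by key, then compare lexicographically.
  def canonicalize[K: Ordering, V](m: Map[K, V]): Seq[(K, V)] =
    m.toSeq.sortBy(_._1)

  def compareMaps[K: Ordering, V: Ordering](a: Map[K, V], b: Map[K, V]): Int = {
    import scala.math.Ordering.Implicits._
    implicitly[Ordering[Seq[(K, V)]]].compare(canonicalize(a), canonicalize(b))
  }
}
```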
[GitHub] spark pull request #19359: [SPARK-22129][SPARK-22138] Release script improve...
GitHub user holdenk opened a pull request: https://github.com/apache/spark/pull/19359 [SPARK-22129][SPARK-22138] Release script improvements

## What changes were proposed in this pull request?

Use the GPG_KEY param, fix lsof to a non-hardcoded path, and remove the version swap since it wasn't really needed. Also export JAVA_HOME for downstream scripts.

## How was this patch tested?

Rolled 2.1.2 RC2.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/holdenk/spark SPARK-22129-fix-signing Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19359.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19359

commit 553778785891cc973a28dc05e57414eb759d8f66 Author: Holden Karau Date: 2017-09-27T02:41:00Z use GPG_KEY to control the GPG key. Add retry. Fix lsof to not be hard coded and just use path version. Remove version swapping on publish-release since we only need it some of the time and it was confusing.
[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19338#discussion_r141234485 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala --- (quotes the same hunk shown above) --- End diff -- You don't have to add "\n", since you already use the triple-quoted format. Please try to understand the basics of this API before modifying it. Also verify locally before pushing a new commit, and update the PR description to reflect your new format accordingly.
[GitHub] spark issue #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as input ...
Github user goldmedal commented on the issue: https://github.com/apache/spark/pull/19339 @HyukjinKwon @viirya Thanks for your reviews.
[GitHub] spark issue #17503: [SPARK-3159][MLlib] Check for reducible DecisionTree
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/17503 Hi, @WeichenXu123. As @srowen said, the benefit of this would be speed at predict time, or model storage. Hence I'm not sure whether a benchmark is really needed for this PR.
[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19338 **[Test build #82215 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82215/testReport)** for PR 19338 at commit [`4d906b7`](https://github.com/apache/spark/commit/4d906b772ba1d893455d5b5676a898d81f035ad5).
[GitHub] spark pull request #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/19339#discussion_r141232322 --- Diff: python/pyspark/sql/readwriter.py --- (quotes the same diff reproduced in the next message) --- End diff -- OK, thanks.
[GitHub] spark pull request #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19339#discussion_r141231147 --- Diff: python/pyspark/sql/readwriter.py ---

```python
@@ -420,7 +425,29 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             columnNameOfCorruptRecord=columnNameOfCorruptRecord, multiLine=multiLine)
         if isinstance(path, basestring):
             path = [path]
-        return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
+        if type(path) == list:
+            return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
+        elif isinstance(path, RDD):
+            def func(iterator):
+                for x in iterator:
+                    if not isinstance(x, basestring):
+                        x = unicode(x)
+                    if isinstance(x, unicode):
+                        x = x.encode("utf-8")
+                    yield x
+            keyed = path.mapPartitions(func)
+            keyed._bypass_serializer = True
+            jrdd = keyed._jrdd.map(self._spark._jvm.BytesToString())
+            # see SPARK-22112
+            # There aren't any jvm api for creating a dataframe from rdd storing csv.
```

--- End diff -- Let's fix these comments to something like:

```
SPARK-22112: There aren't any jvm api for creating a dataframe from rdd storing csv.
...
```

or

```
There aren't any jvm api ...
... for creating a dataframe from dataset storing csv. See SPARK-22112.
```

when we happen to fix code around here, or when reviewing other PRs touching this area in the future.
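For reference, the JVM side can already parse a `Dataset[String]` of CSV lines directly (`DataFrameReader.csv(Dataset[String])`, available since Spark 2.2), which is the hop the Python RDD path needs to reach; a small usage sketch:

```scala
import org.apache.spark.sql.SparkSession

object CsvFromDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("csv-from-dataset").getOrCreate()
    import spark.implicits._

    // Parse CSV records held in a Dataset[String] rather than in files on disk.
    val lines = Seq("Alice,1", "Bob,2").toDS()
    val df = spark.read.option("inferSchema", "true").csv(lines)
    df.show()
  }
}
```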
[GitHub] spark issue #19349: [SPARK-22125][PYSPARK][SQL] Enable Arrow Stream format f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19349 **[Test build #82214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82214/testReport)** for PR 19349 at commit [`7f6e43f`](https://github.com/apache/spark/commit/7f6e43f7b291b0e5f60914bcbf6a1a7dee92820f).
[GitHub] spark issue #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as input ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19339 I've tested a few times locally and can't reproduce the same failure.
[GitHub] spark pull request #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19339
[GitHub] spark issue #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as input ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19339 Merged to master.
[GitHub] spark issue #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as input ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19339 Thanks @goldmedal.
[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19338#discussion_r141229038 --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---

```scala
@@ -671,8 +671,10 @@ private[spark] class TaskSetManager(
       if (blacklistedEverywhere) {
         val partition = tasks(indexInTaskSet).partitionId
         abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
-          s"cannot run anywhere due to node and executor blacklist. Blacklisting behavior " +
-          s"can be configured via spark.blacklist.*.")
+          s"cannot run anywhere due to node and executor blacklist.\n" +
```

--- End diff -- Nice, @jerryshao; I will update as soon as possible.
[GitHub] spark issue #19229: [SPARK-22001][ML][SQL] ImputerModel can do withColumn fo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19229 ping @gatorsmile for the SQL part.
[GitHub] spark pull request #19168: [SPARK-21956][CORE] Fetch up to max bytes when bu...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19168#discussion_r141228856

--- Diff: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala ---
@@ -439,6 +443,8 @@ final class ShuffleBlockFetcherIterator(
       fetchRequests += FetchRequest(address, Array((blockId, size)))
       result = null
     }
+    bytesInFlight -= size
+    fetchUpToMaxBytes()
--- End diff --

Thanks @cloud-fan. IIUC, when a `FetchFailedException` is thrown there is no need to consider this, which is also the original design.
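[Editor's note] As background for this exchange, a minimal sketch of the bookkeeping pattern the diff adds: when a block's bytes leave the in-flight set, release them from the counter and immediately try to issue more requests up to the cap. All names and types here are hypothetical simplifications; the real iterator tracks far more state.

```scala
import scala.collection.mutable

// Simplified sketch of the "release, then top up" pattern.
class FetchBookkeeping(maxBytesInFlight: Long) {
  private var bytesInFlight = 0L
  private val pending = mutable.Queue[Long]() // sizes of deferred requests

  def enqueue(size: Long): Unit = pending.enqueue(size)

  // Called when a result's bytes are no longer in flight.
  def release(size: Long): Unit = {
    bytesInFlight -= size
    fetchUpToMaxBytes()
  }

  private def fetchUpToMaxBytes(): Unit = {
    // Issue deferred requests as long as they fit under the cap.
    while (pending.nonEmpty && bytesInFlight + pending.head <= maxBytesInFlight) {
      bytesInFlight += pending.dequeue()
      // the real iterator would send the remote fetch request here
    }
  }
}

object FetchBookkeepingDemo extends App {
  val fb = new FetchBookkeeping(maxBytesInFlight = 48L)
  Seq(32L, 32L, 16L).foreach(fb.enqueue)
  fb.release(0L) // kick off initial fetches up to the cap
}
```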
[GitHub] spark issue #19339: [SPARK-22112][PYSPARK] Supports RDD of strings as input ...
Github user goldmedal commented on the issue: https://github.com/apache/spark/pull/19339 @HyukjinKwon I have updated the title. Thanks!
[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19338#discussion_r141226978

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -671,8 +671,10 @@ private[spark] class TaskSetManager(
       if (blacklistedEverywhere) {
         val partition = tasks(indexInTaskSet).partitionId
         abort(s"Aborting $taskSet because task $indexInTaskSet (partition $partition) " +
-          s"cannot run anywhere due to node and executor blacklist. Blacklisting behavior " +
-          s"can be configured via spark.blacklist.*.")
+          s"cannot run anywhere due to node and executor blacklist.\n" +
--- End diff --

Can we change this to Scala's triple-quoted string interpolation? You can refer to [this example](https://github.com/apache/spark/blob/ceaec93839d18a20e0cd78b70f3ea71872dce0a4/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala#L211).
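[Editor's note] For readers following this thread, a minimal runnable sketch of the triple-quoted style being suggested. The surrounding values and the message wording are illustrative stand-ins, not the PR's final text:

```scala
object StripMarginExample extends App {
  // Hypothetical stand-ins for values available inside TaskSetManager.
  val taskSet = "TaskSet 1.0"
  val indexInTaskSet = 3
  val partition = 7
  val latestFailureReason = "example failure reason"

  // Triple-quoted interpolation keeps a multi-line message readable in source;
  // stripMargin removes everything up to and including each leading '|'.
  val message =
    s"""|Aborting $taskSet because task $indexInTaskSet (partition $partition)
        |cannot run anywhere due to node and executor blacklist.
        |Most recent failure:
        |$latestFailureReason
        |Blacklisting behavior can be configured via spark.blacklist.*.""".stripMargin

  println(message)
}
```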
[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...
Github user caneGuy commented on a diff in the pull request: https://github.com/apache/spark/pull/19338#discussion_r141226765

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetBlacklist.scala ---
@@ -61,6 +61,16 @@ private[scheduler] class TaskSetBlacklist(val conf: SparkConf, val stageId: Int,
   private val blacklistedExecs = new HashSet[String]()
   private val blacklistedNodes = new HashSet[String]()

+  private var latestFailureReason: String = null
+
+  /**
+   * Get the most recent failure reason of this TaskSet.
+   * @return
+   */
+  def getLatestFailureReason: String = {
--- End diff --

Thanks @jerryshao @squito. Could you help trigger another Jenkins test? The last run had a PySpark failure.
[GitHub] spark issue #19338: [SPARK-22123][CORE] Add latest failure reason for task s...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19338 Thanks @squito, I have updated the description.
[GitHub] spark pull request #19338: [SPARK-22123][CORE] Add latest failure reason for...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/19338#discussion_r141226456

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetBlacklist.scala ---
@@ -61,6 +61,16 @@ private[scheduler] class TaskSetBlacklist(val conf: SparkConf, val stageId: Int,
   private val blacklistedExecs = new HashSet[String]()
   private val blacklistedNodes = new HashSet[String]()

+  private var latestFailureReason: String = null
+
+  /**
+   * Get the most recent failure reason of this TaskSet.
+   * @return
+   */
+  def getLatestFailureReason: String = {
--- End diff --

@squito yes, at the scope level it is fine. My thought is that this unnecessarily exposes the class member to other classes. It is not a big deal, just my personal preference.
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user fjh100456 commented on the issue: https://github.com/apache/spark/pull/19218 @dongjoon-hyun Thank you very much, I'll fix them as soon as possible.
[GitHub] spark pull request #19351: [SPARK-22127][CORE]The Master Register Applicatio...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/19351#discussion_r141224687

--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala ---
@@ -265,6 +265,9 @@ private[deploy] class Master(
     val app = createApplication(description, driver)
     registerApplication(app)
     logInfo("Registered app " + description.name + " with ID " + app.id)
+    if (app.state == ApplicationState.WAITING) {
--- End diff --

It is a warning. If no resources can be allocated, the Spark application may keep waiting, which would be a problem. I have now fixed it as shown below. Do you think this is correct?

    if (app.state == ApplicationState.WAITING) {
      logWarning("App needs to wait for a worker to launch executors; " +
        "maybe no worker has extra resources to allocate")
    }
[GitHub] spark issue #19346: [SAPRK-20785][WEB-UI][SQL] Spark should provide jump lin...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/19346 @gatorsmile Could you help review the code?
[GitHub] spark issue #19358: [SPARK-22135] metrics in spark-dispatcher not being regi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19358 Can one of the admins verify this patch?
[GitHub] spark issue #19327: [SPARK-22136][SS] Implement stream-stream outer joins.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19327 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82211/ Test PASSed.
[GitHub] spark pull request #19358: [SPARK-22135] metrics in spark-dispatcher not bei...
GitHub user pmackles opened a pull request: https://github.com/apache/spark/pull/19358

[SPARK-22135] metrics in spark-dispatcher not being registered properly

## What changes were proposed in this pull request?

Fix a trivial bug with how metrics are registered in the Mesos dispatcher. The bug resulted in creating a new registry each time the metricRegistry() method was called.

## How was this patch tested?

Verified manually on a local Mesos setup.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pmackles/spark SPARK-22135

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19358.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19358

commit a3274191d0dd886dfe3044262543ff7f56bf851e
Author: Paul Mackles
Date: 2017-09-27T00:47:21Z

    [SPARK-22135] metrics in spark-dispatcher not being registered properly
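[Editor's note] For readers skimming the digest, a minimal sketch of the bug pattern described above. The class names are hypothetical; the actual fix lives in the Mesos dispatcher's metrics code, which exposes a Dropwizard MetricRegistry:

```scala
import com.codahale.metrics.MetricRegistry

// Buggy pattern: every call creates a fresh registry, so counters and
// gauges registered through one call are invisible to a reporter that
// obtained a different instance.
class BuggyMetricsSource {
  def metricRegistry(): MetricRegistry = new MetricRegistry()
}

// Fixed pattern: create the registry once and always return that instance.
class FixedMetricsSource {
  private val registry = new MetricRegistry()
  def metricRegistry(): MetricRegistry = registry
}
```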
[GitHub] spark issue #19327: [SPARK-22136][SS] Implement stream-stream outer joins.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19327 Merged build finished. Test PASSed.
[GitHub] spark issue #19327: [SPARK-22136][SS] Implement stream-stream outer joins.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19327 **[Test build #82211 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82211/testReport)** for PR 19327 at commit [`a8aa356`](https://github.com/apache/spark/commit/a8aa35691d7603c0ca283dc2aff295031bcc3b9e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19357 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82213/ Test FAILed.
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19357 **[Test build #82213 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82213/testReport)** for PR 19357 at commit [`46af54d`](https://github.com/apache/spark/commit/46af54d9c86fa1e5322fdd92ed47fe3d419dd966).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class NumericEquiHeightHgm(`
  * `case class StringEquiHeightHgm(`
  * `case class NumericEquiHeightBin(lowerBound: Double, upperBound: Double, binNdv: Long)`
  * `case class StringEquiHeightBin(upperBound: String, binNdv: Long)`
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19357 Merged build finished. Test FAILed.
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19357 **[Test build #82213 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82213/testReport)** for PR 19357 at commit [`46af54d`](https://github.com/apache/spark/commit/46af54d9c86fa1e5322fdd92ed47fe3d419dd966).
[GitHub] spark issue #19339: [SPARK-22112][PYSPARK] Add an API to create a DataFrame ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19339 @goldmedal, are you online now? How about fixing the PR title to say something like "Supports RDD of strings as input in spark.read.csv in PySpark"?
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19357 Jenkins, add to whitelist
[GitHub] spark issue #19357: [SPARK-21322][SQL][WIP] support histogram in filter card...
Github user ron8hu commented on the issue: https://github.com/apache/spark/pull/19357 cc @wzhfy Please review the code first before I ask the community to review it. Thanks.
[GitHub] spark pull request #19335: mapPartitions Api
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19335
[GitHub] spark pull request #19348: [BUILD] Close stale PRs
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19348
[GitHub] spark pull request #19347: Branch 2.2 sparkmlib's output of many algorithms ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19347
[GitHub] spark pull request #19244: SPARK-22021
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19244
[GitHub] spark pull request #18474: [SPARK-21235][TESTS] UTest should clear temp resu...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18474
[GitHub] spark pull request #18253: [SPARK-18838][CORE] Introduce multiple queues in ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18253
[GitHub] spark pull request #18978: [SPARK-21737][YARN]Create communication channel b...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18978
[GitHub] spark pull request #18897: [SPARK-21655][YARN] Support Kill CLI for Yarn mod...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18897
[GitHub] spark pull request #19334: Branch 1.6
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19334
[GitHub] spark pull request #15009: [SPARK-17443][SPARK-11035] Stop Spark Application...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15009
[GitHub] spark pull request #19315: [MINOR][ML]Updated english.txt word ordering
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19315
[GitHub] spark pull request #19295: [SPARK-22080][SQL] Adds support for allowing user...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19295
[GitHub] spark pull request #19356: Merge pull request #1 from apache/master
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19356
[GitHub] spark pull request #19300: [SPARK-22082][SparkR]Spelling mistake: "choosen" ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19300
[GitHub] spark pull request #19152: [SPARK-21915][ML][PySpark] Model 1 and Model 2 Pa...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19152
[GitHub] spark pull request #19238: [SPARK-22016][SQL] Add HiveDialect for JDBC conne...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19238
[GitHub] spark pull request #13794: [SPARK-15574][ML][PySpark] Python meta-algorithms...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13794
[GitHub] spark pull request #19236: SPARK-22015: Remove usage of non-used private fie...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19236
[GitHub] spark issue #19357: support histogram in filter cardinality estimation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19357 Can one of the admins verify this patch?
[GitHub] spark issue #19327: [WIP] Implement stream-stream outer joins.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19327 **[Test build #82212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82212/testReport)** for PR 19327 at commit [`680c51a`](https://github.com/apache/spark/commit/680c51a9cedd1370dec9d249c2eba8cdfc61fbd8).
[GitHub] spark issue #19348: [BUILD] Close stale PRs
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19348 Merged to master.