[GitHub] spark pull request: [SPARK-14345][SQL] Decouple deserializer expre...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/12131 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14345][SQL] Decouple deserializer expre...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/12131#issuecomment-205918635 LGTM, thanks for improving the comments. It's much clearer to me what is happening now! Merging to master.
[GitHub] spark pull request: [SPARK-14397][WEBUI] and tags ar...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/12170
[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...
Github user pravingadakh commented on the pull request: https://github.com/apache/spark/pull/12176#issuecomment-205918384 I could document the returned values, but frankly I have no idea what those values are. I can see documentation for the first returned value `gammad` in the comments, but none for the others.
[GitHub] spark pull request: [SPARK-14397][WEBUI] and tags ar...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/12170#issuecomment-205917914 LGTM. Merging to master
[GitHub] spark pull request: [SPARK-13538][ML] Add GaussianMixture to ML
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/11419#issuecomment-205916377 Haha OK thanks. I just sent a PR to update this PR: [https://github.com/zhengruifeng/spark/pull/1]
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/12087#discussion_r58582999

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.SparkContext
+import org.apache.spark.api.java.function.MapFunction
+import org.apache.spark.util.Benchmark
+
+/**
+ * Benchmark for Dataset typed operations.
+ */
+object DatasetBenchmark {
+
+  case class Data(i: Int, s: String)
+
+  def main(args: Array[String]): Unit = {
+    val sparkContext = new SparkContext("local[*]", "benchmark")
+    val sqlContext = new SQLContext(sparkContext)
+
+    import sqlContext.implicits._
+
+    val numRows = 1000
+    val ds = sqlContext.range(numRows).map(l => Data(l.toInt, l.toString))
+    ds.cache()
+    ds.collect()  // make sure data are cached
+
+    val benchmark = new Benchmark("Dataset.map", numRows)
+
+    val scalaFunc = (d: Data) => Data(d.i + 1, d.s)
+    benchmark.addCase("scala function") { iter =>
+      var res = ds
+      var i = 0
+      while (i < 10) {
+        res = res.map(scalaFunc)
+        i += 1
+      }
+      res.queryExecution.toRdd.count()
+    }
+
+    val javaFunc = new MapFunction[Data, Data] {
+      override def call(d: Data): Data = Data(d.i + 1, d.s)
+    }
+    val enc = implicitly[Encoder[Data]]
+    benchmark.addCase("java function") { iter =>
+      var res = ds
+      var i = 0
+      while (i < 10) {
+        res = res.map(javaFunc, enc)
+        i += 1
+      }
+      res.queryExecution.toRdd.count()
--- End diff --

Okay... it's harder than I thought to run it on master :( At least base the benchmark on these tests: https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/DatasetPerformance.scala (back-to-back maps, compare with RDDs)
[GitHub] spark pull request: [SPARK-14284][ML] KMeansSummary deprecating si...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/12084
[GitHub] spark pull request: [SPARK-13048][ML][MLLIB] keepLastCheckpoint op...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12166#issuecomment-205914706 **[Test build #55001 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55001/consoleFull)** for PR 12166 at commit [`59904c4`](https://github.com/apache/spark/commit/59904c441a57a22465e3a2b338f1867ad97f5bdd).
[GitHub] spark pull request: [SPARK-14290][CORE][Network] avoid significant...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/12083#issuecomment-205914342 @liyezhang556520 I like the idea of eagerly copying into a direct buffer, but understand that might be a lot of code for not much gain. I still think we should reduce that limit though - maybe 256k?
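The "eagerly copy into a direct buffer" idea above can be sketched with plain NIO. This is a hedged illustration only: the class, method, and threshold names are hypothetical (the 256k value is just the limit floated in the review), not Spark or Netty APIs.

```java
import java.nio.ByteBuffer;

public class DirectCopySketch {
    // Hypothetical threshold (the 256k value floated in the review);
    // not an actual Spark configuration key or constant.
    static final int COPY_THRESHOLD = 256 * 1024;

    // Eagerly copy a small heap buffer into a direct buffer, so the JVM
    // does not have to allocate a large temporary direct buffer per write.
    static ByteBuffer toDirect(ByteBuffer heap) {
        if (heap.isDirect() || heap.remaining() > COPY_THRESHOLD) {
            // Already direct, or too large to copy eagerly: leave as-is
            // (large buffers would be chunked by other means).
            return heap;
        }
        ByteBuffer direct = ByteBuffer.allocateDirect(heap.remaining());
        direct.put(heap.duplicate()); // duplicate() so the caller's position is untouched
        direct.flip();
        return direct;
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.wrap("hello".getBytes());
        ByteBuffer direct = toDirect(heap);
        System.out.println(direct.isDirect());  // true
        System.out.println(direct.remaining()); // 5
    }
}
```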
[GitHub] spark pull request: [SPARK-14284][ML] KMeansSummary deprecating si...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/12084#issuecomment-205914146 LGTM. Merging with master. Thanks!
[GitHub] spark pull request: [SPARK-13048][ML][MLLIB] keepLastCheckpoint op...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/12166#discussion_r58582117

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -619,6 +651,31 @@ class DistributedLDAModel private[ml] (
   @Since("1.6.0")
   lazy val logPrior: Double = oldDistributedModel.logPrior

+  private var _checkpointFiles: Array[String] = oldDistributedModel.checkpointFiles
+
+  /**
+   * If using checkpointing and [[LDA.keepLastCheckpoint]] is set to true, then there may be
+   * saved checkpoint files.  This method is provided so that users can manage those files.
+   * Note that removing the checkpoints can cause failures if a partition is lost and is needed
+   * by certain [[DistributedLDAModel]] methods.
+   *
+   * @return  Checkpoint files from training
+   */
+  @Since("2.0.0")
+  def getCheckpointFiles: Array[String] = _checkpointFiles
--- End diff --

I'd like to give users a way to clean up manually, but I'll mark it as DeveloperApi.
[GitHub] spark pull request: [SPARK-13048][ML][MLLIB] keepLastCheckpoint op...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/12166#discussion_r58582132

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -758,6 +816,10 @@ class LDA @Since("1.6.0") (
   @Since("1.6.0")
   def setOptimizeDocConcentration(value: Boolean): this.type = set(optimizeDocConcentration, value)

+  /** @group expertSetParam */
+  @Since("2.0.0")
+  def setKeepLastCheckpoint(value: Boolean): this.type = set(keepLastCheckpoint, value)
+
--- End diff --

I don't think so. Once there is a model, the decision about keeping the last checkpoint has already been made. Users can manage the checkpoint via the deletion method, though.
[GitHub] spark pull request: [SPARK-13048][ML][MLLIB] keepLastCheckpoint op...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/12166#issuecomment-205913961 Updated!
[GitHub] spark pull request: [SPARK-13048][ML][MLLIB] keepLastCheckpoint op...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/12166#discussion_r58582125

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -258,7 +265,30 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   def getOptimizeDocConcentration: Boolean = $(optimizeDocConcentration)

   /**
+   * For EM optimizer, if using checkpointing, this indicates whether to keep the last
+   * checkpoint.  If false, then the checkpoint will be deleted.  Deleting the checkpoint can
+   * cause failures if a data partition is lost, so set this bit with care.
+   *
+   * See [[DistributedLDAModel.getCheckpointFiles]] for getting remaining checkpoints and
+   * [[DistributedLDAModel.deleteCheckpointFiles]] for removing remaining checkpoints.
+   *
--- End diff --

Sounds good.
[GitHub] spark pull request: [SPARK-14402][SQL] initcap UDF doesn't match H...
Github user dongjoon-hyun commented on the pull request: https://github.com/apache/spark/pull/12175#issuecomment-205913234 Thank you, @srowen!
[GitHub] spark pull request: [SPARK-14402][SQL] initcap UDF doesn't match H...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/12175#issuecomment-205910764 I think that's pretty reasonable as a minimally invasive fix.
[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12049#issuecomment-205910352 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12049#issuecomment-205910356 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54994/
[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12049#issuecomment-205909806 **[Test build #54994 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54994/consoleFull)** for PR 12049 at commit [`48d760e`](https://github.com/apache/spark/commit/48d760eed41a3d559ad8aa6363b6000d4b9ed54d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class StreamingRelation(dataSource: DataSource, sourceName: String, output: Seq[Attribute])`
[GitHub] spark pull request: [WIP][SPARK-14402][CORE] initcap UDF doesn't m...
Github user dongjoon-hyun commented on the pull request: https://github.com/apache/spark/pull/12175#issuecomment-205902962 Hi, @srowen. I minimized the change on master.
* Undo the changes on the `common` module.
* Implement `initCap` with the following changes.
```
   override def nullSafeEval(string: Any): Any = {
-    string.asInstanceOf[UTF8String].toTitleCase
+    string.asInstanceOf[UTF8String].toLowerCase.toTitleCase
   }

   override def genCode(ctx: CodegenContext, ev: ExprCode): String = {
-    defineCodeGen(ctx, ev, str => s"$str.toTitleCase()")
+    defineCodeGen(ctx, ev, str => s"$str.toLowerCase().toTitleCase()")
   }
```
I think that's enough for the `initCap` function as a small fix for now. What do you think?
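The fix above lowercases the whole string before title-casing it, so interior capitals are flattened the way Hive's `initcap` does. A standalone sketch of that semantics, in plain Java rather than Spark's `UTF8String` (the class and method names here are illustrative, not Spark APIs):

```java
public class InitCapSketch {
    // Lowercase everything first, then capitalize the first letter of each
    // space-separated word -- mirroring toLowerCase().toTitleCase() above.
    static String initCap(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean atWordStart = true;
        for (char c : s.toLowerCase().toCharArray()) {
            out.append(atWordStart ? Character.toTitleCase(c) : c);
            atWordStart = (c == ' ');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Without the leading toLowerCase, "sPark sQL" would keep its
        // interior capitals instead of matching Hive's initcap.
        System.out.println(initCap("sPark sQL")); // Spark Sql
    }
}
```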
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/12087#discussion_r58577527

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
(same DatasetBenchmark.scala hunk quoted above, ending at `res.queryExecution.toRdd.count()`)
--- End diff --

This is doing the expensive conversion to external rows. Could you try to run the existing benchmark, which avoids this and also compares against RDDs?
[GitHub] spark pull request: [WIP][SPARK-14402][CORE] initcap UDF doesn't m...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12175#issuecomment-205902930 **[Test build #55000 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55000/consoleFull)** for PR 12175 at commit [`69c6e1c`](https://github.com/apache/spark/commit/69c6e1c5e02ef40ca8c1ba9c04cfa01e43386c5f).
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12178#issuecomment-205900906 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-14129][SQL] Alter table DDL commands
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/12121#issuecomment-205901169 I think this PR also resolves another JIRA: https://issues.apache.org/jira/browse/SPARK-14128 Thanks!
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12178#issuecomment-205900843 **[Test build #54997 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54997/consoleFull)** for PR 12178 at commit [`64b7cf4`](https://github.com/apache/spark/commit/64b7cf487c59aee3217aae37733ff9879dd79c2c).
* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `abstract class OutputWriterFactory extends Serializable `
  * `abstract class OutputWriter `
  * `case class HadoopFsRelation(`
  * `trait FileFormat `
  * `case class Partition(values: InternalRow, files: Seq[FileStatus])`
  * `trait FileCatalog `
  * `class HDFSFileCatalog(`
  * `case class FakeFileStatus(`
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12178#issuecomment-205900916 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54997/
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/12087#discussion_r58576642

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
(same DatasetBenchmark.scala hunk quoted above, followed by the benchmark results:)

+    /*
+    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4
+    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+    Dataset.map:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+    -------------------------------------------------------------------------
+    scala function             1029 / 1080          9.7         102.9       1.0X
+    java function               965 /  999         10.4          96.5       1.1X
+    */
--- End diff --

Could you add one more test case to show a chained function?
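The "chained function" case requested here just applies the same map repeatedly, back to back, one `.map()` per application. Outside Spark the shape of that benchmark case can be sketched with `java.util.stream` (class and method names here are illustrative, not part of the PR):

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;
import java.util.stream.LongStream;
import java.util.stream.Stream;

public class ChainedMapSketch {
    // Apply the same function `times` times back to back, one .map() per
    // application -- the shape of the chained-function benchmark case.
    static List<Long> chainedMap(long rows, int times, UnaryOperator<Long> f) {
        Stream<Long> s = LongStream.range(0, rows).boxed();
        for (int i = 0; i < times; i++) {
            s = s.map(f); // each iteration adds one more map stage
        }
        return s.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 1000 rows, 10 chained increments, mirroring the Dataset benchmark.
        List<Long> out = chainedMap(1000, 10, x -> x + 1);
        System.out.println(out.get(0));   // 10
        System.out.println(out.get(999)); // 1009
    }
}
```

In the Dataset benchmark the interesting question is whether whole-stage codegen can fuse those ten map stages into one pass, which is why the chained case is worth measuring separately from a single map.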
[GitHub] spark pull request: [SPARK-14354][SQL] Let Expand take name expres...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12138#issuecomment-205897396 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14354][SQL] Let Expand take name expres...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12138#issuecomment-205897404 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54993/
[GitHub] spark pull request: [SPARK-14124] [SQL] [FOLLOWUP] Implement Datab...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/12081#issuecomment-205893057 cc @andrewor14 @yhuai
[GitHub] spark pull request: [SPARK-14353] Dataset Time Window `window` API...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12136#issuecomment-205896645 **[Test build #54999 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54999/consoleFull)** for PR 12136 at commit [`1bd7563`](https://github.com/apache/spark/commit/1bd7563ced8dca52f4156339d86b4d01535fdf58).
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-205896642 **[Test build #54998 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54998/consoleFull)** for PR 12153 at commit [`5fef611`](https://github.com/apache/spark/commit/5fef61170f5b63845f8b5ee72fb3413e1ce4477d).
[GitHub] spark pull request: [SPARK-12569][PySpark][ML]:DecisionTreeRegress...
Github user wangmiao1981 commented on the pull request: https://github.com/apache/spark/pull/12116#issuecomment-205896732 @holdenk Thanks for your comments! I will make changes accordingly.
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12178#issuecomment-205896647 **[Test build #54997 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54997/consoleFull)** for PR 12178 at commit [`64b7cf4`](https://github.com/apache/spark/commit/64b7cf487c59aee3217aae37733ff9879dd79c2c).
[GitHub] spark pull request: [SPARK-14354][SQL] Let Expand take name expres...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12138#issuecomment-205896549

**[Test build #54993 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54993/consoleFull)** for PR 12138 at commit [`8a18acd`](https://github.com/apache/spark/commit/8a18acd6c14c346e409f7da709e7f2d33f6e4662).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class TimestampFromLong(child: Expression) extends UnaryExpression with ExpectsInputTypes `
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/12087#discussion_r58575785

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.SparkContext
+import org.apache.spark.api.java.function.MapFunction
+import org.apache.spark.util.Benchmark
+
+/**
+ * Benchmark for Dataset typed operations.
+ */
+object DatasetBenchmark {
+
+  case class Data(i: Int, s: String)
+
+  def main(args: Array[String]): Unit = {
+    val sparkContext = new SparkContext("local[*]", "benchmark")
+    val sqlContext = new SQLContext(sparkContext)
+
+    import sqlContext.implicits._
+
+    val numRows = 1000
+    val ds = sqlContext.range(numRows).map(l => Data(l.toInt, l.toString))
+    ds.cache()
+    ds.collect() // make sure data are cached
+
+    val benchmark = new Benchmark("Dataset.map", numRows)
+
+    val scalaFunc = (d: Data) => Data(d.i + 1, d.s)
+    benchmark.addCase("scala function") { iter =>
+      var res = ds
+      var i = 0
+      while (i < 10) {
+        res = res.map(scalaFunc)
+        i += 1
+      }
+      res.queryExecution.toRdd.count()
+    }
+
+    val javaFunc = new MapFunction[Data, Data] {
+      override def call(d: Data): Data = Data(d.i + 1, d.s)
+    }
+    val enc = implicitly[Encoder[Data]]
+    benchmark.addCase("java function") { iter =>
+      var res = ds
+      var i = 0
+      while (i < 10) {
+        res = res.map(javaFunc, enc)
+        i += 1
+      }
+      res.queryExecution.toRdd.count()
+    }
+
+    /*
+    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4
+    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+    Dataset.map:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+    -------------------------------------------------------------------------------------
+    scala function                     1029 / 1080          9.7         102.9       1.0X
+    java function                       965 /  999         10.4          96.5       1.1X
--- End diff --

This is very slow; range/filter/aggregate take just a few nanoseconds per row.
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/12178

    [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution package

## What changes were proposed in this pull request?

This PR moves `HadoopFsRelation` related data source API into execution package.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14407-hide-file-scan-api

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12178.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12178

commit 64b7cf487c59aee3217aae37733ff9879dd79c2c
Author: Cheng Lian
Date: 2016-04-05T16:58:52Z

    Hides HadoopFsRelation related data source API
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/12087#discussion_r58575623

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.SparkContext
+import org.apache.spark.api.java.function.MapFunction
+import org.apache.spark.util.Benchmark
+
+/**
+ * Benchmark for Dataset typed operations.
+ */
+object DatasetBenchmark {
+
+  case class Data(i: Int, s: String)
+
+  def main(args: Array[String]): Unit = {
+    val sparkContext = new SparkContext("local[*]", "benchmark")
+    val sqlContext = new SQLContext(sparkContext)
+
+    import sqlContext.implicits._
+
+    val numRows = 1000
+    val ds = sqlContext.range(numRows).map(l => Data(l.toInt, l.toString))
+    ds.cache()
--- End diff --

Why cache here? Uncached range() should be faster than cached.
[GitHub] spark pull request: [SPARK-3724][ML] RandomForest: More options fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11989#issuecomment-205893715 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14124] [SQL] [FOLLOWUP] Implement Datab...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12081#issuecomment-205893818 **[Test build #54996 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54996/consoleFull)** for PR 12081 at commit [`16ac0b1`](https://github.com/apache/spark/commit/16ac0b1a548fedf7f602097ebb4aa1e7ed285515).
[GitHub] spark pull request: [SPARK-3724][ML] RandomForest: More options fo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11989#issuecomment-205893720 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54995/ Test PASSed.
[GitHub] spark pull request: [SPARK-14124] [SQL] [FOLLOWUP] Implement Datab...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/12081#issuecomment-205893108 retest this please
[GitHub] spark pull request: [SPARK-3724][ML] RandomForest: More options fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11989#issuecomment-205893381

**[Test build #54995 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54995/consoleFull)** for PR 11989 at commit [`bebd544`](https://github.com/apache/spark/commit/bebd544bf411717ac22899f79627b0811b1da8c5).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14396] [SQL] Throw Exceptions for DDLs ...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/12169#issuecomment-205892839 Thanks for the review! @hvanhovell Also cc @yhuai @andrewor14
[GitHub] spark pull request: [SPARK-6429] Implement hashCode and equals tog...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/12157#discussion_r58572140

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -53,14 +53,22 @@ import org.apache.spark.util.{NextIterator, SerializableConfiguration, ShutdownH
 /**
  * A Spark split class that wraps around a Hadoop InputSplit.
  */
-private[spark] class HadoopPartition(rddId: Int, idx: Int, s: InputSplit)
+private[spark] class HadoopPartition(rddId: Int, override val index: Int, s: InputSplit)
   extends Partition {

   val inputSplit = new SerializableWritable[InputSplit](s)

-  override def hashCode(): Int = 41 * (41 + rddId) + idx
+  override def hashCode(): Int = 41 * (41 + rddId) + index

-  override val index: Int = idx
+  def canEqual(other: Any): Boolean = other.isInstanceOf[HadoopPartition]
+
+  override def equals(other: Any): Boolean = other match {
+    case that: HadoopPartition =>
+      super.equals(that) &&
+        (that canEqual this) &&
+        index == that.index
--- End diff --

It becomes a field if you use it outside the constructor, or should. It should stay private, yes.

This caused me to think about the partition changes a bit more. Defining `hashCode` without `equals` is technically correct: by default no two distinct objects are equal, so they can't violate the contract that equal objects have the same hash code, no matter what the hash code is. Here it's not clear that two partitions with the same index are semantically equivalent, since they can be from different RDDs. So the RDD's ID matters, but maybe it's not right to implement a notion of equality here either.

It does raise the question: when do partitions get hashed, and why is a non-default implementation important then? It could be vestigial. Maybe it's best not to add `equals`, though, and unless we know for sure the hash code isn't important, leave that. So maybe we end up leaving the partition classes alone and weakening the condition to require `hashCode` if `equals` exists, but not vice versa?
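The contract srowen describes above — equal objects must have equal hash codes, while `hashCode` alone can never violate anything — can be illustrated with a stripped-down partition-like class. This is a sketch for illustration only; `SimplePartition` is hypothetical, not Spark's `HadoopPartition`:

```scala
class SimplePartition(val rddId: Int, val index: Int) {
  // Same shape as the hash in the diff. Defining hashCode WITHOUT equals
  // is safe: under default reference equality no two distinct objects are
  // equal, so "equal objects have equal hash codes" holds vacuously.
  override def hashCode(): Int = 41 * (41 + rddId) + index

  // Once equals IS defined, it must compare exactly the fields hashCode
  // uses; otherwise two equal partitions could hash to different buckets.
  override def equals(other: Any): Boolean = other match {
    case that: SimplePartition => rddId == that.rddId && index == that.index
    case _ => false
  }
}

val a = new SimplePartition(1, 0)
val b = new SimplePartition(1, 0)
val c = new SimplePartition(2, 0) // same index, different RDD: not equal
```

Note how `c` shows the point in the review: index alone does not identify a partition, so the RDD's ID has to participate in both methods or in neither.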
[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/12176#issuecomment-205882669 A-ha, understood about `.values`. This looks pretty reasonable. My only question is: does it make sense conceptually that this method also returns a list of IDs? It doesn't hurt much in practice, and there's a reasonable argument for it logically. We have one case where the caller needs it, after all. Maybe finish this by documenting the three things returned from this method.
[GitHub] spark pull request: [SPARK-14245] [Web UI] Display the user in the...
Github user ajbozarth commented on the pull request: https://github.com/apache/spark/pull/12123#issuecomment-205882481 I'll add it to the history server later today then; I didn't re-test on the history server after the change, so I didn't notice.
[GitHub] spark pull request: [SPARK-10530] [CORE] Kill other task attempts ...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/11996#discussion_r58569079

--- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
@@ -789,6 +791,51 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
   assert(TaskLocation("executor_host1_3") === ExecutorCacheTaskLocation("host1", "3"))
 }

+  test("Kill other task attempts when one attempt belonging to the same task succeeds") {
+    sc = new SparkContext("local", "test")
+    val sched = new FakeTaskScheduler(sc, ("exec1", "host1"), ("exec2", "host2"))
+    val taskSet = FakeTask.createTaskSet(4)
+    val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES)
+    val accumUpdatesByTask: Array[Seq[AccumulableInfo]] = taskSet.tasks.map { task =>
+      task.initialAccumulators.map { a => a.toInfo(Some(0L), None) }
+    }
+    // Offer resources for 4 tasks to start
+    for ((k, v) <- List(
+        "exec1" -> "host1",
+        "exec1" -> "host1",
+        "exec2" -> "host2",
+        "exec2" -> "host2")) {
+      val taskOption = manager.resourceOffer(k, v, NO_PREF)
+      assert(taskOption.isDefined)
+      val task = taskOption.get
+      assert(task.executorId === k)
+    }
+    assert(sched.startedTasks.toSet === Set(0, 1, 2, 3))
+    // Complete the 3 tasks and leave 1 task in running
+    for (id <- Set(0, 1, 2)) {
+      manager.handleSuccessfulTask(id, createTaskResult(id, accumUpdatesByTask(id)))
+      assert(sched.endedTasks(id) === Success)
+    }
+
+    // Wait for the threshold time to start speculative attempt for the running task
+    Thread.sleep(100)
--- End diff --

Ah, you are right; sorry, I looked at the wrong config.
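A fixed `Thread.sleep(100)` like the one quoted above makes a test sensitive to machine timing. A common alternative (a general sketch, not what this PR does; `waitUntil` is a hypothetical helper) is to poll the condition with a timeout, so the test proceeds as soon as the condition holds and only fails after the deadline:

```scala
// Poll `cond` every `intervalMs` until it holds or `timeoutMs` elapses.
// Returns the final value of the condition.
def waitUntil(timeoutMs: Long, intervalMs: Long = 10)(cond: => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMs
  while (System.currentTimeMillis() < deadline) {
    if (cond) return true
    Thread.sleep(intervalMs)
  }
  cond // one last check at the deadline
}

// Example: wait for a counter to cross a threshold instead of sleeping blindly.
var ticks = 0
val ok = waitUntil(timeoutMs = 1000) { ticks += 1; ticks >= 5 }
```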
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12087#issuecomment-205881415 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12087#issuecomment-205881421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54990/ Test PASSed.
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12087#issuecomment-205880601

**[Test build #54990 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54990/consoleFull)** for PR 12087 at commit [`5a96ae4`](https://github.com/apache/spark/commit/5a96ae46b1a9f697b9541ae4abc408069b747315).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-205878760 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-3724][ML] RandomForest: More options fo...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11989#issuecomment-205878822 **[Test build #54995 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54995/consoleFull)** for PR 11989 at commit [`bebd544`](https://github.com/apache/spark/commit/bebd544bf411717ac22899f79627b0811b1da8c5).
[GitHub] spark pull request: [SPARK-14123] [SPARK-14384] [SQL] Handle Creat...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12117#issuecomment-205878605 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54987/ Test PASSed.
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-205878765 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54988/ Test PASSed.
[GitHub] spark pull request: [SPARK-14123] [SPARK-14384] [SQL] Handle Creat...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12117#issuecomment-205878599 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-205878364

**[Test build #54988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54988/consoleFull)** for PR 12153 at commit [`a1f527a`](https://github.com/apache/spark/commit/a1f527aa2c91776085818c360f9ea3d0a8d5d616).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14123] [SPARK-14384] [SQL] Handle Creat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12117#issuecomment-205878276

**[Test build #54987 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54987/consoleFull)** for PR 12117 at commit [`3938766`](https://github.com/apache/spark/commit/39387666fe24a2e65a267eabc15c7816c8738449).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3724][ML] RandomForest: More options fo...
Github user yongtang commented on the pull request: https://github.com/apache/spark/pull/11989#issuecomment-205877757 @sethah Thanks. The import has been removed.
[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12049#issuecomment-205871057 **[Test build #54994 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54994/consoleFull)** for PR 12049 at commit [`48d760e`](https://github.com/apache/spark/commit/48d760eed41a3d559ad8aa6363b6000d4b9ed54d).
[GitHub] spark pull request: [SPARK-14257][SQL]Allow multiple continuous qu...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/12049#issuecomment-205869083 retest this please
[GitHub] spark pull request: [SPARK-14335][SQL] Describe function command r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12128#issuecomment-205868721 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14335][SQL] Describe function command r...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12128#issuecomment-205868723 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54986/
[GitHub] spark pull request: [SPARK-14335][SQL] Describe function command r...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12128#issuecomment-205868179 **[Test build #54986 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54986/consoleFull)** for PR 12128 at commit [`927272c`](https://github.com/apache/spark/commit/927272cdc79b7fd8908faea80e8d842d40c2b468). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14362] [SPARK-14406] [SQL] [WIP] DDL Na...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12146#issuecomment-205866052 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54985/
[GitHub] spark pull request: [SPARK-14362] [SPARK-14406] [SQL] [WIP] DDL Na...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12146#issuecomment-205866050 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10530] [CORE] Kill other task attempts ...
Github user devaraj-kavali commented on a diff in the pull request: https://github.com/apache/spark/pull/11996#discussion_r58563479

--- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
@@ -789,6 +791,51 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
     assert(TaskLocation("executor_host1_3") === ExecutorCacheTaskLocation("host1", "3"))
   }

+  test("Kill other task attempts when one attempt belonging to the same task succeeds") {
+    sc = new SparkContext("local", "test")
+    val sched = new FakeTaskScheduler(sc, ("exec1", "host1"), ("exec2", "host2"))
+    val taskSet = FakeTask.createTaskSet(4)
+    val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES)
+    val accumUpdatesByTask: Array[Seq[AccumulableInfo]] = taskSet.tasks.map { task =>
+      task.initialAccumulators.map { a => a.toInfo(Some(0L), None) }
+    }
+    // Offer resources for 4 tasks to start
+    for ((k, v) <- List(
+        "exec1" -> "host1",
+        "exec1" -> "host1",
+        "exec2" -> "host2",
+        "exec2" -> "host2")) {
+      val taskOption = manager.resourceOffer(k, v, NO_PREF)
+      assert(taskOption.isDefined)
+      val task = taskOption.get
+      assert(task.executorId === k)
+    }
+    assert(sched.startedTasks.toSet === Set(0, 1, 2, 3))
+    // Complete 3 of the tasks and leave 1 task running
+    for (id <- Set(0, 1, 2)) {
+      manager.handleSuccessfulTask(id, createTaskResult(id, accumUpdatesByTask(id)))
+      assert(sched.endedTasks(id) === Success)
+    }
+
+    // Wait for the threshold time to start a speculative attempt for the running task
+    Thread.sleep(100)
--- End diff --

Thanks @tgravescs for your quick response. The `Thread.sleep(100)` here matches the threshold value used in `TaskSetManager.checkSpeculatableTasks()`: it is the minimum time a task must have been running before it becomes eligible for a speculative attempt, and I don't see any way to change this default value.

> val medianDuration = durations(min((0.5 * tasksSuccessful).round.toInt, durations.length - 1))
> val threshold = max(SPECULATION_MULTIPLIER * medianDuration, 100)

I don't think this threshold value is related to the config "spark.speculation.interval" here.
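The threshold logic quoted above can be exercised in isolation. The sketch below re-implements it outside `TaskSetManager` (this is not Spark code; `SPECULATION_MULTIPLIER = 1.5` is an assumption matching Spark's default for `spark.speculation.multiplier` at the time), showing why a task must run at least 100 ms before a speculative attempt can launch:

```scala
// Standalone sketch of the speculation threshold quoted above; not Spark code.
// SPECULATION_MULTIPLIER = 1.5 is an assumption (Spark's default for
// spark.speculation.multiplier when this PR was written).
object SpeculationThresholdSketch {
  val SPECULATION_MULTIPLIER = 1.5

  def threshold(durations: Array[Long], tasksSuccessful: Int): Double = {
    val sorted = durations.sorted
    val medianDuration =
      sorted(math.min((0.5 * tasksSuccessful).round.toInt, sorted.length - 1))
    // The max(..., 100) floor is what the test's Thread.sleep(100) waits out.
    math.max(SPECULATION_MULTIPLIER * medianDuration, 100)
  }

  def main(args: Array[String]): Unit = {
    // Three tasks finished in 40, 50 and 60 ms: 1.5 * median is below 100 ms,
    // so the 100 ms floor wins.
    println(threshold(Array(40L, 50L, 60L), 3))
  }
}
```

With very short successful tasks the multiplier term stays under 100 ms, so the floor dominates; that is why the test cannot simply shorten the wait.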
[GitHub] spark pull request: [SPARK-14362] [SPARK-14406] [SQL] [WIP] DDL Na...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12146#issuecomment-205865589 **[Test build #54985 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54985/consoleFull)** for PR 12146 at commit [`5393174`](https://github.com/apache/spark/commit/5393174badbccc1794243ac6e18c424f8f062cf7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14362] [SPARK-14406] [SQL] [WIP] DDL Na...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/12146#issuecomment-205861283 @yhuai Sure, will do it in this PR. Thanks!
[GitHub] spark pull request: [SPARK-3724][ML] RandomForest: More options fo...
Github user sethah commented on the pull request: https://github.com/apache/spark/pull/11989#issuecomment-205857272 This LGTM other than one small comment about imports. @MLnick could you make a final pass?
[GitHub] spark pull request: [SPARK-14399] Remove unnecessary excludes from...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/12171#issuecomment-205858120 It looks like Hive might actually need Joda time:
```
- analyze MetastoreRelations *** FAILED ***
  org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/joda/time/ReadWritableInstant
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:455)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:440)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:226)
  at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:173)
  at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:172)
  at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:215)
  at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:440)
  at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:430)
  at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:351)
  at org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:183)
  ...
01:36:40.566 ERROR hive.ql.exec.DDLTask: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hive.serde2.lazy.objectinspector.primitive.LazyPrimitiveObjectInspectorFactory
```
https://stackoverflow.com/questions/26259717/facing-java-lang-noclassdeffounderror-org-joda-time-readableinstant-error-even
[GitHub] spark pull request: [SPARK-3724][ML] RandomForest: More options fo...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11989#discussion_r58558671

--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -27,6 +27,7 @@ import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.{DecisionTreeSuite => OldDTSuite, EnsembleTestHelper}
 import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo, QuantileStrategy, Strategy => OldStrategy}
 import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, GiniCalculator}
+import org.apache.spark.mllib.tree.model.RandomForestModel
--- End diff --

Not sure why this import was added. It can be removed.
[GitHub] spark pull request: [SPARK-14396] [SQL] Throw Exceptions for DDLs ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12169#issuecomment-205855786 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14396] [SQL] Throw Exceptions for DDLs ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12169#issuecomment-205855791 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54983/
[GitHub] spark pull request: [SPARK-14396] [SQL] Throw Exceptions for DDLs ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12169#issuecomment-205855383 **[Test build #54983 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54983/consoleFull)** for PR 12169 at commit [`140f859`](https://github.com/apache/spark/commit/140f85998953f1d945df4f318ac0a88d197583cd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14354][SQL] Let Expand take name expres...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12138#issuecomment-205853906 **[Test build #54993 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54993/consoleFull)** for PR 12138 at commit [`8a18acd`](https://github.com/apache/spark/commit/8a18acd6c14c346e409f7da709e7f2d33f6e4662).
[GitHub] spark pull request: [SPARK-14354][SQL] Let Expand take name expres...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/12138#issuecomment-205851532 retest this please.
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12087#issuecomment-205850589 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12087#issuecomment-205850592 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54992/
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12087#issuecomment-205850563 **[Test build #54992 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54992/consoleFull)** for PR 12087 at commit [`a5b0d57`](https://github.com/apache/spark/commit/a5b0d57bb7bce985771ede00b0f116a4cce82900). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/12087#issuecomment-205848733 benchmark added, the result is included in the benchmark code. I also ran this benchmark against the master branch; the result is:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11.4
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
Dataset.map:        Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
-----------------------------------------------------------------------------
scala function            2471 / 2531          4.0          247.1        1.0X
java function             2416 / 2478          4.1          241.6        1.0X
```
So with whole-stage codegen we can get about a 2.5x speedup!
[GitHub] spark pull request: [SPARK-14354][SQL] Let Expand take name expres...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12138#issuecomment-205848612 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-14354][SQL] Let Expand take name expres...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12138#issuecomment-205848622 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54991/
[GitHub] spark pull request: [SPARK-14354][SQL] Let Expand take name expres...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12138#issuecomment-205848558 **[Test build #54991 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54991/consoleFull)** for PR 12138 at commit [`8a18acd`](https://github.com/apache/spark/commit/8a18acd6c14c346e409f7da709e7f2d33f6e4662). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class TimestampFromLong(child: Expression) extends UnaryExpression with ExpectsInputTypes `
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12087#issuecomment-205847599 **[Test build #54992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54992/consoleFull)** for PR 12087 at commit [`a5b0d57`](https://github.com/apache/spark/commit/a5b0d57bb7bce985771ede00b0f116a4cce82900).
[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...
Github user pravingadakh commented on a diff in the pull request: https://github.com/apache/spark/pull/12176#discussion_r58553918

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -542,10 +539,11 @@ private[clustering] object OnlineLDAOptimizer {
       expElogbeta: BDM[Double],
       alpha: breeze.linalg.Vector[Double],
       gammaShape: Double,
-      k: Int): (BDV[Double], BDM[Double]) = {
-    val (ids: List[Int], cts: Array[Double]) = termCounts match {
-      case v: DenseVector => ((0 until v.size).toList, v.values)
-      case v: SparseVector => (v.indices.toList, v.values)
+      k: Int,
+      ids: List[Int]): (BDV[Double], BDM[Double]) = {
+    val cts: Array[Double] = termCounts match {
--- End diff --

Yes, it looks redundant, but `values` is not a member of the parent trait `Vector`; it is defined separately on each implementation (`DenseVector` and `SparseVector`), so the pattern match is still needed.
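The point about `values` living on the concrete classes rather than the trait is easy to see with a stripped-down hierarchy. The `Vec`/`DenseVec`/`SparseVec` names below are hypothetical; they only mirror the shape of mllib's `Vector` classes and are not Spark code:

```scala
// Minimal illustration: `values` is declared on the concrete case classes,
// not on the shared trait, so callers must pattern-match to reach it.
sealed trait Vec
final case class DenseVec(values: Array[Double]) extends Vec
final case class SparseVec(indices: Array[Int], values: Array[Double]) extends Vec

object VecDemo {
  // `v.values` would not compile with v: Vec; the match recovers the concrete type.
  def valuesOf(v: Vec): Array[Double] = v match {
    case d: DenseVec  => d.values
    case s: SparseVec => s.values
  }

  def main(args: Array[String]): Unit = {
    println(valuesOf(SparseVec(Array(0, 3), Array(1.0, 2.0))).mkString(","))
  }
}
```

Declaring `def values: Array[Double]` on the trait itself would remove the match, but that is an API change to the shared `Vector` trait rather than a local refactor.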
[GitHub] spark pull request: [SPARK-14354][SQL] Let Expand take name expres...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/12138#discussion_r58553204

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1659,11 +1665,12 @@ object TimeWindowing extends Rule[LogicalPlan] {
     val windowEnd = windowStart + window.windowDuration

     CreateNamedStruct(
-      Literal(WINDOW_START) :: windowStart ::
--- End diff --

Previously we manually set the output of Expand here to `TimestampType` (`windowAttr`). Since `windowStart` and `windowEnd` produce long values, when we infer Expand's output from its projections we get `LongType` instead of `TimestampType`, so we need to explicitly convert the `LongType` to `TimestampType`.
[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...
Github user pravingadakh commented on a diff in the pull request: https://github.com/apache/spark/pull/12176#discussion_r58553056

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -440,12 +440,9 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
     val stat = BDM.zeros[Double](k, vocabSize)
     var gammaPart = List[BDV[Double]]()
     nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
-      val ids: List[Int] = termCounts match {
-        case v: DenseVector => (0 until v.size).toList
-        case v: SparseVector => v.indices.toList
-      }
+      val ids: List[Int] = LDAUtils.vectorAsList(termCounts)
--- End diff --

Yes, I agree. It is better to simply return `ids` rather than having callers compute it. I will make that change.
[GitHub] spark pull request: [SPARK-14354][SQL] Let Expand take name expres...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12138#issuecomment-205845316 **[Test build #54991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54991/consoleFull)** for PR 12138 at commit [`8a18acd`](https://github.com/apache/spark/commit/8a18acd6c14c346e409f7da709e7f2d33f6e4662).
[GitHub] spark pull request: [SPARK-12566] [ML] [WIP] GLM model family, lin...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11549#issuecomment-205844978 @jkbradley @hhbyyh I can work on this PR.
[GitHub] spark pull request: [SPARK-14245] [Web UI] Display the user in the...
Github user sarutak commented on the pull request: https://github.com/apache/spark/pull/12123#issuecomment-205844392 Yeah, we can see the user name in the history page and it's gotten using the REST API.
[GitHub] spark pull request: [SPARK-14298] [ML] [MLlib] LDA should support ...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/12089#issuecomment-205843565 @jkbradley Agree, I will update the PR soon.
[GitHub] spark pull request: [SPARK-14369][SQL][WIP][test-hadoop2.2] Locali...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-205834944 **[Test build #54988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54988/consoleFull)** for PR 12153 at commit [`a1f527a`](https://github.com/apache/spark/commit/a1f527aa2c91776085818c360f9ea3d0a8d5d616).
[GitHub] spark pull request: [SPARK-14404][SQL][TESTS] HDFSMetadataLogSuite...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12177#issuecomment-205837805 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-14204] [SQL] register driverClass rathe...
Github user mchalek commented on the pull request: https://github.com/apache/spark/pull/12000#issuecomment-205840895 Bump. It would be nice to get closure on this. I doubt we were in the minority in being affected by it (although, admittedly, complaints of other failures seem not to have surfaced yet).
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12087#issuecomment-205840164 **[Test build #54990 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54990/consoleFull)** for PR 12087 at commit [`5a96ae4`](https://github.com/apache/spark/commit/5a96ae46b1a9f697b9541ae4abc408069b747315).
[GitHub] spark pull request: [SPARK-14362] [SQL] DDL Native Support: Drop V...
Github user yhuai commented on the pull request: https://github.com/apache/spark/pull/12146#issuecomment-205839654 @gatorsmile Thank you for working on it. I feel the changes to the DropTable command should also make that command natively supported by Spark (no runSqlHive)? If so, we can also resolve https://issues.apache.org/jira/browse/SPARK-14406?
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12087#issuecomment-205839287 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-14296][SQL] whole stage codegen support...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12087#issuecomment-205839240 **[Test build #54989 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54989/consoleFull)** for PR 12087 at commit [`1707c21`](https://github.com/apache/spark/commit/1707c21ac7b25824c180ee9142372cefdc87fa5c).
* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds no public classes.