[GitHub] spark issue #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user Sherry302 commented on the issue: https://github.com/apache/spark/pull/14659 Hi, @steveloughran Thank you very much for the comments. I have created a Hadoop JIRA [HADOOP-13527](https://issues.apache.org/jira/browse/HADOOP-13527) and attached the patch; could you please review it? I am unable to assign the JIRA to myself; could you please add me to the "contributor" role in Hadoop? Thanks again. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14738: [MINOR][ML]Add expert param support to SharedParamsCodeG...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14738 Merged build finished. Test PASSed.
[GitHub] spark issue #14738: [MINOR][ML]Add expert param support to SharedParamsCodeG...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14738 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64159/ Test PASSed.
[GitHub] spark issue #14738: [MINOR][ML]Add expert param support to SharedParamsCodeG...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14738 **[Test build #64159 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64159/consoleFull)** for PR 14738 at commit [`ba6d731`](https://github.com/apache/spark/commit/ba6d73116f92385fc4d0d9fed8aaf3aab7e5a6a4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14738: [MINOR][ML]Add expert param support to SharedParamsCodeG...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14738 **[Test build #64159 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64159/consoleFull)** for PR 14738 at commit [`ba6d731`](https://github.com/apache/spark/commit/ba6d73116f92385fc4d0d9fed8aaf3aab7e5a6a4).
[GitHub] spark pull request #14738: [MINOR][ML]Add expert param support to SharedPara...
GitHub user hqzizania opened a pull request: https://github.com/apache/spark/pull/14738 [MINOR][ML] Add expert param support to SharedParamsCodeGen ## What changes were proposed in this pull request? Add expert param support to SharedParamsCodeGen, where aggregationDepth, an expert param, is added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/hqzizania/spark SPARK-17090-minor Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14738.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14738 commit 6c2c514073a05d578e4ca1bb5c120506b58ce72d Author: hqzizania Date: 2016-08-19T17:47:24Z add aggregationDepth to SharedParamsCodeGen commit cc37a89308ab7c4064f84b6248d0d6888ba9e64f Author: hqzizania Date: 2016-08-21T03:19:21Z Merge remote-tracking branch 'origin/master'
[GitHub] spark pull request #14625: [SPARK-17045] [SQL] Build/move Join-related test ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14625#discussion_r75588865 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala --- @@ -245,6 +245,10 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext { (1 to 100).map(i => (i, i.toString)).toDF("key", "value").createOrReplaceTempView("testdata") +Seq((1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)) --- End diff -- To be honest, it is hard to write test data, especially when we want very few rows in each data set.
[GitHub] spark pull request #14625: [SPARK-17045] [SQL] Build/move Join-related test ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14625#discussion_r75588856 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala --- @@ -245,6 +245,10 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext { (1 to 100).map(i => (i, i.toString)).toDF("key", "value").createOrReplaceTempView("testdata") +Seq((1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)) --- End diff -- The major differences are the data: they have different data distributions. For example, `testData` does not have duplicate key values, but `testData2` has fewer rows and duplicate key values. `src1` has nulls but `src` does not. Your concern is valid. We should change the names; otherwise, it is hard to understand the reasons.
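The data-distribution differences described above can be illustrated with a small self-contained sketch. The `testData`/`testData2` rows mirror the values quoted in the diff; in the real suite they are registered as temp views, while here plain sequences stand in:

```scala
// Stand-ins for the pre-loaded test tables discussed above.
val testData = (1 to 100).map(i => (i, i.toString)) // one row per key
val testData2 = Seq((1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)) // repeated keys

// testData has no duplicate keys; testData2 has several rows per key,
// which exercises join paths that must produce multiple matches per key.
val hasDuplicateKeys1 = testData.map(_._1).distinct.size < testData.size
val hasDuplicateKeys2 = testData2.map(_._1).distinct.size < testData2.size
```

This is why a single key-value table cannot replace both: the duplicate-key case drives different join behavior than the unique-key case.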
[GitHub] spark issue #14712: [SPARK-17072] [SQL] support table-level statistics gener...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14712 Spark SQL already has its own metastore: `InMemoryCatalog`. And we do have an abstraction for the metastore: `ExternalCatalog`. We have 2 targets here: 1. add table statistics in Spark SQL 2. Spark SQL and Hive should recognize table statistics from each other. I think target 1 is more important, and we do need an implementation that does not depend on Hive features. > Actually, we desperately need spark sql to have its own metastore, because we need to persist statistics like histograms which AFAIK hive metastore doesn't support. We store table statistics in table properties, so why would the Hive metastore not support them? Do you mean Hive can't recognize them? But I think that's OK; we should not limit our table statistics to what Hive supports.
[GitHub] spark pull request #14625: [SPARK-17045] [SQL] Build/move Join-related test ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14625#discussion_r75588814 --- Diff: sql/core/src/test/resources/sql-tests/inputs/join.sql --- @@ -0,0 +1,225 @@ +-- join nested table expressions (auto_join0.q) --- End diff -- : ) That is for helping reviewers know the origins of the queries. If you think we do not care, we can remove it.
[GitHub] spark pull request #14601: [SPARK-13979][Core] Killed executor is re spawned...
Github user agsachin commented on a diff in the pull request: https://github.com/apache/spark/pull/14601#discussion_r75588799 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala --- @@ -107,6 +107,14 @@ class SparkHadoopUtil extends Logging { if (key.startsWith("spark.hadoop.")) { hadoopConf.set(key.substring("spark.hadoop.".length), value) } + // Copy any "fs.swift2d.foo=bar" properties into conf as "fs.swift2d.foo=bar" --- End diff -- That's a nice suggestion; we could add configs for Azure as well. I am not familiar with Azure; do you have any sample code I could use to understand, run, and test it?
[GitHub] spark pull request #14601: [SPARK-13979][Core] Killed executor is re spawned...
Github user agsachin commented on a diff in the pull request: https://github.com/apache/spark/pull/14601#discussion_r75588790 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala --- @@ -102,11 +102,19 @@ class SparkHadoopUtil extends Logging { hadoopConf.set("fs.s3n.awsSecretAccessKey", accessKey) hadoopConf.set("fs.s3a.secret.key", accessKey) } - // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar" conf.getAll.foreach { case (key, value) => +// Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar" if (key.startsWith("spark.hadoop.")) { hadoopConf.set(key.substring("spark.hadoop.".length), value) } + // Copy any "fs.swift2d.foo=bar" properties into conf as "fs.swift2d.foo=bar" +else if (key.startsWith("fs.swift2d")){ + hadoopConf.set(key, value) --- End diff -- I had added this because I was using https://github.com/SparkTC/stocator. Now I have updated it for `hadoop-openstack` as well.
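The copy logic under discussion can be sketched in a self-contained way. Here a mutable map stands in for the Hadoop `Configuration`, and the property names and values are made up for illustration:

```scala
import scala.collection.mutable

// Stand-in for a Hadoop Configuration object.
val hadoopConf = mutable.Map[String, String]()

// Hypothetical Spark conf entries.
val sparkConfEntries = Seq(
  "spark.hadoop.fs.defaultFS" -> "hdfs://nn:8020",
  "fs.swift2d.service.demo.auth.url" -> "https://auth.example.com",
  "spark.app.name" -> "demo")

sparkConfEntries.foreach { case (key, value) =>
  if (key.startsWith("spark.hadoop.")) {
    // "spark.hadoop.foo=bar" is copied into the Hadoop conf as "foo=bar"
    hadoopConf(key.substring("spark.hadoop.".length)) = value
  } else if (key.startsWith("fs.swift2d")) {
    // swift2d keys are copied through unchanged, as the diff proposes
    hadoopConf(key) = value
  }
  // everything else (e.g. spark.app.name) is not propagated
}
```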
[GitHub] spark issue #14737: [Spark-17171][WEB UI] DAG will list all partitions in th...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14737 **[Test build #64158 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64158/consoleFull)** for PR 14737 at commit [`595453f`](https://github.com/apache/spark/commit/595453fbb2ccdd4009821724adefb829a13890c7).
[GitHub] spark pull request #14737: [Spark-17171][WEB UI] DAG will list all partition...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/14737 [Spark-17171][WEB UI] DAG will list all partitions in the graph ## What changes were proposed in this pull request? The DAG view lists all partitions in the graph; it is too slow and makes it hard to see the whole graph. Usually we don't want to see all partitions; we just want to see the relations of the DAG graph, so I show only 2 root nodes for RDDs. Before this PR, the DAG graph looks like [dag1.png](https://issues.apache.org/jira/secure/attachment/12824702/dag1.png); after this PR, the DAG graph looks like [dag2.png](https://issues.apache.org/jira/secure/attachment/12824703/dag2.png) You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-17171 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14737.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14737 commit 7991d7622260bc8e65ee9b934d376df2597c9a11 Author: cenyuhai Date: 2016-08-20T15:44:38Z Just show 2 root partitions for a stage commit 869eaaf23f79eefbc6a8ff7a7b9efbc4a9f8c6b7 Author: 岑玉海 <261810...@qq.com> Date: 2016-08-21T03:55:04Z Merge pull request #8 from apache/master merge latest code to my fork commit 595453fbb2ccdd4009821724adefb829a13890c7 Author: cenyuhai Date: 2016-08-21T04:06:06Z Merge remote-tracking branch 'remotes/origin/master' into SPARK-17171
[GitHub] spark issue #14712: [SPARK-17072] [SQL] support table-level statistics gener...
Github user wzhfy commented on the issue: https://github.com/apache/spark/pull/14712 I suggest in the current stage, we still follow hive's convention. When spark sql has its own metastore, we can bridge between these two metastores by a mapping between two different sets of names/data structures, and then provide a config for users to declare their preference.
[GitHub] spark pull request #14717: [SPARK-17090][ML]Make tree aggregation level in l...
Github user hqzizania commented on a diff in the pull request: https://github.com/apache/spark/pull/14717#discussion_r75588359 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala --- @@ -389,4 +389,21 @@ private[ml] trait HasSolver extends Params { /** @group getParam */ final def getSolver: String = $(solver) } + +/** + * Trait for shared param aggregationDepth (default: 2). + */ +private[ml] trait HasAggregationDepth extends Params { + + /** + * Param for suggested depth for treeAggregate (>= 2). + * @group param --- End diff -- OK, thanks
[GitHub] spark pull request #14717: [SPARK-17090][ML]Make tree aggregation level in l...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14717#discussion_r75588252 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala --- @@ -389,4 +389,21 @@ private[ml] trait HasSolver extends Params { /** @group getParam */ final def getSolver: String = $(solver) } + +/** + * Trait for shared param aggregationDepth (default: 2). + */ +private[ml] trait HasAggregationDepth extends Params { + + /** + * Param for suggested depth for treeAggregate (>= 2). + * @group param --- End diff -- This is very small. You can just submit a PR with minor in title without going through the JIRA.
[GitHub] spark issue #14682: [SPARK-17104][SQL] LogicalRelation.newInstance should fo...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14682 @cloud-fan Thank you for review.
[GitHub] spark pull request #14712: [SPARK-17072] [SQL] support table-level statistic...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14712#discussion_r75588230 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala --- @@ -33,7 +34,7 @@ import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, CatalogTable} * Right now, it only supports Hive tables and it only updates the size of a Hive table * in the Hive metastore. */ -case class AnalyzeTableCommand(tableName: String) extends RunnableCommand { +case class AnalyzeTableCommand(tableName: String, noscan: Boolean = true) extends RunnableCommand { override def run(sparkSession: SparkSession): Seq[Row] = { --- End diff -- Not related to this PR, but it looks like `AnalyzeTableCommand` doesn't handle the possible `NoSuchTableException` thrown by `sessionState.catalog.lookupRelation`. It would be better to handle it and provide an error message.
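The suggested error handling could look roughly like the following self-contained sketch; `NoSuchTableException` and `lookupRelation` here are simplified stand-ins for the Spark internals, not the real signatures:

```scala
// Hypothetical stand-in for Spark's NoSuchTableException.
case class NoSuchTableException(table: String)
  extends Exception(s"Table or view not found: $table")

// Stand-in for sessionState.catalog.lookupRelation.
def lookupRelation(table: String): String =
  if (table == "src") "relation" else throw NoSuchTableException(table)

// Wrap the lookup so ANALYZE TABLE reports a clear error instead of
// letting a raw NoSuchTableException escape.
def analyzeTable(table: String): Either[String, String] =
  try Right(lookupRelation(table))
  catch {
    case NoSuchTableException(t) =>
      Left(s"ANALYZE TABLE failed: table or view '$t' does not exist")
  }
```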
[GitHub] spark issue #14712: [SPARK-17072] [SQL] support table-level statistics gener...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/14712 If it is a Hive table, I think we should respect Hive's statistics.
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user sarutak commented on the issue: https://github.com/apache/spark/pull/14719 @cloud-fan Of course. I'll write a design doc soon.
[GitHub] spark pull request #14625: [SPARK-17045] [SQL] Build/move Join-related test ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14625#discussion_r75588146 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala --- @@ -245,6 +245,10 @@ class SQLQueryTestSuite extends QueryTest with SharedSQLContext { (1 to 100).map(i => (i, i.toString)).toDF("key", "value").createOrReplaceTempView("testdata") +Seq((1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)) --- End diff -- Previously we had 3 pre-loaded tables: `testdata`, `arraydata`, `mapdata`, which are a key-value table, an array-type table, and a map-type table. For the new join tests, I think only `lowerCaseData`, `upperCaseData`, `srcpart` make sense; why can't we use `testdata` for `testData2`, `src` and `src2`? They are all key-value tables.
[GitHub] spark pull request #14625: [SPARK-17045] [SQL] Build/move Join-related test ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14625#discussion_r75588118 --- Diff: sql/core/src/test/resources/sql-tests/inputs/join.sql --- @@ -0,0 +1,225 @@ +-- join nested table expressions (auto_join0.q) --- End diff -- Do we need to reference the Hive `.q` file? I think the Hive golden file tests will be removed eventually.
[GitHub] spark issue #14717: [SPARK-17090][ML]Make tree aggregation level in linear/l...
Github user hqzizania commented on the issue: https://github.com/apache/spark/pull/14717 Thanks for the reviews :)
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/14719 It's really a hard problem and we have discussed it many times but can't reach a consensus. Do you mind sending a design doc first so that it's easier for other people to review and discuss? Thanks!
[GitHub] spark pull request #14717: [SPARK-17090][ML]Make tree aggregation level in l...
Github user hqzizania commented on a diff in the pull request: https://github.com/apache/spark/pull/14717#discussion_r75588057 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala --- @@ -389,4 +389,21 @@ private[ml] trait HasSolver extends Params { /** @group getParam */ final def getSolver: String = $(solver) } + +/** + * Trait for shared param aggregationDepth (default: 2). + */ +private[ml] trait HasAggregationDepth extends Params { + + /** + * Param for suggested depth for treeAggregate (>= 2). + * @group param --- End diff -- Could it be done in the task (https://issues.apache.org/jira/browse/SPARK-17169) ?
[GitHub] spark pull request #14717: [SPARK-17090][ML]Make tree aggregation level in l...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/14717#discussion_r75587929 --- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala --- @@ -389,4 +389,21 @@ private[ml] trait HasSolver extends Params { /** @group getParam */ final def getSolver: String = $(solver) } + +/** + * Trait for shared param aggregationDepth (default: 2). + */ +private[ml] trait HasAggregationDepth extends Params { + + /** + * Param for suggested depth for treeAggregate (>= 2). + * @group param --- End diff -- these should be `@group expertParam` and `@group getExpertParam` shouldn't they? Not a big deal, but we may want to fix this before it's forgotten. We'd need to modify the codegen file.
[GitHub] spark pull request #14717: [SPARK-17090][ML]Make tree aggregation level in l...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14717
[GitHub] spark issue #14717: [SPARK-17090][ML]Make tree aggregation level in linear/l...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/14717 LGTM. Merge into master. Thanks.
[GitHub] spark pull request #14717: [SPARK-17090][ML]Make tree aggregation level in l...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14717#discussion_r75587723 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -256,6 +256,17 @@ class LogisticRegression @Since("1.2.0") ( @Since("1.5.0") override def getThresholds: Array[Double] = super.getThresholds + /** + * Suggested depth for treeAggregate (>= 2). + * If the dimensions of features or the number of partitions are large, + * this param could be adjusted to a larger size. --- End diff -- larger value.
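To illustrate why a larger value can help: `treeAggregate`'s depth controls how many rounds of partial merging happen before the final result, so no single reduce step has to combine every partition at once. The following toy model (plain Scala, not the Spark API) mimics the idea by combining per-partition partial sums in rounds of a fixed fan-in:

```scala
// Combine per-partition partials in rounds of `fanIn` until one value
// remains, like the multi-level merge a deeper treeAggregate performs.
def treeCombine(partials: Seq[Int], fanIn: Int): Int =
  if (partials.size <= 1) partials.headOption.getOrElse(0)
  else treeCombine(partials.grouped(fanIn).map(_.sum).toSeq, fanIn)

val partials = Seq(1, 2, 3, 4, 5, 6, 7, 8) // one partial sum per partition

// fanIn = 2 takes 3 rounds; fanIn = 8 merges everything in one round,
// as a flat aggregate (depth effectively 1) would.
val total = treeCombine(partials, 2)
```

The result is the same either way; only the merge topology (and hence the load on the driver/reducers) changes.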
[GitHub] spark pull request #14717: [SPARK-17090][ML]Make tree aggregation level in l...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/14717#discussion_r75587709 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -48,7 +48,7 @@ import org.apache.spark.storage.StorageLevel */ private[classification] trait LogisticRegressionParams extends ProbabilisticClassifierParams with HasRegParam with HasElasticNetParam with HasMaxIter with HasFitIntercept with HasTol - with HasStandardization with HasWeightCol with HasThreshold { + with HasStandardization with HasWeightCol with HasThreshold with HasAggregationDepth{ --- End diff -- space before `{`
[GitHub] spark issue #14723: [SQL][WIP][Test] Supports object-based aggregation funct...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14723 Can you create a JIRA?
[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14723#discussion_r75586776 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/AggregateWithObjectAggregateBufferSuite.scala --- @@ -0,0 +1,156 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql + +import org.apache.spark.sql.AggregateWithObjectAggregateBufferSuite.MaxWithObjectAggregateBuffer +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Expression, GenericMutableRow, MutableRow, UnsafeRow} +import org.apache.spark.sql.catalyst.expressions.aggregate.{ImperativeAggregate, WithObjectAggregateBuffer} +import org.apache.spark.sql.execution.aggregate.{SortAggregateExec} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.test.SharedSQLContext +import org.apache.spark.sql.types.{AbstractDataType, DataType, IntegerType, StructType} + +class AggregateWithObjectAggregateBufferSuite extends QueryTest with SharedSQLContext { + + import testImplicits._ + + private val data = Seq((1, 0), (3, 1), (2, 0), (6, 3), (3, 1), (4, 1), (5, 0)) + + + test("aggregate with object aggregate buffer, should not use HashAggregate") { +val df = data.toDF("a", "b") +val max = new MaxWithObjectAggregateBuffer($"a".expr) + +// Always use SortAggregateExec instead of HashAggregateExec for planning even if the aggregate +// buffer attributes are mutable fields (every field can be mutated inline like int, long...) 
+val allFieldsMutable = max.aggBufferSchema.map(_.dataType).forall(UnsafeRow.isMutable) +val sparkPlan = df.select(Column(max.toAggregateExpression())).queryExecution.sparkPlan +assert(allFieldsMutable == true && sparkPlan.isInstanceOf[SortAggregateExec]) + } + + test("aggregate with object aggregate buffer, no group by") { +val df = data.toDF("a", "b").coalesce(2) +checkAnswer( + df.select(objectAggregateMax($"a"), count($"a"), objectAggregateMax($"b"), count($"b")), + Seq(Row(6, 7, 3, 7)) +) + } + + test("aggregate with object aggregate buffer, with group by") { +val df = data.toDF("a", "b").coalesce(2) +checkAnswer( + df.groupBy($"b").agg(objectAggregateMax($"a"), count($"a"), objectAggregateMax($"a")), + Seq( +Row(0, 5, 3, 5), +Row(1, 4, 3, 4), +Row(3, 6, 1, 6) + ) +) + } + + test("aggregate with object aggregate buffer, empty inputs, no group by") { +val empty = Seq.empty[(Int, Int)].toDF("a", "b") +checkAnswer( + empty.select(objectAggregateMax($"a"), count($"a"), objectAggregateMax($"b"), count($"b")), + Seq(Row(Int.MinValue, 0, Int.MinValue, 0))) + } + + test("aggregate with object aggregate buffer, empty inputs, with group by") { +val empty = Seq.empty[(Int, Int)].toDF("a", "b") +checkAnswer( + empty.groupBy($"b").agg(objectAggregateMax($"a"), count($"a"), objectAggregateMax($"a")), + Seq.empty[Row]) + } + + private def objectAggregateMax(column: Column): Column = { +val max = MaxWithObjectAggregateBuffer(column.expr) +Column(max.toAggregateExpression()) + } +} + +object AggregateWithObjectAggregateBufferSuite { --- End diff -- (we do not need to put the example class inside this object.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
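The suite quoted above exercises an aggregate whose buffer holds a live Java object while a group is being processed. Since the `WithObjectAggregateBuffer` API is still WIP, the lifecycle it documents (initialize, update/merge, then serialize in place before the buffer can be spilled or shuffled) can be illustrated framework-free. Every name in this sketch (`MaxState`, `update`, `merge`, `serialize`) is hypothetical and only mirrors that flow, not the actual trait:

```scala
// Mutable object held in the aggregation buffer during a group's lifetime.
final class MaxState(var max: Int = Int.MinValue)

object ObjectBufferSketch {
  // update: fold one input row into the live buffer object.
  def update(state: MaxState, input: Int): MaxState = {
    state.max = math.max(state.max, input); state
  }

  // merge: combine two partial buffers (the reducer-side step).
  def merge(a: MaxState, b: MaxState): MaxState = {
    a.max = math.max(a.max, b.max); a
  }

  // Stand-in for serializeObjectAggregationBufferInPlace: convert the live
  // object to a storable value once the group is finished.
  def serialize(state: MaxState): Int = state.max

  def main(args: Array[String]): Unit = {
    // Same data as the quoted suite: (value, groupKey) pairs.
    val data = Seq((1, 0), (3, 1), (2, 0), (6, 3), (3, 1), (4, 1), (5, 0))
    val perGroup = data.groupBy(_._2).map { case (key, rows) =>
      val state = rows.map(_._1).foldLeft(new MaxState())(update)
      key -> serialize(state)
    }
    // max per group: 0 -> 5, 1 -> 4, 3 -> 6, matching the suite's expected Rows
    println(perGroup.toSeq.sortBy(_._1))
  }
}
```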
[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14723#discussion_r75586764 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala --- @@ -389,3 +389,89 @@ abstract class DeclarativeAggregate def right: AttributeReference = inputAggBufferAttributes(aggBufferAttributes.indexOf(a)) } } + +/** + * This traits allow user to define an AggregateFunction which can store **arbitrary** Java objects + * in Aggregation buffer during aggregation of each key group. This trait must be mixed with + * class ImperativeAggregate. + * + * Here is how it works in a typical aggregation flow (Partial mode aggregate at Mapper side, and + * Final mode aggregate at Reducer side). + * + * Stage 1: Partial aggregate at Mapper side: + * + * 1. Upon calling method `initialize(aggBuffer: MutableRow)`, user stores an arbitrary empty + *object, object A for example, in aggBuffer. The object A will be used to store the + *accumulated aggregation result. + * 1. Upon calling method `update(mutableAggBuffer: MutableRow, inputRow: InternalRow)` in + *current group (group by key), user extracts object A from mutableAggBuffer, and then updates + *object A with current inputRow. After updating, object A is stored back to mutableAggBuffer. + * 1. After processing all rows of current group, the framework will call method + *`serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow)` to serialize object A + *to a serializable format in place. + * 1. The framework may spill the aggregationBuffer to disk if there is not enough memory. + *It is safe since we have already convert aggregationBuffer to serializable format. + * 1. Spark framework moves on to next group, until all groups have been + *processed. + * + * Shuffling exchange data to Reducer tasks... + * + * Stage 2: Final mode aggregate at Reducer side: + * + * 1. 
Upon calling method `initialize(aggBuffer: MutableRow)`, user stores a new empty object A1 + *in aggBuffer. The object A1 will be used to store the accumulated aggregation result. + * 1. Upon calling method `merge(mutableAggBuffer: MutableRow, inputAggBuffer: InternalRow)`, user + *extracts object A1 from mutableAggBuffer, and extracts object A2 from inputAggBuffer. then + *user needs to merge A1, and A2, and stores the merged result back to mutableAggBuffer. + * 1. After processing all inputAggBuffer of current group (group by key), the Spark framework will + *call method `serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow)` to + *serialize object A1 to a serializable format in place. + * 1. The Spark framework may spill the aggregationBuffer to disk if there is not enough memory. + *It is safe since we have already convert aggregationBuffer to serializable format. + * 1. Spark framework moves on to next group, until all groups have been processed. + */ +trait WithObjectAggregateBuffer { + this: ImperativeAggregate => + + /** + * Serializes and in-place replaces the object stored in Aggregation buffer. The framework + * calls this method every time after finishing updating/merging one group (group by key). + * + * aggregationBuffer before serialization: + * + * The object stored in aggregationBuffer can be **arbitrary** Java objects defined by user. --- End diff -- Seems we want to mention that the data type is `ObjectType`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14723#discussion_r75586760 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala --- @@ -389,3 +389,89 @@ abstract class DeclarativeAggregate def right: AttributeReference = inputAggBufferAttributes(aggBufferAttributes.indexOf(a)) } } + +/** + * This traits allow user to define an AggregateFunction which can store **arbitrary** Java objects + * in Aggregation buffer during aggregation of each key group. This trait must be mixed with + * class ImperativeAggregate. + * + * Here is how it works in a typical aggregation flow (Partial mode aggregate at Mapper side, and + * Final mode aggregate at Reducer side). + * + * Stage 1: Partial aggregate at Mapper side: + * + * 1. Upon calling method `initialize(aggBuffer: MutableRow)`, user stores an arbitrary empty + *object, object A for example, in aggBuffer. The object A will be used to store the + *accumulated aggregation result. + * 1. Upon calling method `update(mutableAggBuffer: MutableRow, inputRow: InternalRow)` in + *current group (group by key), user extracts object A from mutableAggBuffer, and then updates + *object A with current inputRow. After updating, object A is stored back to mutableAggBuffer. + * 1. After processing all rows of current group, the framework will call method + *`serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow)` to serialize object A + *to a serializable format in place. + * 1. The framework may spill the aggregationBuffer to disk if there is not enough memory. + *It is safe since we have already convert aggregationBuffer to serializable format. + * 1. Spark framework moves on to next group, until all groups have been + *processed. + * + * Shuffling exchange data to Reducer tasks... + * + * Stage 2: Final mode aggregate at Reducer side: + * + * 1. 
Upon calling method `initialize(aggBuffer: MutableRow)`, user stores a new empty object A1 + *in aggBuffer. The object A1 will be used to store the accumulated aggregation result. + * 1. Upon calling method `merge(mutableAggBuffer: MutableRow, inputAggBuffer: InternalRow)`, user + *extracts object A1 from mutableAggBuffer, and extracts object A2 from inputAggBuffer. then + *user needs to merge A1, and A2, and stores the merged result back to mutableAggBuffer. + * 1. After processing all inputAggBuffer of current group (group by key), the Spark framework will + *call method `serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow)` to + *serialize object A1 to a serializable format in place. + * 1. The Spark framework may spill the aggregationBuffer to disk if there is not enough memory. + *It is safe since we have already convert aggregationBuffer to serializable format. + * 1. Spark framework moves on to next group, until all groups have been processed. + */ +trait WithObjectAggregateBuffer { + this: ImperativeAggregate => + + /** + * Serializes and in-place replaces the object stored in Aggregation buffer. The framework + * calls this method every time after finishing updating/merging one group (group by key). + * + * aggregationBuffer before serialization: + * + * The object stored in aggregationBuffer can be **arbitrary** Java objects defined by user. 
+ * + * aggregationBuffer after serialization: + * + * The object's type must be one of: + * + * - Null + * - Boolean + * - Byte + * - Short + * - Int + * - Long + * - Float + * - Double + * - Array[Byte] + * - org.apache.spark.sql.types.Decimal + * - org.apache.spark.unsafe.types.UTF8String + * - org.apache.spark.unsafe.types.CalendarInterval + * - org.apache.spark.sql.catalyst.util.MapData + * - org.apache.spark.sql.catalyst.util.ArrayData + * - org.apache.spark.sql.catalyst.InternalRow + * + * Code example: + * + * {{{ + * override def serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow): Unit = { + * val obj = buffer.get(mutableAggBufferOffset, ObjectType(classOf[A])).asInstanceOf[A] + * // Convert the obj to bytes, which is a serializable format. + * buffer(mutableAggBufferOffset) = toBytes(obj) --- End diff -- I am not sure it is the best example. At here, we are showing that the value of a field can be an java object or an byte array. I guess a more general question for this method will be if this approach work for all "supported" serialized types (e.g. the serialized
[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14723#discussion_r75586661 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala --- @@ -389,3 +389,89 @@ abstract class DeclarativeAggregate def right: AttributeReference = inputAggBufferAttributes(aggBufferAttributes.indexOf(a)) } } + +/** + * This traits allow user to define an AggregateFunction which can store **arbitrary** Java objects + * in Aggregation buffer during aggregation of each key group. This trait must be mixed with + * class ImperativeAggregate. + * + * Here is how it works in a typical aggregation flow (Partial mode aggregate at Mapper side, and + * Final mode aggregate at Reducer side). + * + * Stage 1: Partial aggregate at Mapper side: + * + * 1. Upon calling method `initialize(aggBuffer: MutableRow)`, user stores an arbitrary empty + *object, object A for example, in aggBuffer. The object A will be used to store the + *accumulated aggregation result. + * 1. Upon calling method `update(mutableAggBuffer: MutableRow, inputRow: InternalRow)` in + *current group (group by key), user extracts object A from mutableAggBuffer, and then updates + *object A with current inputRow. After updating, object A is stored back to mutableAggBuffer. + * 1. After processing all rows of current group, the framework will call method + *`serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow)` to serialize object A + *to a serializable format in place. + * 1. The framework may spill the aggregationBuffer to disk if there is not enough memory. + *It is safe since we have already convert aggregationBuffer to serializable format. + * 1. Spark framework moves on to next group, until all groups have been + *processed. + * + * Shuffling exchange data to Reducer tasks... + * + * Stage 2: Final mode aggregate at Reducer side: + * + * 1. 
Upon calling method `initialize(aggBuffer: MutableRow)`, user stores a new empty object A1 + *in aggBuffer. The object A1 will be used to store the accumulated aggregation result. + * 1. Upon calling method `merge(mutableAggBuffer: MutableRow, inputAggBuffer: InternalRow)`, user + *extracts object A1 from mutableAggBuffer, and extracts object A2 from inputAggBuffer. then + *user needs to merge A1, and A2, and stores the merged result back to mutableAggBuffer. + * 1. After processing all inputAggBuffer of current group (group by key), the Spark framework will + *call method `serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow)` to + *serialize object A1 to a serializable format in place. + * 1. The Spark framework may spill the aggregationBuffer to disk if there is not enough memory. + *It is safe since we have already convert aggregationBuffer to serializable format. + * 1. Spark framework moves on to next group, until all groups have been processed. + */ +trait WithObjectAggregateBuffer { + this: ImperativeAggregate => + + /** + * Serializes and in-place replaces the object stored in Aggregation buffer. The framework + * calls this method every time after finishing updating/merging one group (group by key). + * + * aggregationBuffer before serialization: + * + * The object stored in aggregationBuffer can be **arbitrary** Java objects defined by user. + * + * aggregationBuffer after serialization: + * + * The object's type must be one of: --- End diff -- How about we rephrase this part? We mentioned that we can use `arbitrary` java objects. But, here we are saying that `The object's type must be one of:`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14723#discussion_r75586622 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala --- @@ -389,3 +389,89 @@ abstract class DeclarativeAggregate def right: AttributeReference = inputAggBufferAttributes(aggBufferAttributes.indexOf(a)) } } + +/** + * This traits allow user to define an AggregateFunction which can store **arbitrary** Java objects + * in Aggregation buffer during aggregation of each key group. This trait must be mixed with + * class ImperativeAggregate. --- End diff -- I think at here, we need to emphasize that the buffer is an internal buffer because we will emit this buffer as the result of an aggregate operator.
[GitHub] spark issue #14674: [SPARK-17002][CORE]: Document that spark.ssl.protocol. i...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/14674 @srowen Do you have any suggestions on our discussion above? Thanks!
[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14723#discussion_r75586350 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala --- @@ -389,3 +389,89 @@ abstract class DeclarativeAggregate def right: AttributeReference = inputAggBufferAttributes(aggBufferAttributes.indexOf(a)) } } + +/** + * This traits allow user to define an AggregateFunction which can store **arbitrary** Java objects + * in Aggregation buffer during aggregation of each key group. This trait must be mixed with + * class ImperativeAggregate. + * + * Here is how it works in a typical aggregation flow (Partial mode aggregate at Mapper side, and + * Final mode aggregate at Reducer side). + * + * Stage 1: Partial aggregate at Mapper side: + * + * 1. Upon calling method `initialize(aggBuffer: MutableRow)`, user stores an arbitrary empty + *object, object A for example, in aggBuffer. The object A will be used to store the + *accumulated aggregation result. + * 1. Upon calling method `update(mutableAggBuffer: MutableRow, inputRow: InternalRow)` in + *current group (group by key), user extracts object A from mutableAggBuffer, and then updates + *object A with current inputRow. After updating, object A is stored back to mutableAggBuffer. + * 1. After processing all rows of current group, the framework will call method + *`serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow)` to serialize object A + *to a serializable format in place. + * 1. The framework may spill the aggregationBuffer to disk if there is not enough memory. + *It is safe since we have already convert aggregationBuffer to serializable format. + * 1. Spark framework moves on to next group, until all groups have been + *processed. + * + * Shuffling exchange data to Reducer tasks... + * + * Stage 2: Final mode aggregate at Reducer side: + * + * 1. 
Upon calling method `initialize(aggBuffer: MutableRow)`, user stores a new empty object A1 + *in aggBuffer. The object A1 will be used to store the accumulated aggregation result. + * 1. Upon calling method `merge(mutableAggBuffer: MutableRow, inputAggBuffer: InternalRow)`, user + *extracts object A1 from mutableAggBuffer, and extracts object A2 from inputAggBuffer. then + *user needs to merge A1, and A2, and stores the merged result back to mutableAggBuffer. + * 1. After processing all inputAggBuffer of current group (group by key), the Spark framework will + *call method `serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow)` to + *serialize object A1 to a serializable format in place. + * 1. The Spark framework may spill the aggregationBuffer to disk if there is not enough memory. + *It is safe since we have already convert aggregationBuffer to serializable format. + * 1. Spark framework moves on to next group, until all groups have been processed. + */ +trait WithObjectAggregateBuffer { + this: ImperativeAggregate => --- End diff -- oh, seems this trait will be still an java `interface`. But, I think in general, we do not really need to have this line. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14723#discussion_r75586238 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala --- @@ -389,3 +389,89 @@ abstract class DeclarativeAggregate def right: AttributeReference = inputAggBufferAttributes(aggBufferAttributes.indexOf(a)) } } + +/** + * This traits allow user to define an AggregateFunction which can store **arbitrary** Java objects --- End diff -- I think it is better to remove `allow users` because it is not exposed to end-users for defining UDAFs.
[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14723#discussion_r75586232 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala --- @@ -389,3 +389,89 @@ abstract class DeclarativeAggregate def right: AttributeReference = inputAggBufferAttributes(aggBufferAttributes.indexOf(a)) } } + +/** + * This traits allow user to define an AggregateFunction which can store **arbitrary** Java objects + * in Aggregation buffer during aggregation of each key group. This trait must be mixed with + * class ImperativeAggregate. + * + * Here is how it works in a typical aggregation flow (Partial mode aggregate at Mapper side, and + * Final mode aggregate at Reducer side). + * + * Stage 1: Partial aggregate at Mapper side: + * + * 1. Upon calling method `initialize(aggBuffer: MutableRow)`, user stores an arbitrary empty + *object, object A for example, in aggBuffer. The object A will be used to store the + *accumulated aggregation result. + * 1. Upon calling method `update(mutableAggBuffer: MutableRow, inputRow: InternalRow)` in + *current group (group by key), user extracts object A from mutableAggBuffer, and then updates + *object A with current inputRow. After updating, object A is stored back to mutableAggBuffer. + * 1. After processing all rows of current group, the framework will call method + *`serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow)` to serialize object A + *to a serializable format in place. + * 1. The framework may spill the aggregationBuffer to disk if there is not enough memory. + *It is safe since we have already convert aggregationBuffer to serializable format. + * 1. Spark framework moves on to next group, until all groups have been + *processed. + * + * Shuffling exchange data to Reducer tasks... + * + * Stage 2: Final mode aggregate at Reducer side: + * + * 1. 
Upon calling method `initialize(aggBuffer: MutableRow)`, user stores a new empty object A1 + *in aggBuffer. The object A1 will be used to store the accumulated aggregation result. + * 1. Upon calling method `merge(mutableAggBuffer: MutableRow, inputAggBuffer: InternalRow)`, user + *extracts object A1 from mutableAggBuffer, and extracts object A2 from inputAggBuffer. then + *user needs to merge A1, and A2, and stores the merged result back to mutableAggBuffer. + * 1. After processing all inputAggBuffer of current group (group by key), the Spark framework will + *call method `serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow)` to + *serialize object A1 to a serializable format in place. + * 1. The Spark framework may spill the aggregationBuffer to disk if there is not enough memory. + *It is safe since we have already convert aggregationBuffer to serializable format. + * 1. Spark framework moves on to next group, until all groups have been processed. + */ +trait WithObjectAggregateBuffer { + this: ImperativeAggregate => --- End diff -- I guess having this line will make this trait hard to be used in Java. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14723#discussion_r75586233 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala --- @@ -389,3 +389,89 @@ abstract class DeclarativeAggregate def right: AttributeReference = inputAggBufferAttributes(aggBufferAttributes.indexOf(a)) } } + +/** + * This traits allow user to define an AggregateFunction which can store **arbitrary** Java objects --- End diff -- `This trait allows an AggregateFunction to use ...`
[GitHub] spark pull request #14723: [SQL][WIP][Test] Supports object-based aggregatio...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/14723#discussion_r75586183 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala --- @@ -389,3 +389,89 @@ abstract class DeclarativeAggregate def right: AttributeReference = inputAggBufferAttributes(aggBufferAttributes.indexOf(a)) } } + +/** + * This traits allow user to define an AggregateFunction which can store **arbitrary** Java objects + * in Aggregation buffer during aggregation of each key group. This trait must be mixed with + * class ImperativeAggregate. + * + * Here is how it works in a typical aggregation flow (Partial mode aggregate at Mapper side, and + * Final mode aggregate at Reducer side). + * + * Stage 1: Partial aggregate at Mapper side: + * + * 1. Upon calling method `initialize(aggBuffer: MutableRow)`, user stores an arbitrary empty + *object, object A for example, in aggBuffer. The object A will be used to store the + *accumulated aggregation result. + * 1. Upon calling method `update(mutableAggBuffer: MutableRow, inputRow: InternalRow)` in + *current group (group by key), user extracts object A from mutableAggBuffer, and then updates + *object A with current inputRow. After updating, object A is stored back to mutableAggBuffer. + * 1. After processing all rows of current group, the framework will call method + *`serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow)` to serialize object A + *to a serializable format in place. + * 1. The framework may spill the aggregationBuffer to disk if there is not enough memory. + *It is safe since we have already convert aggregationBuffer to serializable format. + * 1. Spark framework moves on to next group, until all groups have been + *processed. + * + * Shuffling exchange data to Reducer tasks... + * + * Stage 2: Final mode aggregate at Reducer side: + * + * 1. 
Upon calling method `initialize(aggBuffer: MutableRow)`, user stores a new empty object A1 + *in aggBuffer. The object A1 will be used to store the accumulated aggregation result. + * 1. Upon calling method `merge(mutableAggBuffer: MutableRow, inputAggBuffer: InternalRow)`, user + *extracts object A1 from mutableAggBuffer, and extracts object A2 from inputAggBuffer. then + *user needs to merge A1, and A2, and stores the merged result back to mutableAggBuffer. + * 1. After processing all inputAggBuffer of current group (group by key), the Spark framework will + *call method `serializeObjectAggregationBufferInPlace(aggregationBuffer: MutableRow)` to + *serialize object A1 to a serializable format in place. + * 1. The Spark framework may spill the aggregationBuffer to disk if there is not enough memory. + *It is safe since we have already convert aggregationBuffer to serializable format. + * 1. Spark framework moves on to next group, until all groups have been processed. + */ +trait WithObjectAggregateBuffer { + this: ImperativeAggregate => --- End diff -- Semes we do not really need this line. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14705 Thanks. Reviewing each change, I think we need this PR (14705) and PR #14734 in 2.0.1 - so maybe only a few lines of conflicts.
[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14705 Yeah so we can do a couple of things. One is we try to cherry-pick this PR to branch-2.0 and then fix all the merge conflicts that are thrown. I think that should handle cases where the method doesn't exist in 2.0 etc. The other option is to create a new PR that is targeted at branch-2.0 (i.e. the cherry-pick / merge can be done as a part of development) and then we can review, merge it. Let me know if you or @junyangq want to try the second option -- If not I can try the first one and see how many conflicts there are.
[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14731 Merged build finished. Test PASSed.
[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14731 **[Test build #64156 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64156/consoleFull)** for PR 14731 at commit [`b08e3c9`](https://github.com/apache/spark/commit/b08e3c9937a63a08b274a1491ea7064168646f1d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14731 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64156/ Test PASSed.
[GitHub] spark issue #14735: [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reforma...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14735 Merged build finished. Test PASSed.
[GitHub] spark issue #14735: [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reforma...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14735 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64157/ Test PASSed.
[GitHub] spark issue #14735: [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reforma...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14735 **[Test build #64157 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64157/consoleFull)** for PR 14735 at commit [`30815e0`](https://github.com/apache/spark/commit/30815e067a37175e0f5d4539c80db6b0ec6cc159). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14705: [SPARK-16508][SparkR] Fix CRAN undocumented/duplicated a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14705 I think a subset of this should go to 2.0.1 as well (as a requirement to fix the warning for CRAN in 2.0.x), but it's a non-trivial port: the mllib isoreg changes are new in 2.1.0 only. What's the best way to proceed?
[GitHub] spark issue #14735: [SPARK-17173][SPARKR] R MLlib refactor, cleanup, reforma...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/14735 This also tightens the signatures for mllib by removing the previously unused `...`:
```
"summary", signature(object = "GeneralizedLinearRegressionModel")
print.summary.GeneralizedLinearRegressionModel
"summary", signature(object = "NaiveBayesModel")
"summary", signature(object = "IsotonicRegressionModel")
"fitted", signature(object = "KMeansModel")
"summary", signature(object = "KMeansModel")
"spark.naiveBayes", signature(data = "SparkDataFrame", formula = "formula"
"summary", signature(object = "GaussianMixtureModel")
```
[GitHub] spark issue #14735: [MINOR][SPARKR] R MLlib refactor, cleanup, reformat, fix...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14735 **[Test build #64157 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64157/consoleFull)** for PR 14735 at commit [`30815e0`](https://github.com/apache/spark/commit/30815e067a37175e0f5d4539c80db6b0ec6cc159).
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14719 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64155/ Test PASSed.
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14719 Merged build finished. Test PASSed.
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14719 **[Test build #64155 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64155/consoleFull)** for PR 14719 at commit [`9ddc9d8`](https://github.com/apache/spark/commit/9ddc9d858fc3d5b269a8a762b356a545f70646d6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #13428: [SPARK-12666][CORE] SparkSubmit packages fix for when 'd...
Github user JoshRosen commented on the issue: https://github.com/apache/spark/pull/13428 Merged to master and branch-2.0.
[GitHub] spark pull request #13428: [SPARK-12666][CORE] SparkSubmit packages fix for ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13428
[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14731 **[Test build #64156 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64156/consoleFull)** for PR 14731 at commit [`b08e3c9`](https://github.com/apache/spark/commit/b08e3c9937a63a08b274a1491ea7064168646f1d).
[GitHub] spark pull request #14601: [SPARK-13979][Core] Killed executor is re spawned...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/14601#discussion_r75584298 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala --- @@ -107,6 +107,14 @@ class SparkHadoopUtil extends Logging { if (key.startsWith("spark.hadoop.")) { hadoopConf.set(key.substring("spark.hadoop.".length), value) } + // Copy any "fs.swift2d.foo=bar" properties into conf as "fs.swift2d.foo=bar" --- End diff -- may want to add `fs.wasb` for azure on Hadoop 2.7+
[GitHub] spark issue #12695: [SPARK-14914] Normalize Paths/URIs for windows.
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/12695 As #13868 does adopt `org.apache.hadoop.fs.Path`, I don't see this patch being needed, though it may highlight some places where the new code may need applying.
[GitHub] spark issue #12695: [SPARK-14914] Normalize Paths/URIs for windows.
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/12695 If you are working with Windows paths, Hadoop's Path class contains the code to do this, stabilised and addressing the corner cases.
[GitHub] spark pull request #14601: [SPARK-13979][Core] Killed executor is re spawned...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/14601#discussion_r75584303 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala --- @@ -102,11 +102,19 @@ class SparkHadoopUtil extends Logging { hadoopConf.set("fs.s3n.awsSecretAccessKey", accessKey) hadoopConf.set("fs.s3a.secret.key", accessKey) } - // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar" conf.getAll.foreach { case (key, value) => +// Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar" if (key.startsWith("spark.hadoop.")) { hadoopConf.set(key.substring("spark.hadoop.".length), value) } + // Copy any "fs.swift2d.foo=bar" properties into conf as "fs.swift2d.foo=bar" +else if (key.startsWith("fs.swift2d")){ + hadoopConf.set(key, value) --- End diff -- What's `swift2d`? It's not the swift client in `hadoop-openstack`, which is `fs.swift`
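The property-copying logic under discussion in this diff is simple to state in isolation. Here is a self-contained sketch of the idea (the object and method names are hypothetical, not Spark's actual `SparkHadoopUtil` code): `spark.hadoop.foo=bar` entries are copied into the Hadoop configuration with the prefix stripped, while filesystem-scheme keys are copied verbatim.

```scala
// Sketch only: illustrates the prefix handling debated in the review above.
object HadoopConfSketch {
  // Given Spark conf entries, produce the Hadoop-side properties:
  //  - "spark.hadoop.foo" -> "foo" (prefix stripped)
  //  - keys matching a filesystem prefix (e.g. "fs.swift.") copied unchanged
  def hadoopProps(sparkConf: Map[String, String],
                  fsPrefixes: Seq[String] = Seq("fs.swift.")): Map[String, String] =
    sparkConf.collect {
      case (k, v) if k.startsWith("spark.hadoop.") =>
        k.stripPrefix("spark.hadoop.") -> v
      case (k, v) if fsPrefixes.exists(p => k.startsWith(p)) =>
        k -> v
    }
}
```

The reviewer's question stands regardless of the mechanics: the interesting decision is which `fs.*` prefixes (if any) deserve verbatim pass-through, since `hadoop-openstack` uses `fs.swift`, not `fs.swift2d`.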
[GitHub] spark issue #14718: [SPARK-16711] YarnShuffleService doesn't re-init properl...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/14718 Moving the jackson/leveldb dependencies isn't going to create problems on the YARN shuffle classpath, is it? Given the versions aren't changing, I'm not too worried; I just want to make sure.
[GitHub] spark issue #14736: [SPARK-17024][SQL] Weird behaviour of the DataFrame when...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14736 Can one of the admins verify this patch?
[GitHub] spark issue #14659: [SPARK-16757] Set up Spark caller context to HDFS
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/14659 That CallerContext doesn't list Spark as one of the users in its LimitedPrivate scope. Add a Hadoop patch there and I'll get it in. This avoids arguments later when someone breaks the API, and is especially important when using reflection, as it's harder to detect when the class is being used.
[GitHub] spark pull request #14736: [SPARK-17024][SQL] Weird behaviour of the DataFra...
GitHub user izeigerman opened a pull request: https://github.com/apache/spark/pull/14736 [SPARK-17024][SQL] Weird behaviour of the DataFrame when a column name contains dots.

## What changes were proposed in this pull request?

Spark SQL doesn't support field names that contain dots. It's not only about queries like `select`, but about any manipulation of the dataset. Here is a dataset example:
```
field1,field1.some,field2,field3.some
"field1","field1.some","field2","field3.some"
```
And a code snippet:
```
scala> spark.sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/tmp/test.csv").collect
```
The result of this operation:
```
org.apache.spark.sql.AnalysisException: Can't extract value from field1#0;
at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:253)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$3.apply(LogicalPlan.scala:252)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:252)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
...
```
The following code fails with the same error:
```
scala> val df = spark.sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load("/tmp/test.csv")
df: org.apache.spark.sql.DataFrame = [field1: string, field1.some: string ... 2 more fields]
scala> df.select("field1", "`field1.some`", "field2", "`field3.some`").collect
```
This patch makes `LogicalPlan` treat a dot-separated string as an attribute's name when nested-field resolution fails.

## How was this patch tested?

Tested with the mentioned CSV file in `CSVSuite` (not committed). I'm not sure where exactly I should put a test for this; `LogicalPlanSuite` doesn't look like the appropriate place.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/izeigerman/spark iaroslav/spark-17024
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14736.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14736
commit 6059dfce21c071f4022ab6a17316a85748f0729e
Author: Iaroslav Zeigerman
Date: 2016-08-20T19:18:24Z
fix attribute resolution for the Logical Plan in case when attributes contain dots in their names.
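The resolution fallback the PR describes can be illustrated with a toy resolver. This is a hypothetical helper, not the actual `LogicalPlan.resolve` code: it first tries to read `a.b` as column `a` with nested field `b`, and only if that fails treats the whole dotted string as one column name, which is the behavior the patch adds.

```scala
// Toy model of the fallback: `columns` is the set of top-level column names.
object DottedNameSketch {
  def resolveColumn(name: String, columns: Set[String]): Option[String] = {
    val root = name.takeWhile(_ != '.')
    if (name.contains('.') && columns.contains(root))
      Some(root)                       // resolved as nested-field access on `root`
    else if (columns.contains(name))
      Some(name)                       // fallback: whole dotted string is the column
    else
      None                             // unresolved
  }
}
```

Note the ordering also shows why the case in the bug report is awkward: when both `field1` and `field1.some` exist as top-level columns, nested-field resolution on `field1` wins unless the name is explicitly quoted.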
[GitHub] spark issue #14038: [SPARK-16317][SQL] Add a new interface to filter files i...
Github user steveloughran commented on the issue: https://github.com/apache/spark/pull/14038 Path filtering in Hadoop FS calls on anything other than filename is very suboptimal; in #14731 you can see where the filtering has been postponed until after the listing, when the full `FileStatus` entry list has been returned. As filtering is the last operation in the various listFiles calls, there's no penalty to doing the filtering after the results come in. In `FileSystem.globStatus()` the filtering takes place after the glob match, but during the scan... a larger list will be built and returned, but that is all. I think a new filter should be executed after these operations, taking the `FileStatus` object; this provides a superset of the filtering possible within the Hadoop calls (timestamp, filetype, ...), with no performance penalty. It's more flexible than the simple `accept(path)`, and will guarantee that nobody using the API will implement a suboptimal filter. Consider also taking a predicate `FileStatus => Boolean`, rather than requiring callers to implement new classes; it can be fed straight into `Iterator.filter()`. I note you are making extensive use of `listLeafFiles`; that's a potentially inefficient implementation against object stores. Keep using it; I'll patch it to use `FileSystem.listFiles(path, true)` for in-FS recursion and O(files/5000) listing against S3A in Hadoop 2.8, and eventually Azure and Swift.
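The suggestion above, filtering on a `FileStatus => Boolean` predicate after the listing returns, can be sketched in a few lines. The `FileStatus` case class below is a stand-in for `org.apache.hadoop.fs.FileStatus` so the sketch is self-contained; the function name is illustrative.

```scala
// Stand-in for Hadoop's FileStatus with the fields used here.
case class FileStatus(path: String, modificationTime: Long, isDirectory: Boolean)

object ListingFilterSketch {
  // Filter after the listing has returned: the predicate sees the full FileStatus,
  // so it can use timestamp, file type, etc., with no extra round trips to the
  // (object) store, and it plugs straight into the standard collection filter.
  def listFiltered(listing: Seq[FileStatus],
                   pred: FileStatus => Boolean): Seq[FileStatus] =
    listing.filter(pred)
}

// e.g. only plain files modified after some threshold:
// ListingFilterSketch.listFiltered(statuses,
//   s => !s.isDirectory && s.modificationTime > threshold)
```

The design point is that a plain function predicate composes (callers can `&&` conditions) without anyone subclassing a `PathFilter`-style interface, and since it runs after the RPC, no filter choice can make the remote listing slower.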
[GitHub] spark pull request #14731: [SPARK-17159] [streaming]: optimise check for new...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/14731#discussion_r75584026 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala --- @@ -293,8 +290,8 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]]( } /** Get file mod time from cache or fetch it from the file system */ - private def getFileModTime(path: Path) = { -fileToModTime.getOrElseUpdate(path.toString, fs.getFileStatus(path).getModificationTime()) + private def getFileModTime(fs: FileStatus) = { --- End diff -- yes, I was just being minimal about the changes. Inlining is easy.
[GitHub] spark pull request #14731: [SPARK-17159] [streaming]: optimise check for new...
Github user steveloughran commented on a diff in the pull request: https://github.com/apache/spark/pull/14731#discussion_r75584030 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala --- @@ -241,16 +233,21 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]]( * The files with mod time T+5 are not remembered and cannot be ignored (since, t+5 > t+1). * Hence they can get selected as new files again. To prevent this, files whose mod time is more * than current batch time are not considered. + * @param fs file status + * @param currentTime time of the batch + * @param modTimeIgnoreThreshold the ignore threshold + * @return true if the file has been modified within the batch window */ - private def isNewFile(path: Path, currentTime: Long, modTimeIgnoreThreshold: Long): Boolean = { + private def isNewFile(fs: FileStatus, currentTime: Long, modTimeIgnoreThreshold: Long): Boolean = { --- End diff -- I'll fix this.
[GitHub] spark issue #14732: [SPARK-16320] [DOC] Document G1 heap region's effect on ...
Github user petermaxlee commented on the issue: https://github.com/apache/spark/pull/14732 Looks good!
[GitHub] spark pull request #14731: [SPARK-17159] [streaming]: optimise check for new...
Github user petermaxlee commented on a diff in the pull request: https://github.com/apache/spark/pull/14731#discussion_r75583457 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala --- @@ -293,8 +290,8 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]]( } /** Get file mod time from cache or fetch it from the file system */ - private def getFileModTime(path: Path) = { -fileToModTime.getOrElseUpdate(path.toString, fs.getFileStatus(path).getModificationTime()) + private def getFileModTime(fs: FileStatus) = { --- End diff -- should we just remove this function now?
[GitHub] spark issue #14735: [MINOR][SPARKR] R MLlib refactor, cleanup, reformat, fix...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/14735 This seems a big enough change that it might be good to have a JIRA for it?
[GitHub] spark pull request #14731: [SPARK-17159] [streaming]: optimise check for new...
Github user petermaxlee commented on a diff in the pull request: https://github.com/apache/spark/pull/14731#discussion_r75583446 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala --- @@ -241,16 +233,21 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]]( * The files with mod time T+5 are not remembered and cannot be ignored (since, t+5 > t+1). * Hence they can get selected as new files again. To prevent this, files whose mod time is more * than current batch time are not considered. + * @param fs file status + * @param currentTime time of the batch + * @param modTimeIgnoreThreshold the ignore threshold + * @return true if the file has been modified within the batch window */ - private def isNewFile(path: Path, currentTime: Long, modTimeIgnoreThreshold: Long): Boolean = { + private def isNewFile(fs: FileStatus, currentTime: Long, modTimeIgnoreThreshold: Long): Boolean = { --- End diff -- also, `fs` is pretty confusing, because in this context it is often used to refer to a FileSystem. We should pick a different name.
[GitHub] spark pull request #14731: [SPARK-17159] [streaming]: optimise check for new...
Github user petermaxlee commented on a diff in the pull request: https://github.com/apache/spark/pull/14731#discussion_r75583436 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala --- @@ -241,16 +233,21 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]]( * The files with mod time T+5 are not remembered and cannot be ignored (since, t+5 > t+1). * Hence they can get selected as new files again. To prevent this, files whose mod time is more * than current batch time are not considered. + * @param fs file status + * @param currentTime time of the batch + * @param modTimeIgnoreThreshold the ignore threshold + * @return true if the file has been modified within the batch window */ - private def isNewFile(path: Path, currentTime: Long, modTimeIgnoreThreshold: Long): Boolean = { + private def isNewFile(fs: FileStatus, currentTime: Long, modTimeIgnoreThreshold: Long): Boolean = { --- End diff -- indent is wrong here
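The two review comments above concern `FileInputDStream`'s modification-time window check. As the Scaladoc in the diff explains, a file counts as new only when its mod time falls strictly above the ignore threshold and at or below the current batch time; files older than the threshold were already processed (or are too old to track), and files newer than the batch time are deferred so they are not selected twice. A minimal Python sketch of that logic (hypothetical function and parameter names; the actual implementation is the Scala `isNewFile` shown in the diff):

```python
def is_new_file(mod_time, current_batch_time, mod_time_ignore_threshold):
    """Return True if the file's modification time falls inside the
    current batch window: (mod_time_ignore_threshold, current_batch_time].
    """
    if mod_time <= mod_time_ignore_threshold:
        return False  # already processed, or too old to remember
    if mod_time > current_batch_time:
        return False  # defer to a later batch to avoid double selection
    return True
```

For example, with a batch time of 10 and an ignore threshold of 3, a file modified at time 5 is selected, one modified at time 3 is skipped as already seen, and one modified at time 11 is deferred to the next batch.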
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14719 **[Test build #64155 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64155/consoleFull)** for PR 14719 at commit [`9ddc9d8`](https://github.com/apache/spark/commit/9ddc9d858fc3d5b269a8a762b356a545f70646d6).
[GitHub] spark issue #14735: [MINOR][SPARKR] R MLlib refactor, cleanup, reformat, fix...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14735 Merged build finished. Test FAILed.
[GitHub] spark issue #14735: [MINOR][SPARKR] R MLlib refactor, cleanup, reformat, fix...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14735 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64154/ Test FAILed.
[GitHub] spark issue #14735: [MINOR][SPARKR] R MLlib refactor, cleanup, reformat, fix...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14735 **[Test build #64154 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64154/consoleFull)** for PR 14735 at commit [`1ef18d6`](https://github.com/apache/spark/commit/1ef18d6abfe854c95e0323a406065d9ee4f11c15). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14155 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64149/ Test PASSed.
[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14155 Merged build finished. Test PASSed.
[GitHub] spark issue #14155: [SPARK-16498][SQL] move hive hack for data source table ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14155 **[Test build #64149 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64149/consoleFull)** for PR 14155 at commit [`38b838a`](https://github.com/apache/spark/commit/38b838a9d27d5e11bad5f5e7040fe2d6d2e56216). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14734: [SPARK-16508][SPARKR] small doc updates
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14734 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64153/ Test PASSed.
[GitHub] spark issue #14734: [SPARK-16508][SPARKR] small doc updates
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14734 Merged build finished. Test PASSed.
[GitHub] spark issue #14734: [SPARK-16508][SPARKR] small doc updates
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14734 **[Test build #64153 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64153/consoleFull)** for PR 14734 at commit [`4b6c42e`](https://github.com/apache/spark/commit/4b6c42ec1861cb3e48e85d83c22caccb910532ce). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14734: [SPARK-16508][SPARKR] small doc updates
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14734 Merged build finished. Test PASSed.
[GitHub] spark issue #14734: [SPARK-16508][SPARKR] small doc updates
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14734 **[Test build #64152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64152/consoleFull)** for PR 14734 at commit [`341a2f8`](https://github.com/apache/spark/commit/341a2f8b85d584e9715605c9689d4c77b53483a2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14734: [SPARK-16508][SPARKR] small doc updates
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14734 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64152/ Test PASSed.
[GitHub] spark issue #14735: [MINOR][SPARKR] R MLlib refactor, cleanup, reformat, fix...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14735 **[Test build #64151 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64151/consoleFull)** for PR 14735 at commit [`3ea30bb`](https://github.com/apache/spark/commit/3ea30bb3b5d22626f6de6e0699504180f267dfdc). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14735: [MINOR][SPARKR] R MLlib refactor, cleanup, reformat, fix...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14735 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64151/ Test FAILed.
[GitHub] spark issue #14735: [MINOR][SPARKR] R MLlib refactor, cleanup, reformat, fix...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14735 Merged build finished. Test FAILed.
[GitHub] spark issue #14735: [MINOR][SPARKR] R MLlib refactor, cleanup, reformat, fix...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14735 **[Test build #64154 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64154/consoleFull)** for PR 14735 at commit [`1ef18d6`](https://github.com/apache/spark/commit/1ef18d6abfe854c95e0323a406065d9ee4f11c15).
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14719 Merged build finished. Test FAILed.
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14719 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64148/ Test FAILed.
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14719 **[Test build #64148 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64148/consoleFull)** for PR 14719 at commit [`48a0775`](https://github.com/apache/spark/commit/48a0775e80cc91340cb0754c62b35868f319cf44). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #14734: [SPARK-16508][SPARKR] small doc updates
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14734 **[Test build #64153 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/64153/consoleFull)** for PR 14734 at commit [`4b6c42e`](https://github.com/apache/spark/commit/4b6c42ec1861cb3e48e85d83c22caccb910532ce).
[GitHub] spark pull request #14734: [SPARK-16508][SPARKR] small doc updates
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14734#discussion_r75581656 --- Diff: R/pkg/R/DataFrame.R --- @@ -2880,7 +2880,7 @@ setMethod("fillna", #' #' @param x a SparkDataFrame. #' @param row.names NULL or a character vector giving the row names for the data frame. --- End diff -- Updated a few places where we reference `NULL` literally. There are more "null"s in the DataFrame and column function documentation, but those are in a somewhat gray area - JVM `null` is mapped to R `NA` (and not to `NULL`) - and we should look into the best way to name functions and document them.
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14719 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/64146/ Test FAILed.
[GitHub] spark issue #14719: [SPARK-17154][SQL] Wrong result can be returned or Analy...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14719 Merged build finished. Test FAILed.