[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55482347

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
         p
       }
+    // Eliminate the Projects with empty projectList
+    case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Let me respond to the original question by @cloud-fan: we will not see an empty Project whose child has more than one column. An empty Project only appears after ColumnPruning has run. I am fine with it if we want to add an extra rule dedicated to eliminating such Projects.
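For readers following the thread, here is a minimal, self-contained sketch of the transformation being discussed, written as a standalone Catalyst rule rather than as the extra case inside ColumnPruning that the PR actually adds; the rule name `RemoveEmptyProject` and its placement are assumptions for illustration only.

```scala
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative only: the PR adds the equivalent case inside ColumnPruning.
object RemoveEmptyProject extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // A Project with an empty projectList exposes no columns of its own, so it
    // can be collapsed into its child. Per the comment above, such nodes only
    // appear after ColumnPruning has stripped every column from a Project.
    case Project(projectList, child) if projectList.isEmpty => child
  }
}
```

Applied to the failing plan shown later in this thread, this turns `Project [1 AS 1#0] <- Project <- OneRowRelation` into `Project [1 AS 1#0] <- OneRowRelation`.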
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user dilipbiswal commented on the pull request: https://github.com/apache/spark/pull/11596#issuecomment-194167614

@cloud-fan Thanks! Actually, I remember exploring this option. There is a Generate syntax for an outer lateral view, like the following:
```sql
SELECT * FROM src LATERAL VIEW OUTER explode(array()) C AS a limit 10;
```
I thought that if we converted this Generate operator into a sub-select with the generator in the projection list, we would lose the OUTER semantics. That's why I felt it was safer to generate a LATERAL VIEW clause instead.
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55482082

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
         p
       }
+    // Eliminate the Projects with empty projectList
+    case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Thanks @viirya @cloud-fan! I am not sure which way is better.
```scala
case p @ Project(_, l: LeafNode) if !l.isInstanceOf[OneRowRelation] => p
```
My concern is that the line above looks hackier than the current PR fix.
[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...
Github user olarayej commented on a diff in the pull request: https://github.com/apache/spark/pull/11220#discussion_r55481841

--- Diff: R/pkg/R/DataFrame.R ---
@@ -303,8 +303,28 @@ setMethod("colnames",
 #' @rdname columns
 #' @name colnames<-
 setMethod("colnames<-",
-          signature(x = "DataFrame", value = "character"),
--- End diff --

Thanks @felixcheung for investigating this further!
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55481794

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
         p
       }
+    // Eliminate the Projects with empty projectList
+    case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Yea. As I posted before.
[GitHub] spark pull request: Update Java Doc in Spark Submit
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11600#issuecomment-194164345 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55481593

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
         p
       }
+    // Eliminate the Projects with empty projectList
+    case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Without the fix, I hit the error:
```
== FAIL: Plans do not match ===
 Project [1 AS 1#0]       Project [1 AS 1#0]
!+- Project               +- OneRowRelation$
!   +- OneRowRelation$

ScalaTestFailureLocation: org.apache.spark.sql.catalyst.plans.PlanTest at (PlanTest.scala:59)
org.scalatest.exceptions.TestFailedException: == FAIL: Plans do not match ===
 Project [1 AS 1#0]       Project [1 AS 1#0]
!+- Project               +- OneRowRelation$
!   +- OneRowRelation$
```
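A plan comparison like the one above is typically produced by a PlanTest-based optimizer suite. The following is an illustrative sketch, not the suite from this PR: the suite name, the batch wiring, and the reliance on the patched ColumnPruning rule are all assumptions; only the `comparePlans` helper and the DSL are standard Catalyst test utilities.

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.optimizer.ColumnPruning
import org.apache.spark.sql.catalyst.plans.PlanTest
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, OneRowRelation}
import org.apache.spark.sql.catalyst.rules.RuleExecutor

class EmptyProjectSuite extends PlanTest {
  private object Optimize extends RuleExecutor[LogicalPlan] {
    val batches = Batch("Column pruning", FixedPoint(10), ColumnPruning) :: Nil
  }

  test("empty Project on top of OneRowRelation is eliminated") {
    // Builds Project [1 AS one] <- Project [] <- OneRowRelation by hand.
    val query = OneRowRelation.select().select(Literal(1).as("one")).analyze
    // With the fix, the intermediate column-less Project should disappear.
    val expected = OneRowRelation.select(Literal(1).as("one")).analyze
    comparePlans(Optimize.execute(query), expected)
  }
}
```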
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55481539

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
         p
       }
+    // Eliminate the Projects with empty projectList
+    case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Never mind. It is because of another rule I added.
[GitHub] spark pull request: Update Java Doc in Spark Submit
GitHub user AhmedKamal opened a pull request: https://github.com/apache/spark/pull/11600

Update Java Doc in Spark Submit

## What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-13769

The Java doc here (https://github.com/apache/spark/blob/e97fc7f176f8bf501c9b3afd8410014e3b0e1602/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L51) needs to be updated from "The latter two operations are currently supported only for standalone cluster mode." to "The latter two operations are currently supported only for standalone and mesos cluster modes."

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AhmedKamal/spark SPARK-13769

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11600.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11600

commit 104b5059c315425ce0861324005f6b3d86f7cd3c
Author: Ahmed Kamal
Date: 2016-03-09T07:41:16Z

    Update Java Doc in Spark Submit
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55481375

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
         p
       }
+    // Eliminate the Projects with empty projectList
+    case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Actually, I just tried to run this test without this patch. It passes. @gatorsmile Can you verify it?
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55481341

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
         p
       }
+    // Eliminate the Projects with empty projectList
+    case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

How about this?
```scala
case p @ Project(_, l: LeafNode) if !l.isInstanceOf[OneRowRelation] => p
```
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55481255

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
         p
       }
+    // Eliminate the Projects with empty projectList
+    case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

```scala
case p @ Project(_, l: LeafNode) => p
```
There is another case above it. Thus, it will stop here.
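The ordering point made above is that a `transform` takes a single partial function, so the first case that matches a node wins; once the broad `Project(_, l: LeafNode)` case matches, a later case is never evaluated for that node. A toy, self-contained sketch of that effect (not Spark code; the simplified `Plan` types are made up for illustration):

```scala
object CaseOrderingToy {
  sealed trait Plan
  case object OneRowRelation extends Plan
  case class Project(projectList: Seq[String], child: Plan) extends Plan

  // The first matching case wins: the broad "Project over a leaf" case catches
  // Project(Nil, OneRowRelation) before the empty-projectList case can fire.
  def prune(plan: Plan): Plan = plan match {
    case p @ Project(_, OneRowRelation) => p   // stands in for `Project(_, l: LeafNode) => p`
    case Project(projectList, child) if projectList.isEmpty => child
    case other => other
  }
}
```

For example, `CaseOrderingToy.prune(Project(Nil, OneRowRelation))` returns the empty `Project` unchanged, which is exactly the shape left behind in the plan mismatch quoted earlier in this thread.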
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55480958

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
         p
       }
+    // Eliminate the Projects with empty projectList
+    case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

But a `Project` with an empty projectList also has no output, right?
[GitHub] spark pull request: [SPARK-13706] [ML] Add Python Example for Trai...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/11547#issuecomment-194153088 Made a few minor comments, pending those LGTM.
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11596#issuecomment-194153085 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52725/ Test PASSed.
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11596#issuecomment-194153084 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13706] [ML] Add Python Example for Trai...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11547#discussion_r55480608 --- Diff: examples/src/main/python/ml/train_validation_split.py --- @@ -0,0 +1,69 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from pyspark import SparkContext +# $example on$ +from pyspark.ml import Pipeline +from pyspark.ml.evaluation import RegressionEvaluator +from pyspark.ml.regression import LinearRegression +from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit +from pyspark.sql import SQLContext +# $example off$ + +""" +This example demonstrats applying TrainValidationSplit to split data +and preform model selection, as well as applying Pipelines. +Run with: + + bin/spark-submit examples/src/main/python/ml/train_validation_split.py +""" + +if __name__ == "__main__": +sc = SparkContext(appName="TrainValidationSplit") +sqlContext = SQLContext(sc) +# $example on$ +# Prepare training and test data. +data = sqlContext.read.format("libsvm")\ +.load("data/mllib/sample_linear_regression_data.txt") +train, test = data.randomSplit([0.7, 0.3]) +lr = LinearRegression(maxIter=10, regParam=0.1) + +# We use a ParamGridBuilder to construct a grid of parameters to search over. +# TrainValidationSplit will try all combinations of values and determine best model using +# the evaluator. +paramGrid = ParamGridBuilder()\ +.addGrid(lr.regParam, [0.1, 0.01]) \ +.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\ +.build() + +# In this case the estimator is simply the linear regression. +# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator. +tvs = TrainValidationSplit(estimator=lr, + estimatorParamMaps=paramGrid, + evaluator=RegressionEvaluator(), + # 80% of the data will be used for training, 20% for validation. + trainRatio=0.8) + +# Run TrainValidationSplit, chosing the set of parameters that optimizes the evaluator. --- End diff -- Perhaps we could change this comment to something like `Running TrainValidationSplit returns the model with the combination of parameters that performed best on the validation set.` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13706] [ML] Add Python Example for Trai...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11547#discussion_r55480665 --- Diff: examples/src/main/python/ml/train_validation_split.py --- @@ -0,0 +1,69 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from pyspark import SparkContext +# $example on$ +from pyspark.ml import Pipeline +from pyspark.ml.evaluation import RegressionEvaluator +from pyspark.ml.regression import LinearRegression +from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit +from pyspark.sql import SQLContext +# $example off$ + +""" +This example demonstrats applying TrainValidationSplit to split data +and preform model selection, as well as applying Pipelines. +Run with: + + bin/spark-submit examples/src/main/python/ml/train_validation_split.py +""" + +if __name__ == "__main__": +sc = SparkContext(appName="TrainValidationSplit") +sqlContext = SQLContext(sc) +# $example on$ +# Prepare training and test data. +data = sqlContext.read.format("libsvm")\ +.load("data/mllib/sample_linear_regression_data.txt") +train, test = data.randomSplit([0.7, 0.3]) +lr = LinearRegression(maxIter=10, regParam=0.1) + +# We use a ParamGridBuilder to construct a grid of parameters to search over. +# TrainValidationSplit will try all combinations of values and determine best model using +# the evaluator. +paramGrid = ParamGridBuilder()\ +.addGrid(lr.regParam, [0.1, 0.01]) \ +.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\ +.build() + +# In this case the estimator is simply the linear regression. +# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator. +tvs = TrainValidationSplit(estimator=lr, + estimatorParamMaps=paramGrid, + evaluator=RegressionEvaluator(), + # 80% of the data will be used for training, 20% for validation. + trainRatio=0.8) + +# Run TrainValidationSplit, chosing the set of parameters that optimizes the evaluator. +model = tvs.fit(train) +# Make predictions on test data. model is the model with combination of parameters --- End diff -- ... and here we can then simply have `Make predictions on test data using the model` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11596#issuecomment-194152932

**[Test build #52725 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52725/consoleFull)** for PR 11596 at commit [`d7f0207`](https://github.com/apache/spark/commit/d7f020701b3b063beb00293af3ff4149211926ab).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55480611

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
         p
       }
+    // Eliminate the Projects with empty projectList
+    case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

Because `OneRowRelation` has no output, its output is different from that of its parent `Project`.
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/11596#issuecomment-194152170

I'm not sure which one is better; it depends on which one is easier to understand and implement. I'm OK with generating a verbose SQL string, but the generator itself should be as simple and robust as possible.
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55480282

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -380,6 +380,9 @@ object ColumnPruning extends Rule[LogicalPlan] {
         p
       }
+    // Eliminate the Projects with empty projectList
+    case p @ Project(projectList, child) if projectList.isEmpty => child
--- End diff --

I'm thinking about the correctness of this rule. Actually, this is not column pruning but rather adding columns, as `child` may have more than one column. And why can't this rule work: `case p @ Project(projectList, child) if sameOutput(child.output, p.output) => child`?
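To make the correctness concern concrete, here is a small illustrative snippet using the Catalyst DSL, for example in a Spark REPL; the column names and types are arbitrary, and nothing here is taken from the PR itself:

```scala
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, Project}

// A child relation exposing two columns.
val child = LocalRelation('a.int, 'b.string)

// A Project with an empty projectList exposes no columns at all.
val emptyProject = Project(Nil, child)

emptyProject.output   // Seq(): zero attributes
child.output          // two attributes, a and b

// Replacing emptyProject with child therefore widens the visible schema,
// which is the correctness concern raised in the comment above.
```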
[GitHub] spark pull request: [SPARK-13706] [ML] Add Python Example for Trai...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11547#discussion_r55479803 --- Diff: examples/src/main/python/ml/train_validation_split.py --- @@ -0,0 +1,69 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from pyspark import SparkContext +# $example on$ +from pyspark.ml import Pipeline +from pyspark.ml.evaluation import RegressionEvaluator +from pyspark.ml.regression import LinearRegression +from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit +from pyspark.sql import SQLContext +# $example off$ + +""" +This example demonstrats applying TrainValidationSplit to split data --- End diff -- Typo: `demonstrats` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13636][SQL] Directly consume UnsafeRow ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11484#issuecomment-194146675 **[Test build #52733 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52733/consoleFull)** for PR 11484 at commit [`dea644a`](https://github.com/apache/spark/commit/dea644a74251c62f1ccb0fd095083d434acd1a8c).
[GitHub] spark pull request: [SPARK-13636][SQL] Directly consume UnsafeRow ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11484#issuecomment-194145448 **[Test build #52732 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52732/consoleFull)** for PR 11484 at commit [`6400eb2`](https://github.com/apache/spark/commit/6400eb22aefa986fb5d96d3d6d0242778e2f332a).
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/11599#issuecomment-194143635 @cloud-fan Added another two cases. Feel free to let me know if you want me to add more cases. Thanks!
[GitHub] spark pull request: [SPARK-13759][SQL] Add IsNotNull constraints f...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11594#issuecomment-194142858 **[Test build #52731 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52731/consoleFull)** for PR 11594 at commit [`84bac4d`](https://github.com/apache/spark/commit/84bac4d7309121c0fd699ea2b06aa05b775a4d81).
[GitHub] spark pull request: [SPARK-13759][SQL] Add IsNotNull constraints f...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11594#issuecomment-194141605 **[Test build #52730 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52730/consoleFull)** for PR 11594 at commit [`54c3f49`](https://github.com/apache/spark/commit/54c3f49f6f558b77d8bd805269ed1ca8b8be3fad).
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11599#issuecomment-194141048 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52724/ Test PASSed.
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11599#issuecomment-194141046 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11599#issuecomment-194140806

**[Test build #52724 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52724/consoleFull)** for PR 11599 at commit [`a31b1b5`](https://github.com/apache/spark/commit/a31b1b588949f2f92981f7d1a7d04d6e1806ccd1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...
Github user sun-rui commented on a diff in the pull request: https://github.com/apache/spark/pull/10947#discussion_r55477889 --- Diff: core/src/main/scala/org/apache/spark/api/r/RRunner.scala --- @@ -0,0 +1,367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.api.r + +import java.io._ +import java.net.{InetAddress, ServerSocket} +import java.util.Arrays + +import scala.io.Source +import scala.util.Try + +import org.apache.spark._ +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.util.Utils + +/** + * A helper class to run R UDFs in Spark. + */ +private[spark] class RRunner[U]( +func: Array[Byte], +deserializer: String, +serializer: String, +packageNames: Array[Byte], +broadcastVars: Array[Broadcast[Object]], +numPartitions: Int = -1) + extends Logging { + private var bootTime: Double = _ + private var dataStream: DataInputStream = _ + val readData = numPartitions match { +case -1 => + serializer match { +case SerializationFormats.STRING => readStringData _ +case _ => readByteArrayData _ + } +case _ => readShuffledData _ + } + + def compute( + inputIterator: Iterator[_], + partitionIndex: Int, + context: TaskContext): Iterator[U] = { +// Timing start +bootTime = System.currentTimeMillis / 1000.0 + +// we expect two connections +val serverSocket = new ServerSocket(0, 2, InetAddress.getByName("localhost")) +val listenPort = serverSocket.getLocalPort() + +// The stdout/stderr is shared by multiple tasks, because we use one daemon +// to launch child process as worker. +val errThread = RRunner.createRWorker(listenPort) + +// We use two sockets to separate input and output, then it's easy to manage +// the lifecycle of them to avoid deadlock. +// TODO: optimize it to use one socket + +// the socket used to send out the input of task +serverSocket.setSoTimeout(1) +val inSocket = serverSocket.accept() +startStdinThread(inSocket.getOutputStream(), inputIterator, partitionIndex) + +// the socket used to receive the output of task +val outSocket = serverSocket.accept() +val inputStream = new BufferedInputStream(outSocket.getInputStream) +dataStream = new DataInputStream(inputStream) +serverSocket.close() + +try { + return new Iterator[U] { +def next(): U = { + val obj = _nextObj + if (hasNext) { +_nextObj = read() + } + obj +} + +var _nextObj = read() + +def hasNext(): Boolean = { + val hasMore = (_nextObj != null) + if (!hasMore) { +dataStream.close() + } + hasMore +} + } +} catch { + case e: Exception => +throw new SparkException("R computation failed with\n " + errThread.getLines()) +} + } + + /** + * Start a thread to write RDD data to the R process. 
+ */ + private def startStdinThread( + output: OutputStream, + iter: Iterator[_], + partitionIndex: Int): Unit = { +val env = SparkEnv.get +val taskContext = TaskContext.get() +val bufferSize = System.getProperty("spark.buffer.size", "65536").toInt +val stream = new BufferedOutputStream(output, bufferSize) + +new Thread("writer for R") { + override def run(): Unit = { +try { + SparkEnv.set(env) + TaskContext.setTaskContext(taskContext) + val dataOut = new DataOutputStream(stream) + dataOut.writeInt(partitionIndex) + + SerDe.writeString(dataOut, deserializer) + SerDe.writeString(dataOut, serializer)
[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...
Github user sun-rui commented on a diff in the pull request: https://github.com/apache/spark/pull/10947#discussion_r55477830 --- Diff: core/src/main/scala/org/apache/spark/api/r/RRunner.scala --- @@ -0,0 +1,367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.api.r + +import java.io._ +import java.net.{InetAddress, ServerSocket} +import java.util.Arrays + +import scala.io.Source +import scala.util.Try + +import org.apache.spark._ +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.util.Utils + +/** + * A helper class to run R UDFs in Spark. + */ +private[spark] class RRunner[U]( +func: Array[Byte], +deserializer: String, +serializer: String, +packageNames: Array[Byte], +broadcastVars: Array[Broadcast[Object]], +numPartitions: Int = -1) + extends Logging { + private var bootTime: Double = _ + private var dataStream: DataInputStream = _ + val readData = numPartitions match { --- End diff -- line 44 ~line 51 are new code. readData is a val holding a function for reading data returned from the R worker. It is one of the following 3 functions: readShuffledData(), readByteArrayData(), readStringData() According the constructor parameters, we can decide which function is used. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
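The comment above describes a dispatch pattern: the reader is chosen once, from the constructor parameters, and stored in a `val` holding a function. A toy, self-contained sketch of that pattern follows; the class name, the string return values, and the no-argument reader signatures are made up for illustration (the real RRunner readers take a `DataInputStream` and return deserialized records):

```scala
// Toy version of the readData dispatch: pick the reader function once, at
// construction time, and store it as a function value (eta-expansion with `_`).
class Runner(serializer: String, numPartitions: Int = -1) {
  private def readShuffledData(): String = "shuffled rows"
  private def readByteArrayData(): String = "byte-array rows"
  private def readStringData(): String = "string rows"

  // Every later read call goes through this stored function value.
  val readData: () => String = numPartitions match {
    case -1 => serializer match {
      case "string" => readStringData _
      case _        => readByteArrayData _
    }
    case _ => readShuffledData _
  }
}

// e.g. new Runner("string").readData() returns "string rows"
```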
[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...
Github user sun-rui commented on a diff in the pull request: https://github.com/apache/spark/pull/10947#discussion_r55477700 --- Diff: core/src/main/scala/org/apache/spark/api/r/RRunner.scala --- @@ -0,0 +1,367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.api.r + +import java.io._ +import java.net.{InetAddress, ServerSocket} +import java.util.Arrays + +import scala.io.Source +import scala.util.Try + +import org.apache.spark._ +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.util.Utils + +/** + * A helper class to run R UDFs in Spark. + */ +private[spark] class RRunner[U]( +func: Array[Byte], +deserializer: String, +serializer: String, +packageNames: Array[Byte], +broadcastVars: Array[Broadcast[Object]], +numPartitions: Int = -1) + extends Logging { + private var bootTime: Double = _ + private var dataStream: DataInputStream = _ + val readData = numPartitions match { +case -1 => + serializer match { +case SerializationFormats.STRING => readStringData _ +case _ => readByteArrayData _ + } +case _ => readShuffledData _ + } + + def compute( + inputIterator: Iterator[_], + partitionIndex: Int, + context: TaskContext): Iterator[U] = { +// Timing start +bootTime = System.currentTimeMillis / 1000.0 + +// we expect two connections +val serverSocket = new ServerSocket(0, 2, InetAddress.getByName("localhost")) +val listenPort = serverSocket.getLocalPort() + +// The stdout/stderr is shared by multiple tasks, because we use one daemon +// to launch child process as worker. +val errThread = RRunner.createRWorker(listenPort) + +// We use two sockets to separate input and output, then it's easy to manage +// the lifecycle of them to avoid deadlock. +// TODO: optimize it to use one socket + +// the socket used to send out the input of task +serverSocket.setSoTimeout(1) +val inSocket = serverSocket.accept() +startStdinThread(inSocket.getOutputStream(), inputIterator, partitionIndex) + +// the socket used to receive the output of task +val outSocket = serverSocket.accept() +val inputStream = new BufferedInputStream(outSocket.getInputStream) +dataStream = new DataInputStream(inputStream) +serverSocket.close() + +try { + return new Iterator[U] { +def next(): U = { + val obj = _nextObj + if (hasNext) { +_nextObj = read() + } + obj +} + +var _nextObj = read() + +def hasNext(): Boolean = { + val hasMore = (_nextObj != null) + if (!hasMore) { +dataStream.close() + } + hasMore +} + } +} catch { + case e: Exception => +throw new SparkException("R computation failed with\n " + errThread.getLines()) +} + } + + /** + * Start a thread to write RDD data to the R process. 
+ */ + private def startStdinThread( + output: OutputStream, + iter: Iterator[_], + partitionIndex: Int): Unit = { +val env = SparkEnv.get +val taskContext = TaskContext.get() +val bufferSize = System.getProperty("spark.buffer.size", "65536").toInt +val stream = new BufferedOutputStream(output, bufferSize) + +new Thread("writer for R") { + override def run(): Unit = { +try { + SparkEnv.set(env) + TaskContext.setTaskContext(taskContext) + val dataOut = new DataOutputStream(stream) + dataOut.writeInt(partitionIndex) + + SerDe.writeString(dataOut, deserializer) + SerDe.writeString(dataOut, serializer)
[GitHub] spark pull request: [SPARK-12893][YARN] Fix history URL redirect e...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10821#issuecomment-194140043 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13759][SQL] Add IsNotNull constraints f...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11594#issuecomment-194140033 **[Test build #52729 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52729/consoleFull)** for PR 11594 at commit [`635eae2`](https://github.com/apache/spark/commit/635eae217f293af248a68cafca7ba6187c2a1886).
[GitHub] spark pull request: [SPARK-12893][YARN] Fix history URL redirect e...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10821#issuecomment-194140046 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52722/ Test PASSed.
[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...
Github user sun-rui commented on a diff in the pull request: https://github.com/apache/spark/pull/10947#discussion_r55477584 --- Diff: core/src/main/scala/org/apache/spark/api/r/RRunner.scala --- @@ -0,0 +1,367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.api.r + +import java.io._ +import java.net.{InetAddress, ServerSocket} +import java.util.Arrays + +import scala.io.Source +import scala.util.Try + +import org.apache.spark._ +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.util.Utils + +/** + * A helper class to run R UDFs in Spark. + */ +private[spark] class RRunner[U]( +func: Array[Byte], +deserializer: String, +serializer: String, +packageNames: Array[Byte], +broadcastVars: Array[Broadcast[Object]], +numPartitions: Int = -1) + extends Logging { + private var bootTime: Double = _ + private var dataStream: DataInputStream = _ + val readData = numPartitions match { +case -1 => + serializer match { +case SerializationFormats.STRING => readStringData _ +case _ => readByteArrayData _ + } +case _ => readShuffledData _ + } + + def compute( + inputIterator: Iterator[_], + partitionIndex: Int, + context: TaskContext): Iterator[U] = { +// Timing start +bootTime = System.currentTimeMillis / 1000.0 + +// we expect two connections +val serverSocket = new ServerSocket(0, 2, InetAddress.getByName("localhost")) +val listenPort = serverSocket.getLocalPort() + +// The stdout/stderr is shared by multiple tasks, because we use one daemon +// to launch child process as worker. +val errThread = RRunner.createRWorker(listenPort) + +// We use two sockets to separate input and output, then it's easy to manage +// the lifecycle of them to avoid deadlock. +// TODO: optimize it to use one socket + +// the socket used to send out the input of task +serverSocket.setSoTimeout(1) +val inSocket = serverSocket.accept() +startStdinThread(inSocket.getOutputStream(), inputIterator, partitionIndex) + +// the socket used to receive the output of task +val outSocket = serverSocket.accept() +val inputStream = new BufferedInputStream(outSocket.getInputStream) +dataStream = new DataInputStream(inputStream) +serverSocket.close() + +try { + return new Iterator[U] { +def next(): U = { + val obj = _nextObj + if (hasNext) { +_nextObj = read() + } + obj +} + +var _nextObj = read() + +def hasNext(): Boolean = { + val hasMore = (_nextObj != null) + if (!hasMore) { +dataStream.close() + } + hasMore +} + } +} catch { + case e: Exception => +throw new SparkException("R computation failed with\n " + errThread.getLines()) +} + } + + /** + * Start a thread to write RDD data to the R process. 
+ */ + private def startStdinThread( + output: OutputStream, + iter: Iterator[_], + partitionIndex: Int): Unit = { +val env = SparkEnv.get +val taskContext = TaskContext.get() +val bufferSize = System.getProperty("spark.buffer.size", "65536").toInt +val stream = new BufferedOutputStream(output, bufferSize) + +new Thread("writer for R") { + override def run(): Unit = { +try { + SparkEnv.set(env) + TaskContext.setTaskContext(taskContext) + val dataOut = new DataOutputStream(stream) + dataOut.writeInt(partitionIndex) + + SerDe.writeString(dataOut, deserializer) + SerDe.writeString(dataOut, serializer)
[GitHub] spark pull request: [SPARK-12893][YARN] Fix history URL redirect e...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10821#issuecomment-194139901

**[Test build #52722 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52722/consoleFull)** for PR 10821 at commit [`c98c3b5`](https://github.com/apache/spark/commit/c98c3b5b28c834c9652f178b1669e2af977e6123).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user dilipbiswal commented on the pull request: https://github.com/apache/spark/pull/11596#issuecomment-194139575

@cloud-fan Yeah, that should be possible. Are you thinking of representing each Generate as a sub-select clause?
[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11555#issuecomment-194139422 **[Test build #52728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52728/consoleFull)** for PR 11555 at commit [`dab7a2f`](https://github.com/apache/spark/commit/dab7a2f1a5cc0438405b0fa1cf532ab883bed7e7).
[GitHub] spark pull request: [SPARK-13320] [SQL] Support Star in CreateStru...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11208#issuecomment-194138373 **[Test build #52727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52727/consoleFull)** for PR 11208 at commit [`e060dea`](https://github.com/apache/spark/commit/e060deaaf09d122966f090bf3b86895636418664).
[GitHub] spark pull request: [SPARK-13320] [SQL] Support Star in CreateStru...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/11208#issuecomment-194136402 retest this please
[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...
Github user sun-rui commented on a diff in the pull request: https://github.com/apache/spark/pull/10947#discussion_r55476565 --- Diff: core/src/main/scala/org/apache/spark/api/r/RRunner.scala --- @@ -0,0 +1,367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.api.r + +import java.io._ +import java.net.{InetAddress, ServerSocket} +import java.util.Arrays + +import scala.io.Source +import scala.util.Try + +import org.apache.spark._ +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.util.Utils + +/** + * A helper class to run R UDFs in Spark. + */ +private[spark] class RRunner[U]( +func: Array[Byte], +deserializer: String, +serializer: String, +packageNames: Array[Byte], +broadcastVars: Array[Broadcast[Object]], +numPartitions: Int = -1) + extends Logging { + private var bootTime: Double = _ + private var dataStream: DataInputStream = _ + val readData = numPartitions match { +case -1 => + serializer match { +case SerializationFormats.STRING => readStringData _ +case _ => readByteArrayData _ + } +case _ => readShuffledData _ + } + + def compute( + inputIterator: Iterator[_], + partitionIndex: Int, + context: TaskContext): Iterator[U] = { +// Timing start +bootTime = System.currentTimeMillis / 1000.0 + +// we expect two connections +val serverSocket = new ServerSocket(0, 2, InetAddress.getByName("localhost")) +val listenPort = serverSocket.getLocalPort() + +// The stdout/stderr is shared by multiple tasks, because we use one daemon +// to launch child process as worker. +val errThread = RRunner.createRWorker(listenPort) + +// We use two sockets to separate input and output, then it's easy to manage +// the lifecycle of them to avoid deadlock. +// TODO: optimize it to use one socket + +// the socket used to send out the input of task +serverSocket.setSoTimeout(1) +val inSocket = serverSocket.accept() +startStdinThread(inSocket.getOutputStream(), inputIterator, partitionIndex) + +// the socket used to receive the output of task +val outSocket = serverSocket.accept() +val inputStream = new BufferedInputStream(outSocket.getInputStream) +dataStream = new DataInputStream(inputStream) +serverSocket.close() + +try { + return new Iterator[U] { +def next(): U = { + val obj = _nextObj + if (hasNext) { +_nextObj = read() + } + obj +} + +var _nextObj = read() + +def hasNext(): Boolean = { + val hasMore = (_nextObj != null) + if (!hasMore) { +dataStream.close() + } + hasMore +} + } +} catch { + case e: Exception => +throw new SparkException("R computation failed with\n " + errThread.getLines()) +} + } + + /** + * Start a thread to write RDD data to the R process. 
+ */ + private def startStdinThread( + output: OutputStream, + iter: Iterator[_], + partitionIndex: Int): Unit = { +val env = SparkEnv.get +val taskContext = TaskContext.get() +val bufferSize = System.getProperty("spark.buffer.size", "65536").toInt +val stream = new BufferedOutputStream(output, bufferSize) + +new Thread("writer for R") { + override def run(): Unit = { +try { + SparkEnv.set(env) + TaskContext.setTaskContext(taskContext) + val dataOut = new DataOutputStream(stream) + dataOut.writeInt(partitionIndex) + + SerDe.writeString(dataOut, deserializer) + SerDe.writeString(dataOut, serializer)
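The hunk above sets up a single ServerSocket and accepts two connections from the R worker: the first carries the task input and the second carries the task output. Below is a minimal, self-contained Scala sketch of that two-connection handshake, with a stand-in worker thread in place of the real R process; every name in it is an illustrative assumption rather than the actual RRunner API.

```scala
import java.io.{BufferedReader, InputStreamReader, PrintWriter}
import java.net.{InetAddress, ServerSocket, Socket}

// Sketch of the two-connection handshake: one listener, where the first accept()
// feeds input to the worker and the second accept() reads the worker's output.
// The "worker" here is a local thread that upper-cases lines; the real code
// launches an R process instead.
object TwoSocketSketch {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(0, 2, InetAddress.getByName("localhost"))
    val port = server.getLocalPort

    // Stand-in worker: read all input on the first connection, then open the
    // second connection and write the results back.
    new Thread {
      override def run(): Unit = {
        val in = new Socket("localhost", port)
        val reader = new BufferedReader(new InputStreamReader(in.getInputStream))
        val lines = Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
        in.close()
        val out = new Socket("localhost", port)
        val writer = new PrintWriter(out.getOutputStream, true)
        lines.foreach(line => writer.println(line.toUpperCase))
        out.close()
      }
    }.start()

    // Driver side, mirroring compute(): accept the input socket and write the
    // partition's data, then accept the output socket and read the results.
    val inSocket = server.accept()
    val stdin = new PrintWriter(inSocket.getOutputStream, true)
    Seq("a", "b", "c").foreach(line => stdin.println(line))
    stdin.close() // closing the stream signals end-of-input to the worker

    val outSocket = server.accept()
    server.close() // both connections are established; stop listening
    val results = new BufferedReader(new InputStreamReader(outSocket.getInputStream))
    Iterator.continually(results.readLine()).takeWhile(_ != null).foreach(line => println(line))
    outSocket.close()
  }
}
```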
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/11596#issuecomment-194133518 But a `Generate` operator only has one `generator`, so I think we can convert any kind of `Generate` to SELECT? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10947#issuecomment-194133921 cc @davies --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13696] Remove BlockStore class & simpli...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/11534#discussion_r55476210 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -789,117 +752,193 @@ private[spark] class BlockManager( // lockNewBlockForWriting returned a read lock on the existing block, so we must free it: releaseLock(blockId) } -return DoPutSucceeded +return true } } val startTimeMs = System.currentTimeMillis -// Size of the block in bytes -var size = 0L - -// The level we actually use to put the block -val putLevel = effectiveStorageLevel.getOrElse(level) - -// If we're storing bytes, then initiate the replication before storing them locally. +// Since we're storing bytes, initiate the replication before storing them locally. // This is faster as data is already serialized and ready to send. -val replicationFuture = data match { - case b: ByteBufferValues if putLevel.replication > 1 => -// Duplicate doesn't copy the bytes, but just creates a wrapper -val bufferView = b.buffer.duplicate() -Future { - // This is a blocking action and should run in futureExecutionContext which is a cached - // thread pool - replicate(blockId, bufferView, putLevel) -}(futureExecutionContext) - case _ => null +val replicationFuture = if (level.replication > 1) { + // Duplicate doesn't copy the bytes, but just creates a wrapper + val bufferView = bytes.duplicate() + Future { +// This is a blocking action and should run in futureExecutionContext which is a cached +// thread pool +replicate(blockId, bufferView, level) + }(futureExecutionContext) +} else { + null } var blockWasSuccessfullyStored = false -var iteratorFromFailedMemoryStorePut: Option[Iterator[Any]] = None - -putBlockInfo.synchronized { - logTrace("Put for block %s took %s to get into synchronized block" -.format(blockId, Utils.getUsedTimeMs(startTimeMs))) - try { -if (putLevel.useMemory) { - // Put it in memory first, even if it also has useDisk set to true; - // We will drop it to disk later if the memory store can't hold it. - data match { -case IteratorValues(iterator) => - memoryStore.putIterator(blockId, iterator(), putLevel) match { -case Right(s) => - size = s -case Left(iter) => - iteratorFromFailedMemoryStorePut = Some(iter) - } -case ByteBufferValues(bytes) => - bytes.rewind() - size = bytes.limit() - memoryStore.putBytes(blockId, bytes, putLevel) - } -} else if (putLevel.useDisk) { - data match { -case IteratorValues(iterator) => - diskStore.putIterator(blockId, iterator(), putLevel) match { -case Right(s) => - size = s -// putIterator() will never return Left (see its return type). - } -case ByteBufferValues(bytes) => - bytes.rewind() - size = bytes.limit() - diskStore.putBytes(blockId, bytes, putLevel) - } +bytes.rewind() +val size = bytes.limit() + +try { + if (level.useMemory) { +// Put it in memory first, even if it also has useDisk set to true; +// We will drop it to disk later if the memory store can't hold it. 
+val putSucceeded = if (level.deserialized) { + val values = dataDeserialize(blockId, bytes.duplicate()) + memoryStore.putIterator(blockId, values, level).isRight } else { - assert(putLevel == StorageLevel.NONE) - throw new BlockException( -blockId, s"Attempted to put block $blockId without specifying storage level!") + memoryStore.putBytes(blockId, size, () => bytes) +} +if (!putSucceeded && level.useDisk) { + logWarning(s"Persisting block $blockId to disk instead.") + diskStore.putBytes(blockId, bytes) } + } else if (level.useDisk) { +diskStore.putBytes(blockId, bytes) + } -val putBlockStatus = getCurrentBlockStatus(blockId, putBlockInfo) -blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid -if (blockWasSuccessfullyStored) { -
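For orientation, the new code path quoted above tries the memory store first whenever the storage level allows it, and falls back to the disk store only when the memory put fails and useDisk is set. The following is a heavily simplified, hypothetical sketch of that decision; MemoryStoreLike, DiskStoreLike, and Level are invented stand-ins, and locking, replication, and the deserialized-values branch are omitted.

```scala
// Stand-in interfaces invented for this sketch; not the real MemoryStore/DiskStore.
trait MemoryStoreLike {
  /** Returns true if the block fit in memory. */
  def putBytes(blockId: String, size: Long, bytes: () => Array[Byte]): Boolean
}

trait DiskStoreLike {
  def putBytes(blockId: String, bytes: Array[Byte]): Unit
}

/** Simplified stand-in for StorageLevel. */
case class Level(useMemory: Boolean, useDisk: Boolean, deserialized: Boolean)

class PutBytesSketch(memoryStore: MemoryStoreLike, diskStore: DiskStoreLike) {
  def putBytes(blockId: String, bytes: Array[Byte], level: Level): Unit = {
    if (level.useMemory) {
      // Memory first, even when useDisk is also set. (The real code additionally
      // deserializes the bytes into values when level.deserialized is true.)
      val putSucceeded = memoryStore.putBytes(blockId, bytes.length.toLong, () => bytes)
      if (!putSucceeded && level.useDisk) {
        // The memory store could not hold it: persist to disk instead.
        diskStore.putBytes(blockId, bytes)
      }
    } else if (level.useDisk) {
      diskStore.putBytes(blockId, bytes)
    }
  }
}
```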
[GitHub] spark pull request: [SPARK-13696] Remove BlockStore class & simpli...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/11534#discussion_r55476161 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -789,117 +752,193 @@ private[spark] class BlockManager( // lockNewBlockForWriting returned a read lock on the existing block, so we must free it: releaseLock(blockId) } -return DoPutSucceeded +return true } } val startTimeMs = System.currentTimeMillis -// Size of the block in bytes -var size = 0L - -// The level we actually use to put the block -val putLevel = effectiveStorageLevel.getOrElse(level) - -// If we're storing bytes, then initiate the replication before storing them locally. +// Since we're storing bytes, initiate the replication before storing them locally. // This is faster as data is already serialized and ready to send. -val replicationFuture = data match { - case b: ByteBufferValues if putLevel.replication > 1 => -// Duplicate doesn't copy the bytes, but just creates a wrapper -val bufferView = b.buffer.duplicate() -Future { - // This is a blocking action and should run in futureExecutionContext which is a cached - // thread pool - replicate(blockId, bufferView, putLevel) -}(futureExecutionContext) - case _ => null +val replicationFuture = if (level.replication > 1) { + // Duplicate doesn't copy the bytes, but just creates a wrapper + val bufferView = bytes.duplicate() + Future { +// This is a blocking action and should run in futureExecutionContext which is a cached +// thread pool +replicate(blockId, bufferView, level) + }(futureExecutionContext) +} else { + null } var blockWasSuccessfullyStored = false -var iteratorFromFailedMemoryStorePut: Option[Iterator[Any]] = None - -putBlockInfo.synchronized { - logTrace("Put for block %s took %s to get into synchronized block" -.format(blockId, Utils.getUsedTimeMs(startTimeMs))) - try { -if (putLevel.useMemory) { - // Put it in memory first, even if it also has useDisk set to true; - // We will drop it to disk later if the memory store can't hold it. - data match { -case IteratorValues(iterator) => - memoryStore.putIterator(blockId, iterator(), putLevel) match { -case Right(s) => - size = s -case Left(iter) => - iteratorFromFailedMemoryStorePut = Some(iter) - } -case ByteBufferValues(bytes) => - bytes.rewind() - size = bytes.limit() - memoryStore.putBytes(blockId, bytes, putLevel) - } -} else if (putLevel.useDisk) { - data match { -case IteratorValues(iterator) => - diskStore.putIterator(blockId, iterator(), putLevel) match { -case Right(s) => - size = s -// putIterator() will never return Left (see its return type). - } -case ByteBufferValues(bytes) => - bytes.rewind() - size = bytes.limit() - diskStore.putBytes(blockId, bytes, putLevel) - } +bytes.rewind() +val size = bytes.limit() + +try { + if (level.useMemory) { +// Put it in memory first, even if it also has useDisk set to true; +// We will drop it to disk later if the memory store can't hold it. 
+val putSucceeded = if (level.deserialized) { + val values = dataDeserialize(blockId, bytes.duplicate()) + memoryStore.putIterator(blockId, values, level).isRight } else { - assert(putLevel == StorageLevel.NONE) - throw new BlockException( -blockId, s"Attempted to put block $blockId without specifying storage level!") + memoryStore.putBytes(blockId, size, () => bytes) +} +if (!putSucceeded && level.useDisk) { + logWarning(s"Persisting block $blockId to disk instead.") + diskStore.putBytes(blockId, bytes) } + } else if (level.useDisk) { +diskStore.putBytes(blockId, bytes) + } -val putBlockStatus = getCurrentBlockStatus(blockId, putBlockInfo) -blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid -if (blockWasSuccessfullyStored) { -
[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...
Github user sun-rui commented on a diff in the pull request: https://github.com/apache/spark/pull/10947#discussion_r55476044 --- Diff: core/src/main/scala/org/apache/spark/api/r/RRunner.scala --- @@ -0,0 +1,367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.api.r + +import java.io._ +import java.net.{InetAddress, ServerSocket} +import java.util.Arrays + +import scala.io.Source +import scala.util.Try + +import org.apache.spark._ +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.util.Utils + +/** + * A helper class to run R UDFs in Spark. + */ +private[spark] class RRunner[U]( +func: Array[Byte], +deserializer: String, +serializer: String, +packageNames: Array[Byte], +broadcastVars: Array[Broadcast[Object]], +numPartitions: Int = -1) + extends Logging { + private var bootTime: Double = _ + private var dataStream: DataInputStream = _ + val readData = numPartitions match { +case -1 => + serializer match { +case SerializationFormats.STRING => readStringData _ +case _ => readByteArrayData _ + } +case _ => readShuffledData _ + } + + def compute( + inputIterator: Iterator[_], + partitionIndex: Int, + context: TaskContext): Iterator[U] = { +// Timing start +bootTime = System.currentTimeMillis / 1000.0 + +// we expect two connections +val serverSocket = new ServerSocket(0, 2, InetAddress.getByName("localhost")) +val listenPort = serverSocket.getLocalPort() + +// The stdout/stderr is shared by multiple tasks, because we use one daemon +// to launch child process as worker. +val errThread = RRunner.createRWorker(listenPort) + +// We use two sockets to separate input and output, then it's easy to manage +// the lifecycle of them to avoid deadlock. +// TODO: optimize it to use one socket + +// the socket used to send out the input of task +serverSocket.setSoTimeout(1) +val inSocket = serverSocket.accept() +startStdinThread(inSocket.getOutputStream(), inputIterator, partitionIndex) + +// the socket used to receive the output of task +val outSocket = serverSocket.accept() +val inputStream = new BufferedInputStream(outSocket.getInputStream) +dataStream = new DataInputStream(inputStream) +serverSocket.close() + +try { + return new Iterator[U] { +def next(): U = { + val obj = _nextObj + if (hasNext) { +_nextObj = read() + } + obj +} + +var _nextObj = read() + +def hasNext(): Boolean = { + val hasMore = (_nextObj != null) + if (!hasMore) { +dataStream.close() + } + hasMore +} + } +} catch { + case e: Exception => +throw new SparkException("R computation failed with\n " + errThread.getLines()) +} + } + + /** + * Start a thread to write RDD data to the R process. 
+ */ + private def startStdinThread( + output: OutputStream, + iter: Iterator[_], + partitionIndex: Int): Unit = { +val env = SparkEnv.get +val taskContext = TaskContext.get() +val bufferSize = System.getProperty("spark.buffer.size", "65536").toInt +val stream = new BufferedOutputStream(output, bufferSize) + +new Thread("writer for R") { + override def run(): Unit = { +try { + SparkEnv.set(env) + TaskContext.setTaskContext(taskContext) + val dataOut = new DataOutputStream(stream) + dataOut.writeInt(partitionIndex) + + SerDe.writeString(dataOut, deserializer) + SerDe.writeString(dataOut, serializer)
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/11596#discussion_r55475930 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala --- @@ -445,4 +461,98 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { "f1", "b[0].f1", "f1", "c[foo]", "d[0]" ) } + + test("SQL generator for explode in projection list") { +// Basic Explode +checkHiveQl("SELECT explode(array(1,2,3)) FROM src") + +// Explode with Alias +checkHiveQl("SELECT explode(array(1,2,3)) as value FROM src") + +// Explode without FROM +checkHiveQl("select explode(array(1,2,3)) AS gencol") + +// non-generated columns in projection list +checkHiveQl("SELECT key as c1, explode(array(1,2,3)) as c2, value as c3 FROM t1") + } + + test("SQL generation for json_tuple as generator") { +checkHiveQl("SELECT key, json_tuple(jstring, 'f1', 'f2', 'f3', 'f4', 'f5') FROM parquet_t3") + } + + test("SQL generation for lateral views") { +// Filter and OUTER clause +checkHiveQl( + """ +|SELECT key, value +|FROM t1 +|LATERAL VIEW OUTER explode(value) gentab as gencol +|WHERE key = 1 + """.stripMargin +) + +// single lateral view +checkHiveQl( + """ +|SELECT * +|FROM t1 +|LATERAL VIEW explode(array(1,2,3)) gentab AS gencol +|SORT BY key ASC, gencol ASC LIMIT 1 + """.stripMargin +) + +// multiple lateral views +checkHiveQl( + """ +|SELECT gentab2.* +|FROM t1 +|LATERAL VIEW explode(array(array(1,2,3))) gentab1 AS gencol1 +|LATERAL VIEW explode(gentab1.gencol1) gentab2 AS gencol2 LIMIT 3 + """.stripMargin +) + +// No generated column aliases +checkHiveQl( + """SELECT gentab.* +|FROM t1 +|LATERAL VIEW explode(map('key1', 100, 'key2', 200)) gentab limit 2 + """.stripMargin +) + } + + test("SQL generation for lateral views in subquery") { +// Subquries in FROM clause using Generate +checkHiveQl( + """ +|SELECT subq.gencol +|FROM +|(SELECT * from t1 LATERAL VIEW explode(value) gentab AS gencol) subq + """.stripMargin) + +checkHiveQl( + """ +|SELECT subq.key +|FROM +|(SELECT key, value from t1 LATERAL VIEW explode(value) gentab AS gencol) subq + """.stripMargin +) + } + + test("SQL generation for UDTF") { --- End diff -- OK. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/11596#discussion_r55475917 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala --- @@ -297,6 +303,60 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Loggi ) } + /* This function handles the SQL generation when generators. --- End diff -- Will do. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/11596#discussion_r55475944 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala --- @@ -445,4 +461,98 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { "f1", "b[0].f1", "f1", "c[foo]", "d[0]" ) } + + test("SQL generator for explode in projection list") { +// Basic Explode +checkHiveQl("SELECT explode(array(1,2,3)) FROM src") + +// Explode with Alias +checkHiveQl("SELECT explode(array(1,2,3)) as value FROM src") + +// Explode without FROM +checkHiveQl("select explode(array(1,2,3)) AS gencol") + +// non-generated columns in projection list +checkHiveQl("SELECT key as c1, explode(array(1,2,3)) as c2, value as c3 FROM t1") + } + + test("SQL generation for json_tuple as generator") { +checkHiveQl("SELECT key, json_tuple(jstring, 'f1', 'f2', 'f3', 'f4', 'f5') FROM parquet_t3") + } + + test("SQL generation for lateral views") { --- End diff -- OK. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user dilipbiswal commented on a diff in the pull request: https://github.com/apache/spark/pull/11596#discussion_r55475915 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala --- @@ -94,6 +94,12 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Loggi case Distinct(p: Project) => projectToSQL(p, isDistinct = true) +case p@ Project(_, g: Generate) => --- End diff -- Will do. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user dilipbiswal commented on the pull request: https://github.com/apache/spark/pull/11596#issuecomment-194130778 @cloud-fan Thank you. 1. Can any Generate operator be converted to a LATERAL VIEW? Does the plan tree have to match some kind of pattern? Yes, based on the tests I have tried. I am testing this more to see if I uncover any issues. Right now I am converting ```SQL select explode(array(1,2,3)) from t1 ``` to ```SQL select tablequalifier.colname from t1 LATERAL VIEW explode(array(1,2,3)) tablequalifier as colname ``` 2. Can any Generate operator be converted to a "SELECT ...udtf..." format? No, since only one generator is allowed in a projection list. That's why we try to use the LATERAL VIEW clause. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
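To make the rewrite above concrete, here is a minimal, self-contained sketch of the string construction being described. The helper buildLateralViewSQL and its parameters are illustrative assumptions, not Spark's SQLBuilder API, and the sketch ignores OUTER generators and nested child plans.

```scala
// Hypothetical sketch only: rebuild a projection-list generator as a LATERAL VIEW.
object LateralViewSketch {
  // Assumed helper, not part of Spark's SQLBuilder.
  def buildLateralViewSQL(
      fromClause: String,          // source relation SQL, e.g. "t1"
      generatorSQL: String,        // e.g. "explode(array(1,2,3))"
      qualifier: String,           // table alias for the generated rows
      columnAliases: Seq[String]   // aliases for the generated columns
  ): String = {
    val projection = columnAliases.map(c => s"$qualifier.$c").mkString(", ")
    val aliases = columnAliases.mkString(", ")
    s"SELECT $projection FROM $fromClause LATERAL VIEW $generatorSQL $qualifier AS $aliases"
  }

  def main(args: Array[String]): Unit = {
    // Produces roughly the rewritten query shown in the comment above.
    println(buildLateralViewSQL("t1", "explode(array(1,2,3))", "tablequalifier", Seq("colname")))
  }
}
```

Chaining additional LATERAL VIEW clauses extends the same shape to multiple generators, which is what the one-generator-per-projection-list restriction forces.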
[GitHub] spark pull request: [SPARK-13674][SQL] Add wholestage codegen supp...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11517#issuecomment-194128908 **[Test build #52726 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52726/consoleFull)** for PR 11517 at commit [`6144810`](https://github.com/apache/spark/commit/6144810614245912f659c94f5341dd351b08e89e). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13674][SQL] Add wholestage codegen supp...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/11517#issuecomment-194128858 cc @davies @nongli --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13674][SQL] Add wholestage codegen supp...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/11517#discussion_r55475400 --- Diff: core/src/test/scala/org/apache/spark/rdd/PartitionwiseSampledRDDSuite.scala --- @@ -29,6 +29,8 @@ class MockSampler extends RandomSampler[Long, Long] { s = seed } + override def sample(): Int = 1 --- End diff -- This change is needed for the new `RandomSampler` API; it is also included in #11578. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13674][SQL] Add wholestage codegen supp...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/11517#discussion_r55475367 --- Diff: core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala --- @@ -41,6 +41,12 @@ trait RandomSampler[T, U] extends Pseudorandom with Cloneable with Serializable /** take a random sample */ def sample(items: Iterator[T]): Iterator[U] + /** + * Whether to sample the next item or not. + * Return how many times the next item will be sampled. Return 0 if it is not sampled. + */ + def sample(): Int + --- End diff -- These changes to `RandomSampler` are submitted in #11578. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
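To illustrate the contract of the new no-argument sample() (return how many times the next item is emitted, with 0 meaning drop it), here is a small, self-contained sketch; the trait and class names are assumptions for illustration and not Spark's RandomSampler hierarchy.

```scala
import java.util.Random

// Illustrative only: a sampler whose no-arg sample() reports how many times the
// next item should be emitted (0 means drop). Not Spark's RandomSampler hierarchy.
trait PerItemSampler[T] {
  /** How many copies of the next item to emit; 0 means drop it. */
  def sample(): Int

  /** Iterator-based sampling expressed in terms of the per-item decision. */
  def sample(items: Iterator[T]): Iterator[T] =
    items.flatMap(item => Iterator.fill(sample())(item))
}

/** Bernoulli-style sampling: each item kept at most once with probability `fraction`. */
class BernoulliLikeSampler[T](fraction: Double, seed: Long) extends PerItemSampler[T] {
  private val rng = new Random(seed)
  override def sample(): Int = if (rng.nextDouble() < fraction) 1 else 0
}

object PerItemSamplerDemo {
  def main(args: Array[String]): Unit = {
    val sampler = new BernoulliLikeSampler[Int](fraction = 0.5, seed = 42L)
    // A consumer (for example, generated per-row code) can ask item by item
    // instead of wrapping a whole iterator:
    val kept = (1 to 10).filter(_ => sampler.sample() > 0)
    println(kept)
  }
}
```

Returning a count rather than a Boolean matters because replacement-based (Poisson) sampling can emit the same item more than once, which a simple keep/drop flag cannot express.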
[GitHub] spark pull request: [SPARK-13436][SPARKR] Added parameter drop to ...
Github user olarayej commented on the pull request: https://github.com/apache/spark/pull/11318#issuecomment-194128322 @felixcheung @sun-rui @shivaram Can you folks please take a look at this one? Thank you! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11596#issuecomment-194126735 **[Test build #52725 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52725/consoleFull)** for PR 11596 at commit [`d7f0207`](https://github.com/apache/spark/commit/d7f020701b3b063beb00293af3ff4149211926ab). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11599#issuecomment-194126377 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11599#issuecomment-194126378 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52721/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11599#issuecomment-194126222 **[Test build #52721 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52721/consoleFull)** for PR 11599 at commit [`0fa21ac`](https://github.com/apache/spark/commit/0fa21acd2614fbdf128561b34c72088008d62a05). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12792][SPARKR] Refactor RRDD to support...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10947#issuecomment-194125116 For bulk movement of code, can you comment in the PR yourself on which parts were actually changed and which parts were simply moved from one place to another? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR
Github user felixcheung commented on the pull request: https://github.com/apache/spark/pull/11486#issuecomment-194125311 From SparkR test failure: ``` 1. Error: naiveBayes --- there is no package called 'mlbench' 1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage")) 2: eval(code, new_test_environment) 3: eval(expr, envir, enclos) 4: data(HouseVotes84, package = "mlbench") at test_mllib.R:146 5: find.package(package, lib.loc, verbose = verbose) 6: stop(gettextf("there is no package called %s", sQuote(pkg)), domain = NA) Error: Test failures ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/11486#discussion_r55474419 --- Diff: R/pkg/R/mllib.R --- @@ -192,3 +210,37 @@ setMethod("fitted", signature(object = "PipelineModel"), stop(paste("Unsupported model", modelName, sep = " ")) } }) + +#' Fit a naive Bayes model +#' +#' Fit a naive Bayes model, similarly to R's naiveBayes() except for omitting two arguments 'subset' +#' and 'na.action'. Users can use 'subset' function and 'fillna' or 'na.omit' function of DataFrame, +#' respectviely, to preprocess their DataFrame. We use na.omit in this interface to avoid potential +#' errors. +#' +#' @param object A symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#' @param data DataFrame for training +#' @param lambda Smoothing parameter +#' @param modelType Either 'multinomial' or 'bernoulli'. Default "multinomial". +#' @param ... Undefined parameters +#' @return A fitted naive Bayes model. +#' @rdname naiveBayes +#' @export +#' @examples +#'\dontrun{ +#' data(HouseVotes84, package = "mlbench") +#' sc <- sparkR.init() +#' sqlContext <- sparkRSQL.init(sc) +#' df <- createDataFrame(sqlContext, HouseVotes84) +#' model <- glm(Class ~ ., df, lambda = 1, modelType = "multinomial") --- End diff -- shouldn't this be `naiveBayes` instead of `glm`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/11486#discussion_r55474361 --- Diff: R/pkg/R/generics.R --- @@ -1168,3 +1168,7 @@ setGeneric("kmeans") #' @rdname fitted #' @export setGeneric("fitted") + +#' @rdname naiveBayes +#' @export +setGeneric("naiveBayes", function(formula, data, ...) { standardGeneric("naiveBayes") }) --- End diff -- could you check manually that naiveBayes from the e1071 package (http://ugrad.stat.ubc.ca/R/library/e1071/html/naiveBayes.html) still works when SparkR is loaded after e1071? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/11486#discussion_r55474302 --- Diff: R/pkg/R/mllib.R --- @@ -192,3 +210,37 @@ setMethod("fitted", signature(object = "PipelineModel"), stop(paste("Unsupported model", modelName, sep = " ")) } }) + +#' Fit a naive Bayes model +#' +#' Fit a naive Bayes model, similarly to R's naiveBayes() except for omitting two arguments 'subset' +#' and 'na.action'. Users can use 'subset' function and 'fillna' or 'na.omit' function of DataFrame, +#' respectviely, to preprocess their DataFrame. We use na.omit in this interface to avoid potential +#' errors. +#' +#' @param object A symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#' @param data DataFrame for training +#' @param lambda Smoothing parameter +#' @param modelType Either 'multinomial' or 'bernoulli'. Default "multinomial". +#' @param ... Undefined parameters --- End diff -- suggest removing `...` "param" from roxygen2 doc here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/11486#discussion_r55474272 --- Diff: R/pkg/R/mllib.R --- @@ -192,3 +210,37 @@ setMethod("fitted", signature(object = "PipelineModel"), stop(paste("Unsupported model", modelName, sep = " ")) } }) + +#' Fit a naive Bayes model +#' +#' Fit a naive Bayes model, similarly to R's naiveBayes() except for omitting two arguments 'subset' +#' and 'na.action'. Users can use 'subset' function and 'fillna' or 'na.omit' function of DataFrame, +#' respectviely, to preprocess their DataFrame. We use na.omit in this interface to avoid potential +#' errors. --- End diff -- what is this potential error? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/11486#discussion_r55474264 --- Diff: R/pkg/R/mllib.R --- @@ -192,3 +210,37 @@ setMethod("fitted", signature(object = "PipelineModel"), stop(paste("Unsupported model", modelName, sep = " ")) } }) + +#' Fit a naive Bayes model +#' +#' Fit a naive Bayes model, similarly to R's naiveBayes() except for omitting two arguments 'subset' +#' and 'na.action'. Users can use 'subset' function and 'fillna' or 'na.omit' function of DataFrame, +#' respectviely, to preprocess their DataFrame. We use na.omit in this interface to avoid potential --- End diff -- typo: respectively --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/11220#discussion_r55474213 --- Diff: R/pkg/R/DataFrame.R --- @@ -303,8 +303,28 @@ setMethod("colnames", #' @rdname columns #' @name colnames<- setMethod("colnames<-", - signature(x = "DataFrame", value = "character"), --- End diff -- I think in this case the base:: definition is implemented in such a way that it *should* (or at least it attempts to) handle different parameter types: ``` > getMethod("colnames<-") Method Definition (Class "derivedDefaultMethod"): function (x, value) { if (is.data.frame(x)) { names(x) <- value } else { dn <- dimnames(x) if (is.null(dn)) { if (is.null(value)) return(x) if ((nd <- length(dim(x))) < 2L) stop("attempt to set 'colnames' on an object with less than two dimensions") dn <- vector("list", nd) } if (length(dn) < 2L) stop("attempt to set 'colnames' on an object with less than two dimensions") if (is.null(value)) dn[2L] <- list(NULL) else dn[[2L]] <- value dimnames(x) <- dn } x } Signatures: x value target "ANY" "ANY" defined "ANY" "ANY" ``` So from what I can see, the error you see actually comes from the base implementation, as "sort of" expected. Though I'm OK with adding explicit checks to make the error clearer for the user. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11555#issuecomment-194122022 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52720/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11555#issuecomment-194122020 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11555#issuecomment-194121910 **[Test build #52720 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52720/consoleFull)** for PR 11555 at commit [`aa0a32b`](https://github.com/apache/spark/commit/aa0a32b30149620978fbdd26485f01982baa6531). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13696] Remove BlockStore class & simpli...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/11534#discussion_r55473498 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -789,117 +752,193 @@ private[spark] class BlockManager( // lockNewBlockForWriting returned a read lock on the existing block, so we must free it: releaseLock(blockId) } -return DoPutSucceeded +return true } } val startTimeMs = System.currentTimeMillis -// Size of the block in bytes -var size = 0L - -// The level we actually use to put the block -val putLevel = effectiveStorageLevel.getOrElse(level) - -// If we're storing bytes, then initiate the replication before storing them locally. +// Since we're storing bytes, initiate the replication before storing them locally. // This is faster as data is already serialized and ready to send. -val replicationFuture = data match { - case b: ByteBufferValues if putLevel.replication > 1 => -// Duplicate doesn't copy the bytes, but just creates a wrapper -val bufferView = b.buffer.duplicate() -Future { - // This is a blocking action and should run in futureExecutionContext which is a cached - // thread pool - replicate(blockId, bufferView, putLevel) -}(futureExecutionContext) - case _ => null +val replicationFuture = if (level.replication > 1) { + // Duplicate doesn't copy the bytes, but just creates a wrapper + val bufferView = bytes.duplicate() + Future { +// This is a blocking action and should run in futureExecutionContext which is a cached +// thread pool +replicate(blockId, bufferView, level) + }(futureExecutionContext) +} else { + null } var blockWasSuccessfullyStored = false -var iteratorFromFailedMemoryStorePut: Option[Iterator[Any]] = None - -putBlockInfo.synchronized { - logTrace("Put for block %s took %s to get into synchronized block" -.format(blockId, Utils.getUsedTimeMs(startTimeMs))) - try { -if (putLevel.useMemory) { - // Put it in memory first, even if it also has useDisk set to true; - // We will drop it to disk later if the memory store can't hold it. - data match { -case IteratorValues(iterator) => - memoryStore.putIterator(blockId, iterator(), putLevel) match { -case Right(s) => - size = s -case Left(iter) => - iteratorFromFailedMemoryStorePut = Some(iter) - } -case ByteBufferValues(bytes) => - bytes.rewind() - size = bytes.limit() - memoryStore.putBytes(blockId, bytes, putLevel) - } -} else if (putLevel.useDisk) { - data match { -case IteratorValues(iterator) => - diskStore.putIterator(blockId, iterator(), putLevel) match { -case Right(s) => - size = s -// putIterator() will never return Left (see its return type). - } -case ByteBufferValues(bytes) => - bytes.rewind() - size = bytes.limit() - diskStore.putBytes(blockId, bytes, putLevel) - } +bytes.rewind() +val size = bytes.limit() + +try { + if (level.useMemory) { +// Put it in memory first, even if it also has useDisk set to true; +// We will drop it to disk later if the memory store can't hold it. 
+val putSucceeded = if (level.deserialized) { + val values = dataDeserialize(blockId, bytes.duplicate()) + memoryStore.putIterator(blockId, values, level).isRight } else { - assert(putLevel == StorageLevel.NONE) - throw new BlockException( -blockId, s"Attempted to put block $blockId without specifying storage level!") + memoryStore.putBytes(blockId, size, () => bytes) +} +if (!putSucceeded && level.useDisk) { + logWarning(s"Persisting block $blockId to disk instead.") + diskStore.putBytes(blockId, bytes) } + } else if (level.useDisk) { +diskStore.putBytes(blockId, bytes) + } -val putBlockStatus = getCurrentBlockStatus(blockId, putBlockInfo) -blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid -if (blockWasSuccessfullyStored) { -
[GitHub] spark pull request: [SPARK-13696] Remove BlockStore class & simpli...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/11534#discussion_r55473363 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -789,117 +752,193 @@ private[spark] class BlockManager( // lockNewBlockForWriting returned a read lock on the existing block, so we must free it: releaseLock(blockId) } -return DoPutSucceeded +return true } } val startTimeMs = System.currentTimeMillis -// Size of the block in bytes -var size = 0L - -// The level we actually use to put the block -val putLevel = effectiveStorageLevel.getOrElse(level) - -// If we're storing bytes, then initiate the replication before storing them locally. +// Since we're storing bytes, initiate the replication before storing them locally. // This is faster as data is already serialized and ready to send. -val replicationFuture = data match { - case b: ByteBufferValues if putLevel.replication > 1 => -// Duplicate doesn't copy the bytes, but just creates a wrapper -val bufferView = b.buffer.duplicate() -Future { - // This is a blocking action and should run in futureExecutionContext which is a cached - // thread pool - replicate(blockId, bufferView, putLevel) -}(futureExecutionContext) - case _ => null +val replicationFuture = if (level.replication > 1) { + // Duplicate doesn't copy the bytes, but just creates a wrapper + val bufferView = bytes.duplicate() + Future { +// This is a blocking action and should run in futureExecutionContext which is a cached +// thread pool +replicate(blockId, bufferView, level) + }(futureExecutionContext) +} else { + null } var blockWasSuccessfullyStored = false -var iteratorFromFailedMemoryStorePut: Option[Iterator[Any]] = None - -putBlockInfo.synchronized { - logTrace("Put for block %s took %s to get into synchronized block" -.format(blockId, Utils.getUsedTimeMs(startTimeMs))) - try { -if (putLevel.useMemory) { - // Put it in memory first, even if it also has useDisk set to true; - // We will drop it to disk later if the memory store can't hold it. - data match { -case IteratorValues(iterator) => - memoryStore.putIterator(blockId, iterator(), putLevel) match { -case Right(s) => - size = s -case Left(iter) => - iteratorFromFailedMemoryStorePut = Some(iter) - } -case ByteBufferValues(bytes) => - bytes.rewind() - size = bytes.limit() - memoryStore.putBytes(blockId, bytes, putLevel) - } -} else if (putLevel.useDisk) { - data match { -case IteratorValues(iterator) => - diskStore.putIterator(blockId, iterator(), putLevel) match { -case Right(s) => - size = s -// putIterator() will never return Left (see its return type). - } -case ByteBufferValues(bytes) => - bytes.rewind() - size = bytes.limit() - diskStore.putBytes(blockId, bytes, putLevel) - } +bytes.rewind() +val size = bytes.limit() + +try { + if (level.useMemory) { +// Put it in memory first, even if it also has useDisk set to true; +// We will drop it to disk later if the memory store can't hold it. 
+val putSucceeded = if (level.deserialized) { + val values = dataDeserialize(blockId, bytes.duplicate()) + memoryStore.putIterator(blockId, values, level).isRight } else { - assert(putLevel == StorageLevel.NONE) - throw new BlockException( -blockId, s"Attempted to put block $blockId without specifying storage level!") + memoryStore.putBytes(blockId, size, () => bytes) +} +if (!putSucceeded && level.useDisk) { + logWarning(s"Persisting block $blockId to disk instead.") + diskStore.putBytes(blockId, bytes) } + } else if (level.useDisk) { +diskStore.putBytes(blockId, bytes) + } -val putBlockStatus = getCurrentBlockStatus(blockId, putBlockInfo) -blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid -if (blockWasSuccessfullyStored) { -
[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11486#issuecomment-194117566 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11486#issuecomment-194117569 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52723/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11486#issuecomment-194117451 **[Test build #52723 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52723/consoleFull)** for PR 11486 at commit [`30e9c37`](https://github.com/apache/spark/commit/30e9c372207ed206a7dc294b5726ad008a18ed12). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13696] Remove BlockStore class & simpli...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/11534#discussion_r55472870 --- Diff: core/src/main/scala/org/apache/spark/storage/DiskStore.scala --- @@ -17,112 +17,100 @@ package org.apache.spark.storage -import java.io.{File, FileOutputStream, IOException, RandomAccessFile} +import java.io.{FileOutputStream, IOException, RandomAccessFile} import java.nio.ByteBuffer import java.nio.channels.FileChannel.MapMode -import org.apache.spark.Logging +import com.google.common.io.Closeables + +import org.apache.spark.{Logging, SparkConf} import org.apache.spark.util.Utils /** * Stores BlockManager blocks on disk. */ -private[spark] class DiskStore(blockManager: BlockManager, diskManager: DiskBlockManager) - extends BlockStore(blockManager) with Logging { +private[spark] class DiskStore(conf: SparkConf, diskManager: DiskBlockManager) extends Logging { - val minMemoryMapBytes = blockManager.conf.getSizeAsBytes("spark.storage.memoryMapThreshold", "2m") + private val minMemoryMapBytes = conf.getSizeAsBytes("spark.storage.memoryMapThreshold", "2m") - override def getSize(blockId: BlockId): Long = { + def getSize(blockId: BlockId): Long = { diskManager.getFile(blockId.name).length } - override def putBytes(blockId: BlockId, _bytes: ByteBuffer, level: StorageLevel): Unit = { -// So that we do not modify the input offsets ! -// duplicate does not copy buffer, so inexpensive -val bytes = _bytes.duplicate() + /** + * Invokes the provided callback function to write the specific block. + * + * @throws IllegalStateException if the block already exists in the disk store. + */ + def put(blockId: BlockId)(writeFunc: FileOutputStream => Unit): Unit = { +if (contains(blockId)) { + throw new IllegalStateException(s"Block $blockId is already present in the disk store") +} logDebug(s"Attempting to put block $blockId") val startTime = System.currentTimeMillis val file = diskManager.getFile(blockId) -val channel = new FileOutputStream(file).getChannel -Utils.tryWithSafeFinally { - while (bytes.remaining > 0) { -channel.write(bytes) +val fileOutputStream = new FileOutputStream(file) +var threwException: Boolean = true +try { + writeFunc(fileOutputStream) + threwException = false +} finally { + try { +Closeables.close(fileOutputStream, threwException) + } finally { + if (threwException) { + remove(blockId) +} } -} { - channel.close() } val finishTime = System.currentTimeMillis logDebug("Block %s stored as %s file on disk in %d ms".format( - file.getName, Utils.bytesToString(bytes.limit), finishTime - startTime)) + file.getName, + Utils.bytesToString(file.length()), + finishTime - startTime)) } - override def putIterator( - blockId: BlockId, - values: Iterator[Any], - level: StorageLevel): Right[Iterator[Any], Long] = { -logDebug(s"Attempting to write values for block $blockId") -val startTime = System.currentTimeMillis -val file = diskManager.getFile(blockId) -val outputStream = new FileOutputStream(file) -try { + def putBytes(blockId: BlockId, _bytes: ByteBuffer): Unit = { --- End diff -- Comment function and guarantees on _bytes on return. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
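For context, the callback-style `put` shown in the diff above can be used roughly as in the following sketch; `diskStore`, `blockId`, and `data` (a `ByteBuffer`) are assumed to already exist in the caller's scope, and this is illustrative only, not code from the PR:

```scala
// Illustrative use of DiskStore.put(blockId)(writeFunc): the caller writes the bytes,
// while DiskStore opens/closes the stream and cleans up the file if the write fails.
diskStore.put(blockId) { fileOutputStream =>
  val channel = fileOutputStream.getChannel
  // Duplicate so the caller's buffer position and limit are left untouched.
  val bytes = data.duplicate()
  while (bytes.remaining() > 0) {
    channel.write(bytes)
  }
}
```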
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11599#issuecomment-194115203 **[Test build #52724 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52724/consoleFull)** for PR 11599 at commit [`a31b1b5`](https://github.com/apache/spark/commit/a31b1b588949f2f92981f7d1a7d04d6e1806ccd1). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11555#issuecomment-194114531 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11555#issuecomment-194114532 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52718/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12718][SPARK-13720][SQL] SQL generation...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11555#issuecomment-194113999 **[Test build #52718 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52718/consoleFull)** for PR 11555 at commit [`dab7a2f`](https://github.com/apache/spark/commit/dab7a2f1a5cc0438405b0fa1cf532ab883bed7e7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55472267 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala --- @@ -157,6 +157,14 @@ class ColumnPruningSuite extends PlanTest { comparePlans(Optimize.execute(query), expected) } + test("Eliminate the Project with an empty projectList") { +val input = OneRowRelation +val query = + Project(Literal(1).as("1") :: Nil, Project(Literal(1).as("1") :: Nil, input)).analyze --- End diff -- Let me add an empty List too. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55472012 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala --- @@ -157,6 +157,14 @@ class ColumnPruningSuite extends PlanTest { comparePlans(Optimize.execute(query), expected) } + test("Eliminate the Project with an empty projectList") { +val input = OneRowRelation +val query = + Project(Literal(1).as("1") :: Nil, Project(Literal(1).as("1") :: Nil, input)).analyze --- End diff -- When running `Optimize.execute(query)`, the inner Project's `projectList` is first pruned to empty; the now-empty inner Project is then removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
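Put together, the optimizer behavior described above amounts to roughly the following sketch, assuming the suite's `Optimize` batch, the catalyst DSL imports, and the `comparePlans` helper are in scope:

```scala
val input = OneRowRelation
val query =
  Project(Literal(1).as("1") :: Nil, Project(Literal(1).as("1") :: Nil, input)).analyze
// Column pruning first rewrites the inner Project's projectList to Nil (the outer Project
// references none of its output), and the new rule then drops the now-empty inner Project.
val expected = Project(Literal(1).as("1") :: Nil, input).analyze
comparePlans(Optimize.execute(query), expected)
```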
[GitHub] spark pull request: [SPARK-13656][SQL] Remove spark.sql.parquet.ca...
Github user maropu commented on the pull request: https://github.com/apache/spark/pull/11576#issuecomment-194112721 Yes, ... Scanning all the files is pretty expensive, though; we could add a metadata file to detect changes: when `DataFrameWriter` writes something to the files, it updates the metadata, and `ParquetRelation` then checks the metadata to detect updates. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
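A minimal sketch of that idea follows, with a hypothetical side-file name (`_metadata.version`); this is not an existing Spark convention, just an illustration of the write-then-check flow:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Hypothetical helper: the writer records a timestamp in a small side file after each write,
// and the reader compares it against a cached value instead of listing every data file.
object WriteMetadataSketch {
  private def metadataPath(dir: String) = Paths.get(dir, "_metadata.version")

  def recordWrite(dir: String): Unit =
    Files.write(metadataPath(dir), System.currentTimeMillis().toString.getBytes(StandardCharsets.UTF_8))

  def lastWrite(dir: String): Option[Long] = {
    val p = metadataPath(dir)
    if (Files.exists(p)) Some(new String(Files.readAllBytes(p), StandardCharsets.UTF_8).trim.toLong)
    else None
  }
}
```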
[GitHub] spark pull request: [SPARK-13763] [SQL] Remove Project when its pr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/11599#discussion_r55471810 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ColumnPruningSuite.scala --- @@ -157,6 +157,14 @@ class ColumnPruningSuite extends PlanTest { comparePlans(Optimize.execute(query), expected) } + test("Eliminate the Project with an empty projectList") { +val input = OneRowRelation +val query = + Project(Literal(1).as("1") :: Nil, Project(Literal(1).as("1") :: Nil, input)).analyze --- End diff -- Where do you test empty projectList? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13242] [SQL] codegen fallback in case-w...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/11592#issuecomment-194112069 This looks pretty good. Just some minor comments on naming. It'd be great to be more consistent among "branch", "case", and "switch". --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13327][SPARKR] Added parameter validati...
Github user sun-rui commented on the pull request: https://github.com/apache/spark/pull/11220#issuecomment-194111890 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13696] Remove BlockStore class & simpli...
Github user nongli commented on a diff in the pull request: https://github.com/apache/spark/pull/11534#discussion_r55471546 --- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala --- @@ -34,13 +35,15 @@ private case class MemoryEntry(value: Any, size: Long, deserialized: Boolean) * Stores blocks in memory, either as Arrays of deserialized Java objects or as * serialized ByteBuffers. */ -private[spark] class MemoryStore(blockManager: BlockManager, memoryManager: MemoryManager) - extends BlockStore(blockManager) { +private[spark] class MemoryStore( +conf: SparkConf, +blockManager: BlockManager, +memoryManager: MemoryManager) + extends Logging { // Note: all changes to memory allocations, notably putting blocks, evicting blocks, and // acquiring or releasing unroll memory, must be synchronized on `memoryManager`! - private val conf = blockManager.conf private val entries = new LinkedHashMap[BlockId, MemoryEntry](32, 0.75f, true) --- End diff -- Hmm, okay. DiskStore also needs to be thread-safe, right? It just happens that it doesn't maintain any interesting state right now. I think it's worth a comment so that people modifying this know what the expected properties are. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
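The invariant quoted in the diff ("must be synchronized on `memoryManager`") boils down to something like the following simplified sketch; the types are toy stand-ins, not the actual MemoryStore:

```scala
import java.util.LinkedHashMap

// Toy stand-in for MemoryEntry; the real one also tracks the stored value.
case class EntrySketch(size: Long, deserialized: Boolean)

class MemoryStoreSketch(memoryManager: AnyRef) {
  // Access-ordered map, mirroring LinkedHashMap(32, 0.75f, true) from the diff.
  // Every mutation below holds the memoryManager lock, per the comment in the diff.
  private val entries = new LinkedHashMap[String, EntrySketch](32, 0.75f, true)

  def putEntry(blockId: String, entry: EntrySketch): Unit = memoryManager.synchronized {
    entries.put(blockId, entry)
  }

  def evict(blockId: String): Boolean = memoryManager.synchronized {
    entries.remove(blockId) != null
  }
}
```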
[GitHub] spark pull request: [SPARK-13242] [SQL] codegen fallback in case-w...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11592#discussion_r55471524 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CodeGenerationSuite.scala --- @@ -58,6 +58,27 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper { } } + test("SPARK-13242: complicated case-when expressions") { --- End diff -- SPARK-13242: case-when expression with large number of branches (or cases)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13242] [SQL] codegen fallback in case-w...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11592#discussion_r55471486 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala --- @@ -136,7 +136,16 @@ case class CaseWhen(branches: Seq[(Expression, Expression)], elseValue: Option[E } } + def shouldCodegen: Boolean = { +branches.length < CaseWhen.MAX_NUMBER_OF_SWITCHES + } + override def genCode(ctx: CodegenContext, ev: ExprCode): String = { +if (!shouldCodegen) { + // Fallback to interpreted mode if there are too many branches, or it may reach the + // 64K limit (number of bytecode for single Java method). + return super.genCode(ctx, ev) --- End diff -- this super is slightly brittle. Can you see if there is a way in Scala to explicitly call genCode in CodegenFallback? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
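Scala does allow naming the parent explicitly with a qualified super call, e.g. `super[CodegenFallback].genCode(ctx, ev)`, provided `CodegenFallback` is a direct parent of `CaseWhen`. A self-contained sketch of the language feature, with toy types rather than the catalyst classes:

```scala
// Qualified super: super[Fallback] calls Fallback's implementation explicitly,
// rather than whatever `super` resolves to via trait linearization.
trait Fallback {
  def genCode(input: String): String = s"/* interpreted: $input */"
}

class CaseWhenSketch(numBranches: Int) extends Fallback {
  private val maxBranchesForCodegen = 20 // mirrors the threshold in the diff

  override def genCode(input: String): String = {
    if (numBranches >= maxBranchesForCodegen) {
      // Fall back explicitly to the trait's interpreted path.
      return super[Fallback].genCode(input)
    }
    s"/* generated switch over $numBranches branches for $input */"
  }
}
```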
[GitHub] spark pull request: [SPARK-13242] [SQL] codegen fallback in case-w...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11592#discussion_r55471461 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala --- @@ -136,7 +136,16 @@ case class CaseWhen(branches: Seq[(Expression, Expression)], elseValue: Option[E } } + def shouldCodegen: Boolean = { +branches.length < CaseWhen.MAX_NUMBER_OF_SWITCHES + } + override def genCode(ctx: CodegenContext, ev: ExprCode): String = { +if (!shouldCodegen) { + // Fallback to interpreted mode if there are too many branches, or it may reach the + // 64K limit (number of bytecode for single Java method). --- End diff -- number of bytecode for single Java method -> limit on bytecode size for a single function --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13242] [SQL] codegen fallback in case-w...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11592#discussion_r55471445 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala --- @@ -136,7 +136,16 @@ case class CaseWhen(branches: Seq[(Expression, Expression)], elseValue: Option[E } } + def shouldCodegen: Boolean = { +branches.length < CaseWhen.MAX_NUMBER_OF_SWITCHES + } + override def genCode(ctx: CodegenContext, ev: ExprCode): String = { +if (!shouldCodegen) { + // Fallback to interpreted mode if there are too many branches, or it may reach the --- End diff -- or -> as --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13449] Naive Bayes wrapper in SparkR
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11486#issuecomment-194111388 **[Test build #52723 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52723/consoleFull)** for PR 11486 at commit [`30e9c37`](https://github.com/apache/spark/commit/30e9c372207ed206a7dc294b5726ad008a18ed12). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13242] [SQL] codegen fallback in case-w...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/11592#discussion_r55471424 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala --- @@ -205,6 +214,9 @@ case class CaseWhen(branches: Seq[(Expression, Expression)], elseValue: Option[E /** Factory methods for CaseWhen. */ object CaseWhen { + // The maxium number of switches supported with codegen. + val MAX_NUMBER_OF_SWITCHES = 20 --- End diff -- MAX_NUM_CASES_FOR_CODEGEN --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11596#discussion_r55471264 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala --- @@ -445,4 +461,98 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { "f1", "b[0].f1", "f1", "c[foo]", "d[0]" ) } + + test("SQL generator for explode in projection list") { +// Basic Explode +checkHiveQl("SELECT explode(array(1,2,3)) FROM src") + +// Explode with Alias +checkHiveQl("SELECT explode(array(1,2,3)) as value FROM src") + +// Explode without FROM +checkHiveQl("select explode(array(1,2,3)) AS gencol") + +// non-generated columns in projection list +checkHiveQl("SELECT key as c1, explode(array(1,2,3)) as c2, value as c3 FROM t1") + } + + test("SQL generation for json_tuple as generator") { +checkHiveQl("SELECT key, json_tuple(jstring, 'f1', 'f2', 'f3', 'f4', 'f5') FROM parquet_t3") + } + + test("SQL generation for lateral views") { --- End diff -- Please add another test case for joining the lateral view with a regular table. Thanks! ```LATERAL VIEW .., t1 as table1``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
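A sketch of what that extra test might look like, following the comma-separated `LATERAL VIEW .., t1 as table1` shape in the request; the table and column names are illustrative, and whether the SQL generator round-trips this exact form is an open question:

```scala
// Hypothetical test case: a lateral view combined with a second (self-joined) table reference.
checkHiveQl(
  """
    |SELECT table1.key, gencol
    |FROM t1
    |LATERAL VIEW explode(t1.value) gentab AS gencol, t1 AS table1
    |WHERE t1.key = table1.key
  """.stripMargin
)
```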
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11596#discussion_r55471184 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/LogicalPlanToSQLSuite.scala --- @@ -445,4 +461,98 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { "f1", "b[0].f1", "f1", "c[foo]", "d[0]" ) } + + test("SQL generator for explode in projection list") { +// Basic Explode +checkHiveQl("SELECT explode(array(1,2,3)) FROM src") + +// Explode with Alias +checkHiveQl("SELECT explode(array(1,2,3)) as value FROM src") + +// Explode without FROM +checkHiveQl("select explode(array(1,2,3)) AS gencol") + +// non-generated columns in projection list +checkHiveQl("SELECT key as c1, explode(array(1,2,3)) as c2, value as c3 FROM t1") + } + + test("SQL generation for json_tuple as generator") { +checkHiveQl("SELECT key, json_tuple(jstring, 'f1', 'f2', 'f3', 'f4', 'f5') FROM parquet_t3") + } + + test("SQL generation for lateral views") { +// Filter and OUTER clause +checkHiveQl( + """ +|SELECT key, value +|FROM t1 +|LATERAL VIEW OUTER explode(value) gentab as gencol +|WHERE key = 1 + """.stripMargin +) + +// single lateral view +checkHiveQl( + """ +|SELECT * +|FROM t1 +|LATERAL VIEW explode(array(1,2,3)) gentab AS gencol +|SORT BY key ASC, gencol ASC LIMIT 1 + """.stripMargin +) + +// multiple lateral views +checkHiveQl( + """ +|SELECT gentab2.* +|FROM t1 +|LATERAL VIEW explode(array(array(1,2,3))) gentab1 AS gencol1 +|LATERAL VIEW explode(gentab1.gencol1) gentab2 AS gencol2 LIMIT 3 + """.stripMargin +) + +// No generated column aliases +checkHiveQl( + """SELECT gentab.* +|FROM t1 +|LATERAL VIEW explode(map('key1', 100, 'key2', 200)) gentab limit 2 + """.stripMargin +) + } + + test("SQL generation for lateral views in subquery") { +// Subquries in FROM clause using Generate +checkHiveQl( + """ +|SELECT subq.gencol +|FROM +|(SELECT * from t1 LATERAL VIEW explode(value) gentab AS gencol) subq + """.stripMargin) + +checkHiveQl( + """ +|SELECT subq.key +|FROM +|(SELECT key, value from t1 LATERAL VIEW explode(value) gentab AS gencol) subq + """.stripMargin +) + } + + test("SQL generation for UDTF") { --- End diff -- Please add more UDTF cases, as what @cloud-fan suggested. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12719][SQL] [WIP] SQL generation suppor...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/11596#discussion_r55471043 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/SQLBuilder.scala --- @@ -297,6 +303,60 @@ class SQLBuilder(logicalPlan: LogicalPlan, sqlContext: SQLContext) extends Loggi ) } + /* This function handles the SQL generation when generators. --- End diff -- style issue: please use `/**` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
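For reference, the style being requested is the Scaladoc `/**` opener, e.g.:

```scala
/**
 * This function handles the SQL generation when the plan contains generators.
 */
```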