spark git commit: [SPARK-9439] [YARN] External shuffle service robust to NM restarts using leveldb
Repository: spark Updated Branches: refs/heads/master bb220f657 - 708036c1d [SPARK-9439] [YARN] External shuffle service robust to NM restarts using leveldb https://issues.apache.org/jira/browse/SPARK-9439 In general, Yarn apps should be robust to NodeManager restarts. However, if you run spark with the external shuffle service on, after a NM restart all shuffles fail, b/c the shuffle service has lost some state with info on each executor. (Note the shuffle data is perfectly fine on disk across a NM restart, the problem is we've lost the small bit of state that lets us *find* those files.) The solution proposed here is that the external shuffle service can write out its state to leveldb (backed by a local file) every time an executor is added. When running with yarn, that file is in the NM's local dir. Whenever the service is started, it looks for that file, and if it exists, it reads the file and re-registers all executors there. Nothing is changed in non-yarn modes with this patch. The service is not given a place to save the state to, so it operates the same as before. This should make it easy to update other cluster managers as well, by just supplying the right file the equivalent of yarn's `initializeApplication` -- I'm not familiar enough with those modes to know how to do that. Author: Imran Rashid iras...@cloudera.com Closes #7943 from squito/leveldb_external_shuffle_service_NM_restart and squashes the following commits: 0d285d3 [Imran Rashid] review feedback 70951d6 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart 5c71c8c [Imran Rashid] save executor to db before registering; style 2499c8c [Imran Rashid] explicit dependency on jackson-annotations 795d28f [Imran Rashid] review feedback 81f80e2 [Imran Rashid] Merge branch 'master' into leveldb_external_shuffle_service_NM_restart 594d520 [Imran Rashid] use json to serialize application executor info 1a7980b [Imran Rashid] version 8267d2a [Imran Rashid] style e9f99e8 [Imran Rashid] cleanup the handling of bad dbs a little 9378ba3 [Imran Rashid] fail gracefully on corrupt leveldb files acedb62 [Imran Rashid] switch to writing out one record per executor 79922b7 [Imran Rashid] rely on yarn to call stopApplication; assorted cleanup 12b6a35 [Imran Rashid] save registered executors when apps are removed; add tests c878fbe [Imran Rashid] better explanation of shuffle service port handling 694934c [Imran Rashid] only open leveldb connection once per service d596410 [Imran Rashid] store executor data in leveldb 59800b7 [Imran Rashid] Files.move in case renaming is unsupported 32fe5ae [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart d7450f0 [Imran Rashid] style f729e2b [Imran Rashid] debugging 4492835 [Imran Rashid] lol, dont use a PrintWriter b/c of scalastyle checks 0a39b98 [Imran Rashid] Merge branch 'master' into external_shuffle_service_NM_restart 55f49fc [Imran Rashid] make sure the service doesnt die if the registered executor file is corrupt; add tests 245db19 [Imran Rashid] style 62586a6 [Imran Rashid] just serialize the whole executors map bdbbf0d [Imran Rashid] comments, remove some unnecessary changes 857331a [Imran Rashid] better tests comments bb9d1e6 [Imran Rashid] formatting bdc4b32 [Imran Rashid] rename 86e0cb9 [Imran Rashid] for tests, shuffle service finds an open port 23994ff [Imran Rashid] style 7504de8 [Imran Rashid] style a36729c [Imran Rashid] cleanup efb6195 [Imran Rashid] proper unit test, and no longer leak if apps stop during NM restart dd93dc0 [Imran 
Rashid] test for shuffle service w/ NM restarts d596969 [Imran Rashid] cleanup imports 0e9d69b [Imran Rashid] better names 9eae119 [Imran Rashid] cleanup lots of duplication 1136f44 [Imran Rashid] test needs to have an actual shuffle 0b588bd [Imran Rashid] more fixes ... ad122ef [Imran Rashid] more fixes 5e5a7c3 [Imran Rashid] fix build c69f46b [Imran Rashid] maybe working version, needs tests cleanup ... bb3ba49 [Imran Rashid] minor cleanup 36127d3 [Imran Rashid] wip b9d2ced [Imran Rashid] incomplete setup for external shuffle service tests Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/708036c1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/708036c1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/708036c1 Branch: refs/heads/master Commit: 708036c1de52d674ceff30ac465e1dcedeb8dde8 Parents: bb220f6 Author: Imran Rashid iras...@cloudera.com Authored: Fri Aug 21 08:41:36 2015 -0500 Committer: Tom Graves tgra...@yahoo-inc.com Committed: Fri Aug 21 08:41:36 2015 -0500 -- .../spark/deploy/ExternalShuffleService.scala | 2 +- .../mesos/MesosExternalShuffleService.scala | 2 +- .../org/apache/spark/storage/BlockManager.scala | 14 +- .../spark/ExternalShuffleServiceSuite.scala | 2 +- network/shuffle/pom.xml
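The recovery flow described in the commit message above — write one JSON record per registered executor to a LevelDB file, then replay those records when the shuffle service starts — can be sketched roughly as follows. This is a simplified illustration, not the code from this commit (Spark's actual implementation lives in the network/shuffle module); the class name, the "appId;execId" key format, and the leveldbjni/jackson-module-scala dependencies are assumptions made for the sketch.

```scala
import java.io.File

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.fusesource.leveldbjni.JniDBFactory.factory
import org.iq80.leveldb.{DB, Options}

// Mirrors the information the shuffle service needs to locate an executor's files.
case class ExecutorShuffleInfo(localDirs: Array[String], subDirsPerLocalDir: Int, shuffleManager: String)

class RecoverableRegistry(registryFile: File) {
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  private val db: DB = factory.open(registryFile, new Options().createIfMissing(true))

  // Persist the executor's record before acknowledging registration, so the mapping
  // from (app, executor) to shuffle dirs survives a NodeManager restart.
  def registerExecutor(appId: String, execId: String, info: ExecutorShuffleInfo): Unit =
    db.put(s"$appId;$execId".getBytes("UTF-8"), mapper.writeValueAsBytes(info))

  // Called when the service starts: replay every saved record so the shuffle files
  // that are still on disk can be found again.
  def reloadRegisteredExecutors(): Map[(String, String), ExecutorShuffleInfo] = {
    val restored = scala.collection.mutable.Map.empty[(String, String), ExecutorShuffleInfo]
    val it = db.iterator()
    it.seekToFirst()
    while (it.hasNext) {
      val entry = it.next()
      val Array(appId, execId) = new String(entry.getKey, "UTF-8").split(";", 2)
      restored((appId, execId)) = mapper.readValue(entry.getValue, classOf[ExecutorShuffleInfo])
    }
    it.close()
    restored.toMap
  }

  // When an application is stopped, drop its records so stale executors are not replayed.
  def removeApplication(appId: String): Unit = {
    val it = db.iterator()
    it.seekToFirst()
    while (it.hasNext) {
      val key = it.next().getKey
      if (new String(key, "UTF-8").startsWith(appId + ";")) db.delete(key)
    }
    it.close()
  }
}
```

Writing the record before completing registration (per the "save executor to db before registering" commit) is what guarantees that a crash between the two steps can only lose an executor the service never acknowledged, never one it did.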
spark git commit: [SPARK-10130] [SQL] type coercion for IF should have children resolved first
Repository: spark Updated Branches: refs/heads/branch-1.5 e5e601739 - 817c38a0a [SPARK-10130] [SQL] type coercion for IF should have children resolved first Type coercion for IF should have children resolved first, or we could meet unresolved exception. Author: Daoyuan Wang daoyuan.w...@intel.com Closes #8331 from adrian-wang/spark10130. (cherry picked from commit 3c462f5d87a9654c5a68fd658a40f5062029fd9a) Signed-off-by: Michael Armbrust mich...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/817c38a0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/817c38a0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/817c38a0 Branch: refs/heads/branch-1.5 Commit: 817c38a0a1405c2bf407070e13e16934c777cd89 Parents: e5e6017 Author: Daoyuan Wang daoyuan.w...@intel.com Authored: Fri Aug 21 12:21:51 2015 -0700 Committer: Michael Armbrust mich...@databricks.com Committed: Fri Aug 21 12:22:08 2015 -0700 -- .../apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala | 1 + .../src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala | 7 +++ 2 files changed, 8 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/817c38a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala index f2f2ba2..2cb067f 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala @@ -639,6 +639,7 @@ object HiveTypeCoercion { */ object IfCoercion extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan resolveExpressions { + case e if !e.childrenResolved = e // Find tightest common type for If, if the true value and false value have different types. case i @ If(pred, left, right) if left.dataType != right.dataType = findTightestCommonTypeToString(left.dataType, right.dataType).map { widestType = http://git-wip-us.apache.org/repos/asf/spark/blob/817c38a0/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala index da50aec..dcb4e83 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala @@ -1679,4 +1679,11 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { checkAnswer(sqlContext.table(`db.t`), df) } } + + test(SPARK-10130 type coercion for IF should have children resolved first) { +val df = Seq((1, 1), (-1, 1)).toDF(key, value) +df.registerTempTable(src) +checkAnswer( + sql(SELECT IF(a 0, a, 0) FROM (SELECT key a FROM src) temp), Seq(Row(1), Row(0))) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10130] [SQL] type coercion for IF should have children resolved first
Repository: spark Updated Branches: refs/heads/master 708036c1d - 3c462f5d8 [SPARK-10130] [SQL] type coercion for IF should have children resolved first Type coercion for IF should have children resolved first, or we could meet unresolved exception. Author: Daoyuan Wang daoyuan.w...@intel.com Closes #8331 from adrian-wang/spark10130. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3c462f5d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3c462f5d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3c462f5d Branch: refs/heads/master Commit: 3c462f5d87a9654c5a68fd658a40f5062029fd9a Parents: 708036c Author: Daoyuan Wang daoyuan.w...@intel.com Authored: Fri Aug 21 12:21:51 2015 -0700 Committer: Michael Armbrust mich...@databricks.com Committed: Fri Aug 21 12:21:51 2015 -0700 -- .../apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala | 1 + .../src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala | 7 +++ 2 files changed, 8 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3c462f5d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala index f2f2ba2..2cb067f 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala @@ -639,6 +639,7 @@ object HiveTypeCoercion { */ object IfCoercion extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan resolveExpressions { + case e if !e.childrenResolved = e // Find tightest common type for If, if the true value and false value have different types. case i @ If(pred, left, right) if left.dataType != right.dataType = findTightestCommonTypeToString(left.dataType, right.dataType).map { widestType = http://git-wip-us.apache.org/repos/asf/spark/blob/3c462f5d/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala index da50aec..dcb4e83 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala @@ -1679,4 +1679,11 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { checkAnswer(sqlContext.table(`db.t`), df) } } + + test(SPARK-10130 type coercion for IF should have children resolved first) { +val df = Seq((1, 1), (-1, 1)).toDF(key, value) +df.registerTempTable(src) +checkAnswer( + sql(SELECT IF(a 0, a, 0) FROM (SELECT key a FROM src) temp), Seq(Row(1), Row(0))) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
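The guard added above (`case e if !e.childrenResolved => e`, with the `>` dropped in the extracted diff text) makes the coercion rule a no-op until the `If` expression's children are resolved. The regression test in the patch boils down to the following spark-shell session; this is an illustrative restatement assuming a Spark 1.5 shell with `sqlContext` in scope, not additional code from the commit.

```scala
// Before this fix, analyzing IF over a subquery alias whose attributes were not yet
// resolved could raise an unresolved-expression error during type coercion.
import sqlContext.implicits._

val df = Seq((1, 1), (-1, 1)).toDF("key", "value")
df.registerTempTable("src")

// The inner SELECT renames `key` to `a`; IF's children only resolve after that alias
// does, which is why the rule must check childrenResolved before touching dataType.
sqlContext.sql("SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp").collect()
// expected rows: 1 and 0
```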
spark git commit: [SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in PySpark-Streaming transform function
Repository: spark Updated Branches: refs/heads/branch-1.5 817c38a0a - 4e72839b7 [SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in PySpark-Streaming transform function Details of the bug and explanations can be seen in [SPARK-10122](https://issues.apache.org/jira/browse/SPARK-10122). tdas , please help to review. Author: jerryshao ss...@hortonworks.com Closes #8347 from jerryshao/SPARK-10122 and squashes the following commits: 4039b16 [jerryshao] Fix getOffsetRanges in transform() bug Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4e72839b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4e72839b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4e72839b Branch: refs/heads/branch-1.5 Commit: 4e72839b7b1e0b925837b49534a07188a603d838 Parents: 817c38a Author: jerryshao ss...@hortonworks.com Authored: Fri Aug 21 13:10:11 2015 -0700 Committer: Tathagata Das tathagata.das1...@gmail.com Committed: Fri Aug 21 13:17:48 2015 -0700 -- python/pyspark/streaming/dstream.py | 5 - python/pyspark/streaming/tests.py | 4 +++- 2 files changed, 7 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4e72839b/python/pyspark/streaming/dstream.py -- diff --git a/python/pyspark/streaming/dstream.py b/python/pyspark/streaming/dstream.py index 8dcb964..698336c 100644 --- a/python/pyspark/streaming/dstream.py +++ b/python/pyspark/streaming/dstream.py @@ -610,7 +610,10 @@ class TransformedDStream(DStream): self.is_checkpointed = False self._jdstream_val = None -if (isinstance(prev, TransformedDStream) and +# Using type() to avoid folding the functions and compacting the DStreams which is not +# not strictly a object of TransformedDStream. +# Changed here is to avoid bug in KafkaTransformedDStream when calling offsetRanges(). +if (type(prev) is TransformedDStream and not prev.is_cached and not prev.is_checkpointed): prev_func = prev.func self.func = lambda t, rdd: func(t, prev_func(t, rdd)) http://git-wip-us.apache.org/repos/asf/spark/blob/4e72839b/python/pyspark/streaming/tests.py -- diff --git a/python/pyspark/streaming/tests.py b/python/pyspark/streaming/tests.py index 6108c84..214d5be 100644 --- a/python/pyspark/streaming/tests.py +++ b/python/pyspark/streaming/tests.py @@ -850,7 +850,9 @@ class KafkaStreamTests(PySparkStreamingTestCase): offsetRanges.append(o) return rdd -stream.transform(transformWithOffsetRanges).foreachRDD(lambda rdd: rdd.count()) +# Test whether it is ok mixing KafkaTransformedDStream and TransformedDStream together, +# only the TransformedDstreams can be folded together. +stream.transform(transformWithOffsetRanges).map(lambda kv: kv[1]).count().pprint() self.ssc.start() self.wait_for(offsetRanges, 1) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[3/3] spark git commit: [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation
[SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation Author: MechCoder manojkumarsivaraj...@gmail.com Closes #8352 from MechCoder/since. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5b028ed Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5b028ed Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5b028ed Branch: refs/heads/master Commit: f5b028ed2f1ad6de43c8b50ebf480e1b6c047035 Parents: d89cc38 Author: MechCoder manojkumarsivaraj...@gmail.com Authored: Fri Aug 21 14:19:24 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Aug 21 14:19:24 2015 -0700 -- .../classification/ClassificationModel.scala| 8 +- .../classification/LogisticRegression.scala | 30 +++--- .../spark/mllib/classification/NaiveBayes.scala | 7 +- .../apache/spark/mllib/classification/SVM.scala | 28 ++--- .../mllib/clustering/GaussianMixture.scala | 28 ++--- .../mllib/clustering/GaussianMixtureModel.scala | 28 ++--- .../apache/spark/mllib/clustering/KMeans.scala | 50 - .../spark/mllib/clustering/KMeansModel.scala| 27 ++--- .../org/apache/spark/mllib/clustering/LDA.scala | 56 +- .../spark/mllib/clustering/LDAModel.scala | 69 +--- .../spark/mllib/clustering/LDAOptimizer.scala | 24 ++--- .../clustering/PowerIterationClustering.scala | 38 +++ .../mllib/clustering/StreamingKMeans.scala | 35 +++--- .../BinaryClassificationMetrics.scala | 26 ++--- .../mllib/evaluation/MulticlassMetrics.scala| 20 ++-- .../mllib/evaluation/MultilabelMetrics.scala| 9 +- .../spark/mllib/evaluation/RankingMetrics.scala | 10 +- .../mllib/evaluation/RegressionMetrics.scala| 14 +-- .../spark/mllib/fpm/AssociationRules.scala | 20 ++-- .../org/apache/spark/mllib/fpm/FPGrowth.scala | 22 ++-- .../apache/spark/mllib/linalg/Matrices.scala| 106 --- .../linalg/SingularValueDecomposition.scala | 4 +- .../org/apache/spark/mllib/linalg/Vectors.scala | 90 +--- .../mllib/linalg/distributed/BlockMatrix.scala | 88 +++ .../linalg/distributed/CoordinateMatrix.scala | 40 +++ .../linalg/distributed/DistributedMatrix.scala | 4 +- .../linalg/distributed/IndexedRowMatrix.scala | 38 +++ .../mllib/linalg/distributed/RowMatrix.scala| 39 +++ .../apache/spark/mllib/recommendation/ALS.scala | 22 ++-- .../MatrixFactorizationModel.scala | 28 +++-- .../regression/GeneralizedLinearAlgorithm.scala | 24 ++--- .../mllib/regression/IsotonicRegression.scala | 22 ++-- .../spark/mllib/regression/LabeledPoint.scala | 7 +- .../apache/spark/mllib/regression/Lasso.scala | 25 ++--- .../mllib/regression/LinearRegression.scala | 25 ++--- .../mllib/regression/RegressionModel.scala | 12 +-- .../mllib/regression/RidgeRegression.scala | 25 ++--- .../regression/StreamingLinearAlgorithm.scala | 18 ++-- .../apache/spark/mllib/stat/KernelDensity.scala | 12 +-- .../stat/MultivariateOnlineSummarizer.scala | 24 ++--- .../stat/MultivariateStatisticalSummary.scala | 19 ++-- .../apache/spark/mllib/stat/Statistics.scala| 30 +++--- .../distribution/MultivariateGaussian.scala | 8 +- .../apache/spark/mllib/tree/DecisionTree.scala | 28 +++-- .../spark/mllib/tree/GradientBoostedTrees.scala | 20 ++-- .../apache/spark/mllib/tree/RandomForest.scala | 20 ++-- .../spark/mllib/tree/configuration/Algo.scala | 4 +- .../tree/configuration/BoostingStrategy.scala | 12 +-- .../mllib/tree/configuration/FeatureType.scala | 4 +- .../tree/configuration/QuantileStrategy.scala | 4 +- .../mllib/tree/configuration/Strategy.scala | 24 ++--- .../spark/mllib/tree/impurity/Entropy.scala | 10 +- 
.../apache/spark/mllib/tree/impurity/Gini.scala | 10 +- .../spark/mllib/tree/impurity/Impurity.scala| 8 +- .../spark/mllib/tree/impurity/Variance.scala| 10 +- .../spark/mllib/tree/loss/AbsoluteError.scala | 6 +- .../apache/spark/mllib/tree/loss/LogLoss.scala | 6 +- .../org/apache/spark/mllib/tree/loss/Loss.scala | 8 +- .../apache/spark/mllib/tree/loss/Losses.scala | 10 +- .../spark/mllib/tree/loss/SquaredError.scala| 6 +- .../mllib/tree/model/DecisionTreeModel.scala| 22 ++-- .../mllib/tree/model/InformationGainStats.scala | 4 +- .../apache/spark/mllib/tree/model/Node.scala| 8 +- .../apache/spark/mllib/tree/model/Predict.scala | 4 +- .../apache/spark/mllib/tree/model/Split.scala | 4 +- .../mllib/tree/model/treeEnsembleModels.scala | 26 +++-- .../org/apache/spark/mllib/tree/package.scala | 1 - .../org/apache/spark/mllib/util/MLUtils.scala | 36 +++ 68 files changed, 692 insertions(+), 862 deletions(-)
[2/3] spark git commit: Version update for Spark 1.5.0 and add CHANGES.txt file.
http://git-wip-us.apache.org/repos/asf/spark/blob/f65759e3/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt new file mode 100644 index 000..95f80d8 --- /dev/null +++ b/CHANGES.txt @@ -0,0 +1,24649 @@ +Spark Change Log + + +Release 1.5.0 + + [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the min split size if necessary. + Yin Huai yh...@databricks.com + 2015-08-21 14:30:00 -0700 + Commit: 14c8c0c, github.com/apache/spark/pull/8346 + + [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation + MechCoder manojkumarsivaraj...@gmail.com + 2015-08-21 14:19:24 -0700 + Commit: e7db876, github.com/apache/spark/pull/8352 + + [SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in PySpark-Streaming transform function + jerryshao ss...@hortonworks.com + 2015-08-21 13:10:11 -0700 + Commit: 4e72839, github.com/apache/spark/pull/8347 + + [SPARK-10130] [SQL] type coercion for IF should have children resolved first + Daoyuan Wang daoyuan.w...@intel.com + 2015-08-21 12:21:51 -0700 + Commit: 817c38a, github.com/apache/spark/pull/8331 + + [SPARK-9846] [DOCS] User guide for Multilayer Perceptron Classifier + Alexander Ulanov na...@yandex.ru + 2015-08-20 20:02:27 -0700 + Commit: e5e6017, github.com/apache/spark/pull/8262 + + [SPARK-10140] [DOC] add target fields to @Since + Xiangrui Meng m...@databricks.com + 2015-08-20 20:01:13 -0700 + Commit: 04ef52a, github.com/apache/spark/pull/8344 + + Preparing development version 1.5.1-SNAPSHOT + Patrick Wendell pwend...@gmail.com + 2015-08-20 16:24:12 -0700 + Commit: 988e838 + + Preparing Spark release v1.5.0-rc1 + Patrick Wendell pwend...@gmail.com + 2015-08-20 16:24:07 -0700 + Commit: 4c56ad7 + + Preparing development version 1.5.0-SNAPSHOT + Patrick Wendell pwend...@gmail.com + 2015-08-20 15:33:10 -0700 + Commit: 175c1d9 + + Preparing Spark release v1.5.0-rc1 + Patrick Wendell pwend...@gmail.com + 2015-08-20 15:33:04 -0700 + Commit: d837d51 + + [SPARK-9245] [MLLIB] LDA topic assignments + Joseph K. 
Bradley jos...@databricks.com + 2015-08-20 15:01:31 -0700 + Commit: 2beea65, github.com/apache/spark/pull/8329 + + [SPARK-10108] Add since tags to mllib.feature + MechCoder manojkumarsivaraj...@gmail.com + 2015-08-20 14:56:08 -0700 + Commit: 560ec12, github.com/apache/spark/pull/8309 + + [SPARK-10138] [ML] move setters to MultilayerPerceptronClassifier and add Java test suite + Xiangrui Meng m...@databricks.com + 2015-08-20 14:47:04 -0700 + Commit: 2e0d2a9, github.com/apache/spark/pull/8342 + + Preparing development version 1.5.0-SNAPSHOT + Patrick Wendell pwend...@gmail.com + 2015-08-20 12:43:13 -0700 + Commit: eac31ab + + Preparing Spark release v1.5.0-rc1 + Patrick Wendell pwend...@gmail.com + 2015-08-20 12:43:08 -0700 + Commit: 99eeac8 + + [SPARK-10126] [PROJECT INFRA] Fix typo in release-build.sh which broke snapshot publishing for Scala 2.11 + Josh Rosen joshro...@databricks.com + 2015-08-20 11:31:03 -0700 + Commit: 6026f4f, github.com/apache/spark/pull/8325 + + Preparing development version 1.5.0-SNAPSHOT + Patrick Wendell pwend...@gmail.com + 2015-08-20 11:06:41 -0700 + Commit: a1785e3 + + Preparing Spark release v1.5.0-rc1 + Patrick Wendell pwend...@gmail.com + 2015-08-20 11:06:31 -0700 + Commit: 19b92c8 + + [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array + Cheng Lian l...@databricks.com + 2015-08-20 11:00:24 -0700 + Commit: 2f47e09, github.com/apache/spark/pull/8341 + + [SPARK-9982] [SPARKR] SparkR DataFrame fail to return data of Decimal type + Alex Shkurenko ashkure...@enova.com + 2015-08-20 10:16:38 -0700 + Commit: a7027e6, github.com/apache/spark/pull/8239 + + [MINOR] [SQL] Fix sphinx warnings in PySpark SQL + MechCoder manojkumarsivaraj...@gmail.com + 2015-08-20 10:05:31 -0700 + Commit: 257e9d7, github.com/apache/spark/pull/8171 + + [SPARK-10100] [SQL] Eliminate hash table lookup if there is no grouping key in aggregation. + Reynold Xin r...@databricks.com + 2015-08-20 07:53:27 -0700 + Commit: 5be5175, github.com/apache/spark/pull/8332 + + [SPARK-10092] [SQL] Backports #8324 to branch-1.5 + Yin Huai yh...@databricks.com + 2015-08-20 18:43:24 +0800 + Commit: 675e224, github.com/apache/spark/pull/8336 + + [SPARK-10128] [STREAMING] Used correct classloader to deserialize WAL data + Tathagata Das tathagata.das1...@gmail.com + 2015-08-19 21:15:58 -0700 + Commit: 71aa547, github.com/apache/spark/pull/8328 + + [SPARK-10125] [STREAMING] Fix a potential deadlock in JobGenerator.stop + zsxwing zsxw...@gmail.com + 2015-08-19 19:43:09 -0700 + Commit: 63922fa, github.com/apache/spark/pull/8326 + + [SPARK-10124] [MESOS] Fix removing queued driver in mesos cluster mode. + Timothy Chen tnac...@gmail.com + 2015-08-19 19:43:26 -0700 + Commit: a3ed2c3, github.com/apache/spark/pull/8322 + + [SPARK-9812]
[1/3] spark git commit: Version update for Spark 1.5.0 and add CHANGES.txt file.
Repository: spark Updated Branches: refs/heads/branch-1.5 14c8c0c0d - f65759e3a http://git-wip-us.apache.org/repos/asf/spark/blob/f65759e3/core/src/main/scala/org/apache/spark/package.scala -- diff --git a/core/src/main/scala/org/apache/spark/package.scala b/core/src/main/scala/org/apache/spark/package.scala index 8ae76c5..5e3b75b 100644 --- a/core/src/main/scala/org/apache/spark/package.scala +++ b/core/src/main/scala/org/apache/spark/package.scala @@ -42,6 +42,5 @@ package org.apache */ package object spark { - // For package docs only - val SPARK_VERSION = 1.5.0-SNAPSHOT + val SPARK_VERSION = 1.5.0 } http://git-wip-us.apache.org/repos/asf/spark/blob/f65759e3/dev/create-release/generate-changelist.py -- diff --git a/dev/create-release/generate-changelist.py b/dev/create-release/generate-changelist.py index 2e1a35a..37e5651 100755 --- a/dev/create-release/generate-changelist.py +++ b/dev/create-release/generate-changelist.py @@ -31,8 +31,8 @@ import time import traceback SPARK_HOME = os.environ[SPARK_HOME] -NEW_RELEASE_VERSION = 1.0.0 -PREV_RELEASE_GIT_TAG = v0.9.1 +NEW_RELEASE_VERSION = 1.5.0 +PREV_RELEASE_GIT_TAG = v1.4.0 CHANGELIST = CHANGES.txt OLD_CHANGELIST = %s.old % (CHANGELIST) http://git-wip-us.apache.org/repos/asf/spark/blob/f65759e3/docs/_config.yml -- diff --git a/docs/_config.yml b/docs/_config.yml index c0e031a..e3a447e 100644 --- a/docs/_config.yml +++ b/docs/_config.yml @@ -14,7 +14,7 @@ include: # These allow the documentation to be updated with newer releases # of Spark, Scala, and Mesos. -SPARK_VERSION: 1.5.0-SNAPSHOT +SPARK_VERSION: 1.5.0 SPARK_VERSION_SHORT: 1.5.0 SCALA_BINARY_VERSION: 2.10 SCALA_VERSION: 2.10.4 http://git-wip-us.apache.org/repos/asf/spark/blob/f65759e3/ec2/spark_ec2.py -- diff --git a/ec2/spark_ec2.py b/ec2/spark_ec2.py index 11fd7ee..ccc897f 100755 --- a/ec2/spark_ec2.py +++ b/ec2/spark_ec2.py @@ -51,7 +51,7 @@ else: raw_input = input xrange = range -SPARK_EC2_VERSION = 1.4.0 +SPARK_EC2_VERSION = 1.5.0 SPARK_EC2_DIR = os.path.dirname(os.path.realpath(__file__)) VALID_SPARK_VERSIONS = set([ @@ -71,6 +71,7 @@ VALID_SPARK_VERSIONS = set([ 1.3.0, 1.3.1, 1.4.0, +1.5.0 ]) SPARK_TACHYON_MAP = { @@ -84,6 +85,7 @@ SPARK_TACHYON_MAP = { 1.3.0: 0.5.0, 1.3.1: 0.5.0, 1.4.0: 0.6.4, +1.5.0: 0.7.1 } DEFAULT_SPARK_VERSION = SPARK_EC2_VERSION @@ -91,7 +93,7 @@ DEFAULT_SPARK_GITHUB_REPO = https://github.com/apache/spark; # Default location to get the spark-ec2 scripts (and ami-list) from DEFAULT_SPARK_EC2_GITHUB_REPO = https://github.com/amplab/spark-ec2; -DEFAULT_SPARK_EC2_BRANCH = branch-1.4 +DEFAULT_SPARK_EC2_BRANCH = branch-1.5 def setup_external_libs(libs): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[3/3] spark git commit: Version update for Spark 1.5.0 and add CHANGES.txt file.
Version update for Spark 1.5.0 and add CHANGES.txt file. Author: Reynold Xin r...@databricks.com Closes #8365 from rxin/1.5-update. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f65759e3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f65759e3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f65759e3 Branch: refs/heads/branch-1.5 Commit: f65759e3aa947e96c9db7e976a8f8018979e065d Parents: 14c8c0c Author: Reynold Xin r...@databricks.com Authored: Fri Aug 21 14:54:45 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Fri Aug 21 14:54:45 2015 -0700 -- CHANGES.txt | 24649 + .../main/scala/org/apache/spark/package.scala | 3 +- dev/create-release/generate-changelist.py | 4 +- docs/_config.yml| 2 +- ec2/spark_ec2.py| 6 +- 5 files changed, 24657 insertions(+), 7 deletions(-) -- - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the min split size if necessary.
Repository: spark Updated Branches: refs/heads/master f5b028ed2 -> e3355090d [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the min split size if necessary. https://issues.apache.org/jira/browse/SPARK-10143 With this PR, we will set the min split size to parquet's block size (row group size) set in the conf if the min split size is smaller. So we can avoid having too many tasks, and even useless tasks, for reading parquet data. I tested it locally. The table I have has 343MB and it is in my local FS. Because I did not set any min/max split size, the default split size was 32MB and the map stage had 11 tasks, but there were only three tasks that actually read data. With my PR, there were only three tasks in the map stage. Here is the difference. Without this PR: ![image](https://cloud.githubusercontent.com/assets/2072857/9399179/8587dba6-4765-11e5-9189-7ebba52a2b6d.png) With this PR: ![image](https://cloud.githubusercontent.com/assets/2072857/9399185/a4735d74-4765-11e5-8848-1f1e361a6b4b.png) Even if the block size setting does match the actual block size of the parquet file, I think it is still generally good to use parquet's block size setting if the min split size is smaller than this block size. Tested it on a cluster using ``` val count = sqlContext.table("store_sales").groupBy().count().queryExecution.executedPlan(3).execute().count ``` Basically, it reads zero columns of table `store_sales`. My table has 1824 parquet files with sizes from 80MB to 280MB (1 to 3 row group sizes). Without this patch, in a 16-worker cluster, the job had 5023 tasks and spent 102s. With this patch, the job had 2893 tasks and spent 64s. It is still not as good as using one mapper per file (1824 tasks and 42s), but it is much better than our master. Author: Yin Huai yh...@databricks.com Closes #8346 from yhuai/parquetMinSplit.
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e3355090 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e3355090 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e3355090 Branch: refs/heads/master Commit: e3355090d4030daffed5efb0959bf1d724c13c13 Parents: f5b028e Author: Yin Huai yh...@databricks.com Authored: Fri Aug 21 14:30:00 2015 -0700 Committer: Yin Huai yh...@databricks.com Committed: Fri Aug 21 14:30:00 2015 -0700 -- .../datasources/parquet/ParquetRelation.scala | 41 +++- 1 file changed, 39 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e3355090/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala index 68169d4..bbf682a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala @@ -26,6 +26,7 @@ import scala.collection.mutable import scala.util.{Failure, Try} import com.google.common.base.Objects +import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileStatus, Path} import org.apache.hadoop.io.Writable import org.apache.hadoop.mapreduce._ @@ -281,12 +282,18 @@ private[sql] class ParquetRelation( val assumeInt96IsTimestamp = sqlContext.conf.isParquetINT96AsTimestamp val followParquetFormatSpec = sqlContext.conf.followParquetFormatSpec +// Parquet row group size. We will use this value as the value for +// mapreduce.input.fileinputformat.split.minsize and mapred.min.split.size if the value +// of these flags are smaller than the parquet row group size. +val parquetBlockSize = ParquetOutputFormat.getLongBlockSize(broadcastedConf.value.value) + // Create the function to set variable Parquet confs at both driver and executor side. val initLocalJobFuncOpt = ParquetRelation.initializeLocalJobFunc( requiredColumns, filters, dataSchema, +parquetBlockSize, useMetadataCache, parquetFilterPushDown, assumeBinaryIsString, @@ -294,7 +301,8 @@ private[sql] class ParquetRelation( followParquetFormatSpec) _ // Create the function to set input paths at the driver side. -val setInputPaths = ParquetRelation.initializeDriverSideJobFunc(inputFiles) _ +val setInputPaths = + ParquetRelation.initializeDriverSideJobFunc(inputFiles, parquetBlockSize) _ Utils.withDummyCallSite(sqlContext.sparkContext) { new SqlNewHadoopRDD( @@ -482,11 +490,35 @@ private[sql]
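The mechanism described above amounts to raising Hadoop's minimum split size to Parquet's row-group size whenever the configured minimum is smaller, so no task is scheduled for a fraction of a row group it cannot read independently. A rough standalone sketch of that adjustment follows; it is not the patch itself (which threads the value through `initializeLocalJobFunc`/`initializeDriverSideJobFunc`), and it assumes parquet-mr 1.7+'s `org.apache.parquet.hadoop.ParquetOutputFormat` and a Hadoop `Configuration` on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.ParquetOutputFormat

// Raise the min split size to the Parquet row-group (block) size if it is smaller,
// covering both the mapreduce key and the legacy mapred key, as the patch comment notes.
def alignMinSplitSizeToRowGroup(conf: Configuration): Unit = {
  val rowGroupSize = ParquetOutputFormat.getLongBlockSize(conf) // parquet.block.size
  Seq("mapreduce.input.fileinputformat.split.minsize", "mapred.min.split.size").foreach { key =>
    if (conf.getLong(key, 0L) < rowGroupSize) {
      conf.setLong(key, rowGroupSize)
    }
  }
}
```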
spark git commit: [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the min split size if necessary.
Repository: spark Updated Branches: refs/heads/branch-1.5 e7db8761b - 14c8c0c0d [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the min split size if necessary. https://issues.apache.org/jira/browse/SPARK-10143 With this PR, we will set min split size to parquet's block size (row group size) set in the conf if the min split size is smaller. So, we can avoid have too many tasks and even useless tasks for reading parquet data. I tested it locally. The table I have has 343MB and it is in my local FS. Because I did not set any min/max split size, the default split size was 32MB and the map stage had 11 tasks. But there were only three tasks that actually read data. With my PR, there were only three tasks in the map stage. Here is the difference. Without this PR: ![image](https://cloud.githubusercontent.com/assets/2072857/9399179/8587dba6-4765-11e5-9189-7ebba52a2b6d.png) With this PR: ![image](https://cloud.githubusercontent.com/assets/2072857/9399185/a4735d74-4765-11e5-8848-1f1e361a6b4b.png) Even if the block size setting does match the actual block size of parquet file, I think it is still generally good to use parquet's block size setting if min split size is smaller than this block size. Tested it on a cluster using ``` val count = sqlContext.table(store_sales).groupBy().count().queryExecution.executedPlan(3).execute().count ``` Basically, it reads 0 column of table `store_sales`. My table has 1824 parquet files with size from 80MB to 280MB (1 to 3 row group sizes). Without this patch, in a 16 worker cluster, the job had 5023 tasks and spent 102s. With this patch, the job had 2893 tasks and spent 64s. It is still not as good as using one mapper per file (1824 tasks and 42s), but it is much better than our master. Author: Yin Huai yh...@databricks.com Closes #8346 from yhuai/parquetMinSplit. 
(cherry picked from commit e3355090d4030daffed5efb0959bf1d724c13c13) Signed-off-by: Yin Huai yh...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/14c8c0c0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/14c8c0c0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/14c8c0c0 Branch: refs/heads/branch-1.5 Commit: 14c8c0c0da1184c587f0d5ab60f1d56feaa588e4 Parents: e7db876 Author: Yin Huai yh...@databricks.com Authored: Fri Aug 21 14:30:00 2015 -0700 Committer: Yin Huai yh...@databricks.com Committed: Fri Aug 21 14:30:12 2015 -0700 -- .../datasources/parquet/ParquetRelation.scala | 41 +++- 1 file changed, 39 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/14c8c0c0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala index 68169d4..bbf682a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala @@ -26,6 +26,7 @@ import scala.collection.mutable import scala.util.{Failure, Try} import com.google.common.base.Objects +import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileStatus, Path} import org.apache.hadoop.io.Writable import org.apache.hadoop.mapreduce._ @@ -281,12 +282,18 @@ private[sql] class ParquetRelation( val assumeInt96IsTimestamp = sqlContext.conf.isParquetINT96AsTimestamp val followParquetFormatSpec = sqlContext.conf.followParquetFormatSpec +// Parquet row group size. We will use this value as the value for +// mapreduce.input.fileinputformat.split.minsize and mapred.min.split.size if the value +// of these flags are smaller than the parquet row group size. +val parquetBlockSize = ParquetOutputFormat.getLongBlockSize(broadcastedConf.value.value) + // Create the function to set variable Parquet confs at both driver and executor side. val initLocalJobFuncOpt = ParquetRelation.initializeLocalJobFunc( requiredColumns, filters, dataSchema, +parquetBlockSize, useMetadataCache, parquetFilterPushDown, assumeBinaryIsString, @@ -294,7 +301,8 @@ private[sql] class ParquetRelation( followParquetFormatSpec) _ // Create the function to set input paths at the driver side. -val setInputPaths = ParquetRelation.initializeDriverSideJobFunc(inputFiles) _ +val setInputPaths = + ParquetRelation.initializeDriverSideJobFunc(inputFiles, parquetBlockSize) _
[1/2] spark git commit: Preparing Spark release v1.5.0-rc2
Repository: spark Updated Branches: refs/heads/branch-1.5 f65759e3a - 914da3593 Preparing Spark release v1.5.0-rc2 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e2569282 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e2569282 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e2569282 Branch: refs/heads/branch-1.5 Commit: e2569282a80570b25959e968898a53d5fad67bbb Parents: f65759e Author: Patrick Wendell pwend...@gmail.com Authored: Fri Aug 21 14:56:43 2015 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Fri Aug 21 14:56:43 2015 -0700 -- assembly/pom.xml| 2 +- bagel/pom.xml | 2 +- core/pom.xml| 2 +- examples/pom.xml| 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt-assembly/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml| 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml| 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml| 2 +- pom.xml | 2 +- repl/pom.xml| 2 +- sql/catalyst/pom.xml| 2 +- sql/core/pom.xml| 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml| 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml| 2 +- 33 files changed, 33 insertions(+), 33 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e2569282/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 7b41ebb..3ef7d6f 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent_2.10/artifactId -version1.5.1-SNAPSHOT/version +version1.5.0/version relativePath../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/e2569282/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 16bf17c..684e07b 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent_2.10/artifactId -version1.5.1-SNAPSHOT/version +version1.5.0/version relativePath../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/e2569282/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index beb547f..6b082ad 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent_2.10/artifactId -version1.5.1-SNAPSHOT/version +version1.5.0/version relativePath../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/e2569282/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index 3926b79..9ef1eda 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent_2.10/artifactId -version1.5.1-SNAPSHOT/version +version1.5.0/version relativePath../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/e2569282/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index bdd5037..fe878e6 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent_2.10/artifactId 
-version1.5.1-SNAPSHOT/version +version1.5.0/version relativePath../../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/e2569282/external/flume-sink/pom.xml -- diff --git
Git Push Summary
Repository: spark Updated Tags: refs/tags/v1.5.0-rc2 [created] e2569282a
[2/2] spark git commit: Preparing development version 1.5.0-SNAPSHOT
Preparing development version 1.5.0-SNAPSHOT Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/914da359 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/914da359 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/914da359 Branch: refs/heads/branch-1.5 Commit: 914da3593e288543b7666423d6612cb1da792668 Parents: e256928 Author: Patrick Wendell pwend...@gmail.com Authored: Fri Aug 21 14:56:50 2015 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Fri Aug 21 14:56:50 2015 -0700 -- assembly/pom.xml| 2 +- bagel/pom.xml | 2 +- core/pom.xml| 2 +- examples/pom.xml| 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt-assembly/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml| 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml| 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml| 2 +- pom.xml | 2 +- repl/pom.xml| 2 +- sql/catalyst/pom.xml| 2 +- sql/core/pom.xml| 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml| 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml| 2 +- 33 files changed, 33 insertions(+), 33 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/914da359/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 3ef7d6f..e9c6d26 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent_2.10/artifactId -version1.5.0/version +version1.5.0-SNAPSHOT/version relativePath../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/914da359/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 684e07b..ed5c37e 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent_2.10/artifactId -version1.5.0/version +version1.5.0-SNAPSHOT/version relativePath../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/914da359/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 6b082ad..4f79d71 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent_2.10/artifactId -version1.5.0/version +version1.5.0-SNAPSHOT/version relativePath../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/914da359/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index 9ef1eda..e6884b0 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent_2.10/artifactId -version1.5.0/version +version1.5.0-SNAPSHOT/version relativePath../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/914da359/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index fe878e6..e05e431 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ parent groupIdorg.apache.spark/groupId artifactIdspark-parent_2.10/artifactId -version1.5.0/version +version1.5.0-SNAPSHOT/version 
relativePath../../pom.xml/relativePath /parent http://git-wip-us.apache.org/repos/asf/spark/blob/914da359/external/flume-sink/pom.xml -- diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml index
[1/3] spark git commit: [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation
Repository: spark Updated Branches: refs/heads/branch-1.5 4e72839b7 - e7db8761b http://git-wip-us.apache.org/repos/asf/spark/blob/e7db8761/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala b/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala index a2ab95c..cd3ed8a 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala @@ -20,7 +20,7 @@ package org.apache.spark.mllib.regression import scala.reflect.ClassTag import org.apache.spark.Logging -import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.annotation.{DeveloperApi, Since} import org.apache.spark.api.java.JavaSparkContext.fakeClassTag import org.apache.spark.mllib.linalg.{Vector, Vectors} import org.apache.spark.streaming.api.java.{JavaDStream, JavaPairDStream} @@ -54,8 +54,8 @@ import org.apache.spark.streaming.dstream.DStream * the model using each of the different sources, in sequence. * * - * @since 1.1.0 */ +@Since(1.1.0) @DeveloperApi abstract class StreamingLinearAlgorithm[ M : GeneralizedLinearModel, @@ -70,8 +70,8 @@ abstract class StreamingLinearAlgorithm[ /** * Return the latest model. * - * @since 1.1.0 */ + @Since(1.1.0) def latestModel(): M = { model.get } @@ -84,8 +84,8 @@ abstract class StreamingLinearAlgorithm[ * * @param data DStream containing labeled data * - * @since 1.3.0 */ + @Since(1.3.0) def trainOn(data: DStream[LabeledPoint]): Unit = { if (model.isEmpty) { throw new IllegalArgumentException(Model must be initialized before starting training.) @@ -106,8 +106,8 @@ abstract class StreamingLinearAlgorithm[ /** * Java-friendly version of `trainOn`. * - * @since 1.3.0 */ + @Since(1.3.0) def trainOn(data: JavaDStream[LabeledPoint]): Unit = trainOn(data.dstream) /** @@ -116,8 +116,8 @@ abstract class StreamingLinearAlgorithm[ * @param data DStream containing feature vectors * @return DStream containing predictions * - * @since 1.1.0 */ + @Since(1.1.0) def predictOn(data: DStream[Vector]): DStream[Double] = { if (model.isEmpty) { throw new IllegalArgumentException(Model must be initialized before starting prediction.) @@ -128,8 +128,8 @@ abstract class StreamingLinearAlgorithm[ /** * Java-friendly version of `predictOn`. * - * @since 1.1.0 */ + @Since(1.1.0) def predictOn(data: JavaDStream[Vector]): JavaDStream[java.lang.Double] = { JavaDStream.fromDStream(predictOn(data.dstream).asInstanceOf[DStream[java.lang.Double]]) } @@ -140,8 +140,8 @@ abstract class StreamingLinearAlgorithm[ * @tparam K key type * @return DStream containing the input keys and the predictions as values * - * @since 1.1.0 */ + @Since(1.1.0) def predictOnValues[K: ClassTag](data: DStream[(K, Vector)]): DStream[(K, Double)] = { if (model.isEmpty) { throw new IllegalArgumentException(Model must be initialized before starting prediction) @@ -153,8 +153,8 @@ abstract class StreamingLinearAlgorithm[ /** * Java-friendly version of `predictOnValues`. 
* - * @since 1.3.0 */ + @Since(1.3.0) def predictOnValues[K](data: JavaPairDStream[K, Vector]): JavaPairDStream[K, java.lang.Double] = { implicit val tag = fakeClassTag[K] JavaPairDStream.fromPairDStream( http://git-wip-us.apache.org/repos/asf/spark/blob/e7db8761/mllib/src/main/scala/org/apache/spark/mllib/stat/KernelDensity.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/stat/KernelDensity.scala b/mllib/src/main/scala/org/apache/spark/mllib/stat/KernelDensity.scala index 93a6753..4a856f7 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/stat/KernelDensity.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/stat/KernelDensity.scala @@ -19,7 +19,7 @@ package org.apache.spark.mllib.stat import com.github.fommil.netlib.BLAS.{getInstance = blas} -import org.apache.spark.annotation.Experimental +import org.apache.spark.annotation.{Experimental, Since} import org.apache.spark.api.java.JavaRDD import org.apache.spark.rdd.RDD @@ -37,8 +37,8 @@ import org.apache.spark.rdd.RDD * .setBandwidth(3.0) * val densities = kd.estimate(Array(-1.0, 2.0, 5.0)) * }}} - * @since 1.4.0 */ +@Since(1.4.0) @Experimental class KernelDensity extends Serializable { @@ -52,8 +52,8 @@ class KernelDensity extends Serializable { /** * Sets the bandwidth (standard deviation) of the Gaussian kernel (default: `1.0`). - * @since 1.4.0 */ + @Since(1.4.0) def setBandwidth(bandwidth:
[2/3] spark git commit: [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation
http://git-wip-us.apache.org/repos/asf/spark/blob/f5b028ed/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala index 7f4de77..ba3b447 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala @@ -20,7 +20,7 @@ import scala.collection.JavaConverters._ import scala.reflect.ClassTag import org.apache.spark.Logging -import org.apache.spark.annotation.Experimental +import org.apache.spark.annotation.{Experimental, Since} import org.apache.spark.api.java.JavaRDD import org.apache.spark.api.java.JavaSparkContext.fakeClassTag import org.apache.spark.mllib.fpm.AssociationRules.Rule @@ -33,24 +33,22 @@ import org.apache.spark.rdd.RDD * Generates association rules from a [[RDD[FreqItemset[Item]]]. This method only generates * association rules which have a single item as the consequent. * - * @since 1.5.0 */ +@Since(1.5.0) @Experimental class AssociationRules private[fpm] ( private var minConfidence: Double) extends Logging with Serializable { /** * Constructs a default instance with default parameters {minConfidence = 0.8}. - * - * @since 1.5.0 */ + @Since(1.5.0) def this() = this(0.8) /** * Sets the minimal confidence (default: `0.8`). - * - * @since 1.5.0 */ + @Since(1.5.0) def setMinConfidence(minConfidence: Double): this.type = { require(minConfidence = 0.0 minConfidence = 1.0) this.minConfidence = minConfidence @@ -62,8 +60,8 @@ class AssociationRules private[fpm] ( * @param freqItemsets frequent itemset model obtained from [[FPGrowth]] * @return a [[Set[Rule[Item]]] containing the assocation rules. * - * @since 1.5.0 */ + @Since(1.5.0) def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]]): RDD[Rule[Item]] = { // For candidate rule X = Y, generate (X, (Y, freq(X union Y))) val candidates = freqItemsets.flatMap { itemset = @@ -102,8 +100,8 @@ object AssociationRules { * instead. * @tparam Item item type * - * @since 1.5.0 */ + @Since(1.5.0) @Experimental class Rule[Item] private[fpm] ( val antecedent: Array[Item], @@ -114,8 +112,8 @@ object AssociationRules { /** * Returns the confidence of the rule. * - * @since 1.5.0 */ +@Since(1.5.0) def confidence: Double = freqUnion.toDouble / freqAntecedent require(antecedent.toSet.intersect(consequent.toSet).isEmpty, { @@ -127,8 +125,8 @@ object AssociationRules { /** * Returns antecedent in a Java List. * - * @since 1.5.0 */ +@Since(1.5.0) def javaAntecedent: java.util.List[Item] = { antecedent.toList.asJava } @@ -136,8 +134,8 @@ object AssociationRules { /** * Returns consequent in a Java List. 
* - * @since 1.5.0 */ +@Since(1.5.0) def javaConsequent: java.util.List[Item] = { consequent.toList.asJava } http://git-wip-us.apache.org/repos/asf/spark/blob/f5b028ed/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala index e2370a5..e37f806 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala @@ -25,7 +25,7 @@ import scala.collection.JavaConverters._ import scala.reflect.ClassTag import org.apache.spark.{HashPartitioner, Logging, Partitioner, SparkException} -import org.apache.spark.annotation.Experimental +import org.apache.spark.annotation.{Experimental, Since} import org.apache.spark.api.java.JavaRDD import org.apache.spark.api.java.JavaSparkContext.fakeClassTag import org.apache.spark.mllib.fpm.FPGrowth._ @@ -39,15 +39,15 @@ import org.apache.spark.storage.StorageLevel * @param freqItemsets frequent itemset, which is an RDD of [[FreqItemset]] * @tparam Item item type * - * @since 1.3.0 */ +@Since(1.3.0) @Experimental class FPGrowthModel[Item: ClassTag](val freqItemsets: RDD[FreqItemset[Item]]) extends Serializable { /** * Generates association rules for the [[Item]]s in [[freqItemsets]]. * @param confidence minimal confidence of the rules produced - * @since 1.5.0 */ + @Since(1.5.0) def generateAssociationRules(confidence: Double): RDD[AssociationRules.Rule[Item]] = { val associationRules = new AssociationRules(confidence) associationRules.run(freqItemsets) @@ -71,8 +71,8 @@ class FPGrowthModel[Item: ClassTag](val freqItemsets: RDD[FreqItemset[Item]]) ex * @see
spark git commit: [SPARK-9893] User guide with Java test suite for VectorSlicer
Repository: spark Updated Branches: refs/heads/master f01c4220d - 630a994e6 [SPARK-9893] User guide with Java test suite for VectorSlicer Add user guide for `VectorSlicer`, with Java test suite and Python version VectorSlicer. Note that Python version does not support selecting by names now. Author: Xusen Yin yinxu...@gmail.com Closes #8267 from yinxusen/SPARK-9893. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/630a994e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/630a994e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/630a994e Branch: refs/heads/master Commit: 630a994e6a9785d1704f8e7fb604f32f5dea24f8 Parents: f01c422 Author: Xusen Yin yinxu...@gmail.com Authored: Fri Aug 21 16:30:12 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Aug 21 16:30:12 2015 -0700 -- docs/ml-features.md | 133 +++ .../spark/ml/feature/JavaVectorSlicerSuite.java | 85 2 files changed, 218 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/630a994e/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 6309db9..642a4b4 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1477,6 +1477,139 @@ print(output.select(features, clicked).first()) /div /div +# Feature Selectors + +## VectorSlicer + +`VectorSlicer` is a transformer that takes a feature vector and outputs a new feature vector with a +sub-array of the original features. It is useful for extracting features from a vector column. + +`VectorSlicer` accepts a vector column with a specified indices, then outputs a new vector column +whose values are selected via those indices. There are two types of indices, + + 1. Integer indices that represents the indices into the vector, `setIndices()`; + + 2. String indices that represents the names of features into the vector, `setNames()`. + *This requires the vector column to have an `AttributeGroup` since the implementation matches on + the name field of an `Attribute`.* + +Specification by integer and string are both acceptable. Moreover, you can use integer index and +string name simultaneously. At least one feature must be selected. Duplicate features are not +allowed, so there can be no overlap between selected indices and names. Note that if names of +features are selected, an exception will be threw out when encountering with empty input attributes. + +The output vector will order features with the selected indices first (in the order given), +followed by the selected names (in the order given). + +**Examples** + +Suppose that we have a DataFrame with the column `userFeatures`: + +~~~ + userFeatures +-- + [0.0, 10.0, 0.5] +~~~ + +`userFeatures` is a vector column that contains three user features. Assuming that the first column +of `userFeatures` are all zeros, so we want to remove it and only the last two columns are selected. +The `VectorSlicer` selects the last two elements with `setIndices(1, 2)` then produces a new vector +column named `features`: + +~~~ + userFeatures | features +--|- + [0.0, 10.0, 0.5] | [10.0, 0.5] +~~~ + +Suppose also that we have a potential input attributes for the `userFeatures`, i.e. +`[f1, f2, f3]`, then we can use `setNames(f2, f3)` to select them. 
+ +~~~ + userFeatures | features +--|- + [0.0, 10.0, 0.5] | [10.0, 0.5] + ["f1", "f2", "f3"] | ["f2", "f3"] +~~~ + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> + +[`VectorSlicer`](api/scala/index.html#org.apache.spark.ml.feature.VectorSlicer) takes an input +column name with specified indices or names and an output column name. + +{% highlight scala %} +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute} +import org.apache.spark.ml.feature.VectorSlicer +import org.apache.spark.sql.types.StructType +import org.apache.spark.sql.{DataFrame, Row, SQLContext} + +val data = Array( + Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))), + Vectors.dense(-2.0, 2.3, 0.0) +) + +val defaultAttr = NumericAttribute.defaultAttr +val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName) +val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]]) + +val dataRDD = sc.parallelize(data).map(Row.apply) +val dataset = sqlContext.createDataFrame(dataRDD, StructType(Array(attrGroup.toStructField()))) + +val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features") + +slicer.setIndices(Array(1)).setNames(Array("f3")) +// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3")) + +val output =
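The quoted Scala snippet is cut off by the digest at `val output =`; a self-contained sketch of the same VectorSlicer usage (a sketch only, assuming an existing SparkContext `sc` and SQLContext `sqlContext`, as the guide does) could look like this:

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Two rows, each with three features.
val data = Array(
  Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))),
  Vectors.dense(-2.0, 2.3, 0.0)
)

// Name the features f1, f2, f3 so that selection by name is possible.
val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

val dataRDD = sc.parallelize(data).map(v => Row(v))
val dataset = sqlContext.createDataFrame(dataRDD, StructType(Array(attrGroup.toStructField())))

// Select the second feature by index and the third by name.
val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
slicer.setIndices(Array(1)).setNames(Array("f3"))

val output = slicer.transform(dataset)
output.select("userFeatures", "features").show()
```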
spark git commit: [SPARK-10163] [ML] Allow single-category features for GBT models
Repository: spark Updated Branches: refs/heads/master e3355090d - f01c4220d [SPARK-10163] [ML] Allow single-category features for GBT models Removed categorical feature info validation since it is no longer needed. This is needed to make the ML user guide examples work (in another current PR). CC: mengxr Author: Joseph K. Bradley jos...@databricks.com Closes #8367 from jkbradley/gbt-single-cat. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f01c4220 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f01c4220 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f01c4220 Branch: refs/heads/master Commit: f01c4220d2b791f470fa6596ffe11baa51517fbe Parents: e335509 Author: Joseph K. Bradley jos...@databricks.com Authored: Fri Aug 21 16:28:00 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Aug 21 16:28:00 2015 -0700 -- .../org/apache/spark/mllib/tree/configuration/Strategy.scala| 5 - 1 file changed, 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f01c4220/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala index a58f01b..b74e3f1 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala @@ -158,11 +158,6 @@ class Strategy ( s" Valid values are integers >= 0.") require(maxBins >= 2, s"DecisionTree Strategy given invalid maxBins parameter: $maxBins." + s" Valid values are integers >= 2.") -categoricalFeaturesInfo.foreach { case (feature, arity) => - require(arity >= 2, -s"DecisionTree Strategy given invalid categoricalFeaturesInfo setting:" + -s" feature $feature has $arity categories. The number of categories should be >= 2.") -} require(minInstancesPerNode >= 1, s"DecisionTree Strategy requires minInstancesPerNode >= 1 but was given $minInstancesPerNode") require(maxMemoryInMB <= 10240,
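To make the effect of the removed check concrete, here is a hedged sketch (the toy data and parameter values are invented; assumes an existing SparkContext `sc`) of training a GBT model whose second feature is declared categorical with a single category, which the dropped `require(arity >= 2, ...)` used to reject:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// Feature 1 is categorical with arity 1 (a single category), allowed after this change.
val trainingData = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.1, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.9, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.8, 0.0))
))

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 5
// Declare feature index 1 as categorical with a single category.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map(1 -> 1)

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
```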
spark git commit: [SPARK-10163] [ML] Allow single-category features for GBT models
Repository: spark Updated Branches: refs/heads/branch-1.5 914da3593 - cb61c7b4e [SPARK-10163] [ML] Allow single-category features for GBT models Removed categorical feature info validation since it is no longer needed. This is needed to make the ML user guide examples work (in another current PR). CC: mengxr Author: Joseph K. Bradley jos...@databricks.com Closes #8367 from jkbradley/gbt-single-cat. (cherry picked from commit f01c4220d2b791f470fa6596ffe11baa51517fbe) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb61c7b4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb61c7b4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb61c7b4 Branch: refs/heads/branch-1.5 Commit: cb61c7b4e3d06efdbb61795603c71b3a989d6df1 Parents: 914da35 Author: Joseph K. Bradley jos...@databricks.com Authored: Fri Aug 21 16:28:00 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Aug 21 16:28:07 2015 -0700 -- .../org/apache/spark/mllib/tree/configuration/Strategy.scala| 5 - 1 file changed, 5 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cb61c7b4/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala index a58f01b..b74e3f1 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala @@ -158,11 +158,6 @@ class Strategy ( s" Valid values are integers >= 0.") require(maxBins >= 2, s"DecisionTree Strategy given invalid maxBins parameter: $maxBins." + s" Valid values are integers >= 2.") -categoricalFeaturesInfo.foreach { case (feature, arity) => - require(arity >= 2, -s"DecisionTree Strategy given invalid categoricalFeaturesInfo setting:" + -s" feature $feature has $arity categories. The number of categories should be >= 2.") -} require(minInstancesPerNode >= 1, s"DecisionTree Strategy requires minInstancesPerNode >= 1 but was given $minInstancesPerNode") require(maxMemoryInMB <= 10240,
spark git commit: [SPARK-10040] [SQL] Use batch insert for JDBC writing
Repository: spark Updated Branches: refs/heads/master dcfe0c5cd - bb220f657 [SPARK-10040] [SQL] Use batch insert for JDBC writing JIRA: https://issues.apache.org/jira/browse/SPARK-10040 We should use batch inserts instead of single-row inserts in JDBC. Author: Liang-Chi Hsieh vii...@appier.com Closes #8273 from viirya/jdbc-insert-batch. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb220f65 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb220f65 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb220f65 Branch: refs/heads/master Commit: bb220f6570aa0b95598b30524224a3e82c1effbc Parents: dcfe0c5 Author: Liang-Chi Hsieh vii...@appier.com Authored: Fri Aug 21 01:43:49 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Fri Aug 21 01:43:49 2015 -0700 -- .../sql/execution/datasources/jdbc/JdbcUtils.scala | 17 ++--- 1 file changed, 14 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bb220f65/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala index 2d0e736..26788b2 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala @@ -88,13 +88,15 @@ object JdbcUtils extends Logging { table: String, iterator: Iterator[Row], rddSchema: StructType, - nullTypes: Array[Int]): Iterator[Byte] = { + nullTypes: Array[Int], + batchSize: Int): Iterator[Byte] = { val conn = getConnection() var committed = false try { conn.setAutoCommit(false) // Everything in the same db transaction. val stmt = insertStatement(conn, table, rddSchema) try { +var rowCount = 0 while (iterator.hasNext) { val row = iterator.next() val numFields = rddSchema.fields.length @@ -122,7 +124,15 @@ object JdbcUtils extends Logging { } i = i + 1 } - stmt.executeUpdate() + stmt.addBatch() + rowCount += 1 + if (rowCount % batchSize == 0) { +stmt.executeBatch() +rowCount = 0 + } +} +if (rowCount > 0) { + stmt.executeBatch() } } finally { stmt.close() @@ -211,8 +221,9 @@ object JdbcUtils extends Logging { val rddSchema = df.schema val driver: String = DriverRegistry.getDriverClassName(url) val getConnection: () => Connection = JDBCRDD.getConnector(driver, url, properties) +val batchSize = properties.getProperty("batchsize", "1000").toInt df.foreachPartition { iterator => - savePartition(getConnection, table, iterator, rddSchema, nullTypes) + savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize) } }
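As a usage illustration of the new `batchsize` property read from `properties` above, here is a minimal write-path sketch (assuming an existing SQLContext `sqlContext`; the JDBC URL, table name, and credentials are placeholders):

```scala
import java.util.Properties

// Small DataFrame for illustration only.
val df = sqlContext.createDataFrame(Seq((1, "alice"), (2, "bob"))).toDF("id", "name")

val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")
// How many rows are buffered per addBatch()/executeBatch() round trip; defaults to 1000.
props.setProperty("batchsize", "500")

df.write.jdbc("jdbc:postgresql://dbhost:5432/testdb", "people", props)
```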