spark git commit: [SPARK-9439] [YARN] External shuffle service robust to NM restarts using leveldb

2015-08-21 Thread tgraves
Repository: spark
Updated Branches:
  refs/heads/master bb220f657 -> 708036c1d


[SPARK-9439] [YARN] External shuffle service robust to NM restarts using leveldb

https://issues.apache.org/jira/browse/SPARK-9439

In general, YARN apps should be robust to NodeManager restarts.  However, if 
you run Spark with the external shuffle service on, after an NM restart all 
shuffles fail, because the shuffle service has lost some state with info on each 
executor.  (Note the shuffle data is perfectly fine on disk across an NM 
restart; the problem is that we've lost the small bit of state that lets us *find* 
those files.)

The solution proposed here is that the external shuffle service can write out 
its state to leveldb (backed by a local file) every time an executor is added.  
When running with yarn, that file is in the NM's local dir.  Whenever the 
service is started, it looks for that file, and if it exists, it reads the file 
and re-registers all executors recorded there.

Nothing is changed in non-yarn modes with this patch.  The service is not given 
a place to save the state to, so it operates the same as before.  This should 
make it easy to update other cluster managers as well, by just supplying the 
right file and the equivalent of yarn's `initializeApplication` -- I'm not 
familiar enough with those modes to know how to do that.
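
As a rough illustration of the scheme (not the code in this patch, which lives in
the Java `network/shuffle` module), the sketch below assumes one JSON-serialized
LevelDB record per executor, written before registration completes and replayed at
startup; the class name, key prefix, and `ExecutorShuffleInfo` fields are
illustrative only:

```
// Minimal sketch of the recovery scheme: one LevelDB record per executor, written
// before the executor's registration is acknowledged, and replayed on service start.
import java.io.File
import java.nio.charset.StandardCharsets.UTF_8

import scala.collection.mutable.ArrayBuffer

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.fusesource.leveldbjni.JniDBFactory.factory
import org.iq80.leveldb.Options

case class ExecutorShuffleInfo(localDirs: Array[String], subDirsPerLocalDir: Int, shuffleManager: String)

class RecoverableRegistry(dbFile: File) {
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  private val db = factory.open(dbFile, new Options().createIfMissing(true))

  /** Persist the executor's shuffle-file layout before registering it in memory. */
  def saveExecutor(appId: String, execId: String, info: ExecutorShuffleInfo): Unit = {
    val key = s"AppExecShuffleInfo;$appId;$execId".getBytes(UTF_8)
    db.put(key, mapper.writeValueAsBytes(info))
  }

  /** On (re)start, replay every stored record so shuffle files can be located again. */
  def reloadRegisteredExecutors(): Seq[(String, ExecutorShuffleInfo)] = {
    val restored = ArrayBuffer.empty[(String, ExecutorShuffleInfo)]
    val it = db.iterator()
    try {
      it.seekToFirst()
      while (it.hasNext) {
        val entry = it.next()
        val key = new String(entry.getKey, UTF_8)
        if (key.startsWith("AppExecShuffleInfo;")) {
          restored.append((key, mapper.readValue(entry.getValue, classOf[ExecutorShuffleInfo])))
        }
      }
    } finally {
      it.close()
    }
    restored.toSeq
  }

  def close(): Unit = db.close()
}
```

On YARN the db file would sit in an NM local dir so it survives the restart; when no
file is supplied (non-yarn modes today), the service simply skips both steps.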

Author: Imran Rashid iras...@cloudera.com

Closes #7943 from squito/leveldb_external_shuffle_service_NM_restart and 
squashes the following commits:

0d285d3 [Imran Rashid] review feedback
70951d6 [Imran Rashid] Merge branch 'master' into 
leveldb_external_shuffle_service_NM_restart
5c71c8c [Imran Rashid] save executor to db before registering; style
2499c8c [Imran Rashid] explicit dependency on jackson-annotations
795d28f [Imran Rashid] review feedback
81f80e2 [Imran Rashid] Merge branch 'master' into 
leveldb_external_shuffle_service_NM_restart
594d520 [Imran Rashid] use json to serialize application executor info
1a7980b [Imran Rashid] version
8267d2a [Imran Rashid] style
e9f99e8 [Imran Rashid] cleanup the handling of bad dbs a little
9378ba3 [Imran Rashid] fail gracefully on corrupt leveldb files
acedb62 [Imran Rashid] switch to writing out one record per executor
79922b7 [Imran Rashid] rely on yarn to call stopApplication; assorted cleanup
12b6a35 [Imran Rashid] save registered executors when apps are removed; add 
tests
c878fbe [Imran Rashid] better explanation of shuffle service port handling
694934c [Imran Rashid] only open leveldb connection once per service
d596410 [Imran Rashid] store executor data in leveldb
59800b7 [Imran Rashid] Files.move in case renaming is unsupported
32fe5ae [Imran Rashid] Merge branch 'master' into 
external_shuffle_service_NM_restart
d7450f0 [Imran Rashid] style
f729e2b [Imran Rashid] debugging
4492835 [Imran Rashid] lol, dont use a PrintWriter b/c of scalastyle checks
0a39b98 [Imran Rashid] Merge branch 'master' into 
external_shuffle_service_NM_restart
55f49fc [Imran Rashid] make sure the service doesnt die if the registered 
executor file is corrupt; add tests
245db19 [Imran Rashid] style
62586a6 [Imran Rashid] just serialize the whole executors map
bdbbf0d [Imran Rashid] comments, remove some unnecessary changes
857331a [Imran Rashid] better tests & comments
bb9d1e6 [Imran Rashid] formatting
bdc4b32 [Imran Rashid] rename
86e0cb9 [Imran Rashid] for tests, shuffle service finds an open port
23994ff [Imran Rashid] style
7504de8 [Imran Rashid] style
a36729c [Imran Rashid] cleanup
efb6195 [Imran Rashid] proper unit test, and no longer leak if apps stop during 
NM restart
dd93dc0 [Imran Rashid] test for shuffle service w/ NM restarts
d596969 [Imran Rashid] cleanup imports
0e9d69b [Imran Rashid] better names
9eae119 [Imran Rashid] cleanup lots of duplication
1136f44 [Imran Rashid] test needs to have an actual shuffle
0b588bd [Imran Rashid] more fixes ...
ad122ef [Imran Rashid] more fixes
5e5a7c3 [Imran Rashid] fix build
c69f46b [Imran Rashid] maybe working version, needs tests & cleanup ...
bb3ba49 [Imran Rashid] minor cleanup
36127d3 [Imran Rashid] wip
b9d2ced [Imran Rashid] incomplete setup for external shuffle service tests


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/708036c1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/708036c1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/708036c1

Branch: refs/heads/master
Commit: 708036c1de52d674ceff30ac465e1dcedeb8dde8
Parents: bb220f6
Author: Imran Rashid iras...@cloudera.com
Authored: Fri Aug 21 08:41:36 2015 -0500
Committer: Tom Graves tgra...@yahoo-inc.com
Committed: Fri Aug 21 08:41:36 2015 -0500

--
 .../spark/deploy/ExternalShuffleService.scala   |   2 +-
 .../mesos/MesosExternalShuffleService.scala |   2 +-
 .../org/apache/spark/storage/BlockManager.scala |  14 +-
 .../spark/ExternalShuffleServiceSuite.scala |   2 +-
 network/shuffle/pom.xml 

spark git commit: [SPARK-10130] [SQL] type coercion for IF should have children resolved first

2015-08-21 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 e5e601739 -> 817c38a0a


[SPARK-10130] [SQL] type coercion for IF should have children resolved first

Type coercion for IF should have its children resolved first, or we could hit an 
unresolved-expression exception.

Author: Daoyuan Wang daoyuan.w...@intel.com

Closes #8331 from adrian-wang/spark10130.

(cherry picked from commit 3c462f5d87a9654c5a68fd658a40f5062029fd9a)
Signed-off-by: Michael Armbrust mich...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/817c38a0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/817c38a0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/817c38a0

Branch: refs/heads/branch-1.5
Commit: 817c38a0a1405c2bf407070e13e16934c777cd89
Parents: e5e6017
Author: Daoyuan Wang daoyuan.w...@intel.com
Authored: Fri Aug 21 12:21:51 2015 -0700
Committer: Michael Armbrust mich...@databricks.com
Committed: Fri Aug 21 12:22:08 2015 -0700

--
 .../apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala | 1 +
 .../src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala   | 7 +++
 2 files changed, 8 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/817c38a0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
index f2f2ba2..2cb067f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
@@ -639,6 +639,7 @@ object HiveTypeCoercion {
    */
   object IfCoercion extends Rule[LogicalPlan] {
     def apply(plan: LogicalPlan): LogicalPlan = plan resolveExpressions {
+      case e if !e.childrenResolved => e
       // Find tightest common type for If, if the true value and false value have different types.
       case i @ If(pred, left, right) if left.dataType != right.dataType =>
         findTightestCommonTypeToString(left.dataType, right.dataType).map { widestType =>

http://git-wip-us.apache.org/repos/asf/spark/blob/817c38a0/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index da50aec..dcb4e83 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -1679,4 +1679,11 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
       checkAnswer(sqlContext.table("`db.t`"), df)
     }
   }
+
+  test("SPARK-10130 type coercion for IF should have children resolved first") {
+    val df = Seq((1, 1), (-1, 1)).toDF("key", "value")
+    df.registerTempTable("src")
+    checkAnswer(
+      sql("SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp"), Seq(Row(1), Row(0)))
+  }
 }





spark git commit: [SPARK-10130] [SQL] type coercion for IF should have children resolved first

2015-08-21 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/master 708036c1d -> 3c462f5d8


[SPARK-10130] [SQL] type coercion for IF should have children resolved first

Type coercion for IF should have its children resolved first, or we could hit an 
unresolved-expression exception.

Author: Daoyuan Wang daoyuan.w...@intel.com

Closes #8331 from adrian-wang/spark10130.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3c462f5d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3c462f5d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3c462f5d

Branch: refs/heads/master
Commit: 3c462f5d87a9654c5a68fd658a40f5062029fd9a
Parents: 708036c
Author: Daoyuan Wang daoyuan.w...@intel.com
Authored: Fri Aug 21 12:21:51 2015 -0700
Committer: Michael Armbrust mich...@databricks.com
Committed: Fri Aug 21 12:21:51 2015 -0700

--
 .../apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala | 1 +
 .../src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala   | 7 +++
 2 files changed, 8 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3c462f5d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
index f2f2ba2..2cb067f 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
@@ -639,6 +639,7 @@ object HiveTypeCoercion {
    */
   object IfCoercion extends Rule[LogicalPlan] {
     def apply(plan: LogicalPlan): LogicalPlan = plan resolveExpressions {
+      case e if !e.childrenResolved => e
       // Find tightest common type for If, if the true value and false value have different types.
       case i @ If(pred, left, right) if left.dataType != right.dataType =>
         findTightestCommonTypeToString(left.dataType, right.dataType).map { widestType =>

http://git-wip-us.apache.org/repos/asf/spark/blob/3c462f5d/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
--
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index da50aec..dcb4e83 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -1679,4 +1679,11 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
       checkAnswer(sqlContext.table("`db.t`"), df)
     }
   }
+
+  test("SPARK-10130 type coercion for IF should have children resolved first") {
+    val df = Seq((1, 1), (-1, 1)).toDF("key", "value")
+    df.registerTempTable("src")
+    checkAnswer(
+      sql("SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp"), Seq(Row(1), Row(0)))
+  }
 }





spark git commit: [SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in PySpark-Streaming transform function

2015-08-21 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 817c38a0a -> 4e72839b7


[SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in 
PySpark-Streaming transform function

Details of the bug and explanations can be seen in 
[SPARK-10122](https://issues.apache.org/jira/browse/SPARK-10122).

tdas, please help to review.

Author: jerryshao ss...@hortonworks.com

Closes #8347 from jerryshao/SPARK-10122 and squashes the following commits:

4039b16 [jerryshao] Fix getOffsetRanges in transform() bug


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4e72839b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4e72839b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4e72839b

Branch: refs/heads/branch-1.5
Commit: 4e72839b7b1e0b925837b49534a07188a603d838
Parents: 817c38a
Author: jerryshao ss...@hortonworks.com
Authored: Fri Aug 21 13:10:11 2015 -0700
Committer: Tathagata Das tathagata.das1...@gmail.com
Committed: Fri Aug 21 13:17:48 2015 -0700

--
 python/pyspark/streaming/dstream.py | 5 -
 python/pyspark/streaming/tests.py   | 4 +++-
 2 files changed, 7 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4e72839b/python/pyspark/streaming/dstream.py
--
diff --git a/python/pyspark/streaming/dstream.py 
b/python/pyspark/streaming/dstream.py
index 8dcb964..698336c 100644
--- a/python/pyspark/streaming/dstream.py
+++ b/python/pyspark/streaming/dstream.py
@@ -610,7 +610,10 @@ class TransformedDStream(DStream):
         self.is_checkpointed = False
         self._jdstream_val = None
 
-        if (isinstance(prev, TransformedDStream) and
+        # Use type() rather than isinstance() so that we only fold functions and compact
+        # DStreams that are strictly TransformedDStream objects.
+        # This avoids a bug in KafkaTransformedDStream when calling offsetRanges().
+        if (type(prev) is TransformedDStream and
                 not prev.is_cached and not prev.is_checkpointed):
             prev_func = prev.func
             self.func = lambda t, rdd: func(t, prev_func(t, rdd))

http://git-wip-us.apache.org/repos/asf/spark/blob/4e72839b/python/pyspark/streaming/tests.py
--
diff --git a/python/pyspark/streaming/tests.py 
b/python/pyspark/streaming/tests.py
index 6108c84..214d5be 100644
--- a/python/pyspark/streaming/tests.py
+++ b/python/pyspark/streaming/tests.py
@@ -850,7 +850,9 @@ class KafkaStreamTests(PySparkStreamingTestCase):
             offsetRanges.append(o)
             return rdd
 
-        stream.transform(transformWithOffsetRanges).foreachRDD(lambda rdd: rdd.count())
+        # Test whether it is OK to mix KafkaTransformedDStream and TransformedDStream together;
+        # only the TransformedDStreams can be folded together.
+        stream.transform(transformWithOffsetRanges).map(lambda kv: kv[1]).count().pprint()
         self.ssc.start()
         self.wait_for(offsetRanges, 1)
 





[3/3] spark git commit: [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation

2015-08-21 Thread meng
[SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation

Author: MechCoder manojkumarsivaraj...@gmail.com

Closes #8352 from MechCoder/since.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5b028ed
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5b028ed
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5b028ed

Branch: refs/heads/master
Commit: f5b028ed2f1ad6de43c8b50ebf480e1b6c047035
Parents: d89cc38
Author: MechCoder manojkumarsivaraj...@gmail.com
Authored: Fri Aug 21 14:19:24 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Fri Aug 21 14:19:24 2015 -0700

--
 .../classification/ClassificationModel.scala|   8 +-
 .../classification/LogisticRegression.scala |  30 +++---
 .../spark/mllib/classification/NaiveBayes.scala |   7 +-
 .../apache/spark/mllib/classification/SVM.scala |  28 ++---
 .../mllib/clustering/GaussianMixture.scala  |  28 ++---
 .../mllib/clustering/GaussianMixtureModel.scala |  28 ++---
 .../apache/spark/mllib/clustering/KMeans.scala  |  50 -
 .../spark/mllib/clustering/KMeansModel.scala|  27 ++---
 .../org/apache/spark/mllib/clustering/LDA.scala |  56 +-
 .../spark/mllib/clustering/LDAModel.scala   |  69 +---
 .../spark/mllib/clustering/LDAOptimizer.scala   |  24 ++---
 .../clustering/PowerIterationClustering.scala   |  38 +++
 .../mllib/clustering/StreamingKMeans.scala  |  35 +++---
 .../BinaryClassificationMetrics.scala   |  26 ++---
 .../mllib/evaluation/MulticlassMetrics.scala|  20 ++--
 .../mllib/evaluation/MultilabelMetrics.scala|   9 +-
 .../spark/mllib/evaluation/RankingMetrics.scala |  10 +-
 .../mllib/evaluation/RegressionMetrics.scala|  14 +--
 .../spark/mllib/fpm/AssociationRules.scala  |  20 ++--
 .../org/apache/spark/mllib/fpm/FPGrowth.scala   |  22 ++--
 .../apache/spark/mllib/linalg/Matrices.scala| 106 ---
 .../linalg/SingularValueDecomposition.scala |   4 +-
 .../org/apache/spark/mllib/linalg/Vectors.scala |  90 +---
 .../mllib/linalg/distributed/BlockMatrix.scala  |  88 +++
 .../linalg/distributed/CoordinateMatrix.scala   |  40 +++
 .../linalg/distributed/DistributedMatrix.scala  |   4 +-
 .../linalg/distributed/IndexedRowMatrix.scala   |  38 +++
 .../mllib/linalg/distributed/RowMatrix.scala|  39 +++
 .../apache/spark/mllib/recommendation/ALS.scala |  22 ++--
 .../MatrixFactorizationModel.scala  |  28 +++--
 .../regression/GeneralizedLinearAlgorithm.scala |  24 ++---
 .../mllib/regression/IsotonicRegression.scala   |  22 ++--
 .../spark/mllib/regression/LabeledPoint.scala   |   7 +-
 .../apache/spark/mllib/regression/Lasso.scala   |  25 ++---
 .../mllib/regression/LinearRegression.scala |  25 ++---
 .../mllib/regression/RegressionModel.scala  |  12 +--
 .../mllib/regression/RidgeRegression.scala  |  25 ++---
 .../regression/StreamingLinearAlgorithm.scala   |  18 ++--
 .../apache/spark/mllib/stat/KernelDensity.scala |  12 +--
 .../stat/MultivariateOnlineSummarizer.scala |  24 ++---
 .../stat/MultivariateStatisticalSummary.scala   |  19 ++--
 .../apache/spark/mllib/stat/Statistics.scala|  30 +++---
 .../distribution/MultivariateGaussian.scala |   8 +-
 .../apache/spark/mllib/tree/DecisionTree.scala  |  28 +++--
 .../spark/mllib/tree/GradientBoostedTrees.scala |  20 ++--
 .../apache/spark/mllib/tree/RandomForest.scala  |  20 ++--
 .../spark/mllib/tree/configuration/Algo.scala   |   4 +-
 .../tree/configuration/BoostingStrategy.scala   |  12 +--
 .../mllib/tree/configuration/FeatureType.scala  |   4 +-
 .../tree/configuration/QuantileStrategy.scala   |   4 +-
 .../mllib/tree/configuration/Strategy.scala |  24 ++---
 .../spark/mllib/tree/impurity/Entropy.scala |  10 +-
 .../apache/spark/mllib/tree/impurity/Gini.scala |  10 +-
 .../spark/mllib/tree/impurity/Impurity.scala|   8 +-
 .../spark/mllib/tree/impurity/Variance.scala|  10 +-
 .../spark/mllib/tree/loss/AbsoluteError.scala   |   6 +-
 .../apache/spark/mllib/tree/loss/LogLoss.scala  |   6 +-
 .../org/apache/spark/mllib/tree/loss/Loss.scala |   8 +-
 .../apache/spark/mllib/tree/loss/Losses.scala   |  10 +-
 .../spark/mllib/tree/loss/SquaredError.scala|   6 +-
 .../mllib/tree/model/DecisionTreeModel.scala|  22 ++--
 .../mllib/tree/model/InformationGainStats.scala |   4 +-
 .../apache/spark/mllib/tree/model/Node.scala|   8 +-
 .../apache/spark/mllib/tree/model/Predict.scala |   4 +-
 .../apache/spark/mllib/tree/model/Split.scala   |   4 +-
 .../mllib/tree/model/treeEnsembleModels.scala   |  26 +++--
 .../org/apache/spark/mllib/tree/package.scala   |   1 -
 .../org/apache/spark/mllib/util/MLUtils.scala   |  36 +++
 68 files changed, 692 insertions(+), 862 deletions(-)

[2/3] spark git commit: Version update for Spark 1.5.0 and add CHANGES.txt file.

2015-08-21 Thread rxin
http://git-wip-us.apache.org/repos/asf/spark/blob/f65759e3/CHANGES.txt
--
diff --git a/CHANGES.txt b/CHANGES.txt
new file mode 100644
index 000..95f80d8
--- /dev/null
+++ b/CHANGES.txt
@@ -0,0 +1,24649 @@
+Spark Change Log
+
+
+Release 1.5.0
+
+  [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the 
min split size if necessary.
+  Yin Huai yh...@databricks.com
+  2015-08-21 14:30:00 -0700
+  Commit: 14c8c0c, github.com/apache/spark/pull/8346
+
+  [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since 
annotation
+  MechCoder manojkumarsivaraj...@gmail.com
+  2015-08-21 14:19:24 -0700
+  Commit: e7db876, github.com/apache/spark/pull/8352
+
+  [SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in 
PySpark-Streaming transform function
+  jerryshao ss...@hortonworks.com
+  2015-08-21 13:10:11 -0700
+  Commit: 4e72839, github.com/apache/spark/pull/8347
+
+  [SPARK-10130] [SQL] type coercion for IF should have children resolved first
+  Daoyuan Wang daoyuan.w...@intel.com
+  2015-08-21 12:21:51 -0700
+  Commit: 817c38a, github.com/apache/spark/pull/8331
+
+  [SPARK-9846] [DOCS] User guide for Multilayer Perceptron Classifier
+  Alexander Ulanov na...@yandex.ru
+  2015-08-20 20:02:27 -0700
+  Commit: e5e6017, github.com/apache/spark/pull/8262
+
+  [SPARK-10140] [DOC] add target fields to @Since
+  Xiangrui Meng m...@databricks.com
+  2015-08-20 20:01:13 -0700
+  Commit: 04ef52a, github.com/apache/spark/pull/8344
+
+  Preparing development version 1.5.1-SNAPSHOT
+  Patrick Wendell pwend...@gmail.com
+  2015-08-20 16:24:12 -0700
+  Commit: 988e838
+
+  Preparing Spark release v1.5.0-rc1
+  Patrick Wendell pwend...@gmail.com
+  2015-08-20 16:24:07 -0700
+  Commit: 4c56ad7
+
+  Preparing development version 1.5.0-SNAPSHOT
+  Patrick Wendell pwend...@gmail.com
+  2015-08-20 15:33:10 -0700
+  Commit: 175c1d9
+
+  Preparing Spark release v1.5.0-rc1
+  Patrick Wendell pwend...@gmail.com
+  2015-08-20 15:33:04 -0700
+  Commit: d837d51
+
+  [SPARK-9245] [MLLIB] LDA topic assignments
+  Joseph K. Bradley jos...@databricks.com
+  2015-08-20 15:01:31 -0700
+  Commit: 2beea65, github.com/apache/spark/pull/8329
+
+  [SPARK-10108] Add since tags to mllib.feature
+  MechCoder manojkumarsivaraj...@gmail.com
+  2015-08-20 14:56:08 -0700
+  Commit: 560ec12, github.com/apache/spark/pull/8309
+
+  [SPARK-10138] [ML] move setters to MultilayerPerceptronClassifier and add 
Java test suite
+  Xiangrui Meng m...@databricks.com
+  2015-08-20 14:47:04 -0700
+  Commit: 2e0d2a9, github.com/apache/spark/pull/8342
+
+  Preparing development version 1.5.0-SNAPSHOT
+  Patrick Wendell pwend...@gmail.com
+  2015-08-20 12:43:13 -0700
+  Commit: eac31ab
+
+  Preparing Spark release v1.5.0-rc1
+  Patrick Wendell pwend...@gmail.com
+  2015-08-20 12:43:08 -0700
+  Commit: 99eeac8
+
+  [SPARK-10126] [PROJECT INFRA] Fix typo in release-build.sh which broke 
snapshot publishing for Scala 2.11
+  Josh Rosen joshro...@databricks.com
+  2015-08-20 11:31:03 -0700
+  Commit: 6026f4f, github.com/apache/spark/pull/8325
+
+  Preparing development version 1.5.0-SNAPSHOT
+  Patrick Wendell pwend...@gmail.com
+  2015-08-20 11:06:41 -0700
+  Commit: a1785e3
+
+  Preparing Spark release v1.5.0-rc1
+  Patrick Wendell pwend...@gmail.com
+  2015-08-20 11:06:31 -0700
+  Commit: 19b92c8
+
+  [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array
+  Cheng Lian l...@databricks.com
+  2015-08-20 11:00:24 -0700
+  Commit: 2f47e09, github.com/apache/spark/pull/8341
+
+  [SPARK-9982] [SPARKR] SparkR DataFrame fail to return data of Decimal type
+  Alex Shkurenko ashkure...@enova.com
+  2015-08-20 10:16:38 -0700
+  Commit: a7027e6, github.com/apache/spark/pull/8239
+
+  [MINOR] [SQL] Fix sphinx warnings in PySpark SQL
+  MechCoder manojkumarsivaraj...@gmail.com
+  2015-08-20 10:05:31 -0700
+  Commit: 257e9d7, github.com/apache/spark/pull/8171
+
+  [SPARK-10100] [SQL] Eliminate hash table lookup if there is no grouping key 
in aggregation.
+  Reynold Xin r...@databricks.com
+  2015-08-20 07:53:27 -0700
+  Commit: 5be5175, github.com/apache/spark/pull/8332
+
+  [SPARK-10092] [SQL] Backports #8324 to branch-1.5
+  Yin Huai yh...@databricks.com
+  2015-08-20 18:43:24 +0800
+  Commit: 675e224, github.com/apache/spark/pull/8336
+
+  [SPARK-10128] [STREAMING] Used correct classloader to deserialize WAL data
+  Tathagata Das tathagata.das1...@gmail.com
+  2015-08-19 21:15:58 -0700
+  Commit: 71aa547, github.com/apache/spark/pull/8328
+
+  [SPARK-10125] [STREAMING] Fix a potential deadlock in JobGenerator.stop
+  zsxwing zsxw...@gmail.com
+  2015-08-19 19:43:09 -0700
+  Commit: 63922fa, github.com/apache/spark/pull/8326
+
+  [SPARK-10124] [MESOS] Fix removing queued driver in mesos cluster mode.
+  Timothy Chen tnac...@gmail.com
+  2015-08-19 19:43:26 -0700
+  Commit: a3ed2c3, github.com/apache/spark/pull/8322
+
+  [SPARK-9812] 

[1/3] spark git commit: Version update for Spark 1.5.0 and add CHANGES.txt file.

2015-08-21 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 14c8c0c0d -> f65759e3a


http://git-wip-us.apache.org/repos/asf/spark/blob/f65759e3/core/src/main/scala/org/apache/spark/package.scala
--
diff --git a/core/src/main/scala/org/apache/spark/package.scala 
b/core/src/main/scala/org/apache/spark/package.scala
index 8ae76c5..5e3b75b 100644
--- a/core/src/main/scala/org/apache/spark/package.scala
+++ b/core/src/main/scala/org/apache/spark/package.scala
@@ -42,6 +42,5 @@ package org.apache
  */
 
 package object spark {
-  // For package docs only
-  val SPARK_VERSION = "1.5.0-SNAPSHOT"
+  val SPARK_VERSION = "1.5.0"
 }

http://git-wip-us.apache.org/repos/asf/spark/blob/f65759e3/dev/create-release/generate-changelist.py
--
diff --git a/dev/create-release/generate-changelist.py 
b/dev/create-release/generate-changelist.py
index 2e1a35a..37e5651 100755
--- a/dev/create-release/generate-changelist.py
+++ b/dev/create-release/generate-changelist.py
@@ -31,8 +31,8 @@ import time
 import traceback
 
 SPARK_HOME = os.environ["SPARK_HOME"]
-NEW_RELEASE_VERSION = "1.0.0"
-PREV_RELEASE_GIT_TAG = "v0.9.1"
+NEW_RELEASE_VERSION = "1.5.0"
+PREV_RELEASE_GIT_TAG = "v1.4.0"
 
 CHANGELIST = "CHANGES.txt"
 OLD_CHANGELIST = "%s.old" % (CHANGELIST)

http://git-wip-us.apache.org/repos/asf/spark/blob/f65759e3/docs/_config.yml
--
diff --git a/docs/_config.yml b/docs/_config.yml
index c0e031a..e3a447e 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -14,7 +14,7 @@ include:
 
 # These allow the documentation to be updated with newer releases
 # of Spark, Scala, and Mesos.
-SPARK_VERSION: 1.5.0-SNAPSHOT
+SPARK_VERSION: 1.5.0
 SPARK_VERSION_SHORT: 1.5.0
 SCALA_BINARY_VERSION: 2.10
 SCALA_VERSION: 2.10.4

http://git-wip-us.apache.org/repos/asf/spark/blob/f65759e3/ec2/spark_ec2.py
--
diff --git a/ec2/spark_ec2.py b/ec2/spark_ec2.py
index 11fd7ee..ccc897f 100755
--- a/ec2/spark_ec2.py
+++ b/ec2/spark_ec2.py
@@ -51,7 +51,7 @@ else:
     raw_input = input
     xrange = range
 
-SPARK_EC2_VERSION = "1.4.0"
+SPARK_EC2_VERSION = "1.5.0"
 SPARK_EC2_DIR = os.path.dirname(os.path.realpath(__file__))
 
 VALID_SPARK_VERSIONS = set([
@@ -71,6 +71,7 @@ VALID_SPARK_VERSIONS = set([
     "1.3.0",
     "1.3.1",
     "1.4.0",
+    "1.5.0"
 ])
 
 SPARK_TACHYON_MAP = {
@@ -84,6 +85,7 @@ SPARK_TACHYON_MAP = {
     "1.3.0": "0.5.0",
     "1.3.1": "0.5.0",
     "1.4.0": "0.6.4",
+    "1.5.0": "0.7.1"
 }
 
 DEFAULT_SPARK_VERSION = SPARK_EC2_VERSION
@@ -91,7 +93,7 @@ DEFAULT_SPARK_GITHUB_REPO = "https://github.com/apache/spark"
 
 # Default location to get the spark-ec2 scripts (and ami-list) from
 DEFAULT_SPARK_EC2_GITHUB_REPO = "https://github.com/amplab/spark-ec2"
-DEFAULT_SPARK_EC2_BRANCH = "branch-1.4"
+DEFAULT_SPARK_EC2_BRANCH = "branch-1.5"
 
 
 def setup_external_libs(libs):





[3/3] spark git commit: Version update for Spark 1.5.0 and add CHANGES.txt file.

2015-08-21 Thread rxin
Version update for Spark 1.5.0 and add CHANGES.txt file.

Author: Reynold Xin r...@databricks.com

Closes #8365 from rxin/1.5-update.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f65759e3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f65759e3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f65759e3

Branch: refs/heads/branch-1.5
Commit: f65759e3aa947e96c9db7e976a8f8018979e065d
Parents: 14c8c0c
Author: Reynold Xin r...@databricks.com
Authored: Fri Aug 21 14:54:45 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Fri Aug 21 14:54:45 2015 -0700

--
 CHANGES.txt | 24649 +
 .../main/scala/org/apache/spark/package.scala   | 3 +-
 dev/create-release/generate-changelist.py   | 4 +-
 docs/_config.yml| 2 +-
 ec2/spark_ec2.py| 6 +-
 5 files changed, 24657 insertions(+), 7 deletions(-)
--






spark git commit: [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the min split size if necessary.

2015-08-21 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/master f5b028ed2 -> e3355090d


[SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the 
min split size if necessary.

https://issues.apache.org/jira/browse/SPARK-10143

With this PR, we will set the min split size to parquet's block size (row group 
size) set in the conf if the min split size is smaller. This way, we can avoid having 
too many tasks, and even useless tasks, for reading parquet data.

I tested it locally. The table I have is 343MB and it is in my local FS. 
Because I did not set any min/max split size, the default split size was 32MB 
and the map stage had 11 tasks, but there were only three tasks that actually 
read data. With my PR, there were only three tasks in the map stage. Here is 
the difference.

Without this PR:
![image](https://cloud.githubusercontent.com/assets/2072857/9399179/8587dba6-4765-11e5-9189-7ebba52a2b6d.png)

With this PR:
![image](https://cloud.githubusercontent.com/assets/2072857/9399185/a4735d74-4765-11e5-8848-1f1e361a6b4b.png)

Even if the block size setting does not match the actual block size of the parquet 
file, I think it is still generally good to use parquet's block size setting if the 
min split size is smaller than this block size.
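
The diff further down only shows the row group size being threaded into the job
setup; the split-size override itself amounts to something like the following sketch
(the helper name and the use of both the old and new Hadoop property keys are
assumptions, not taken verbatim from the patch):

```
// Sketch: raise the minimum split size to the Parquet row group size when it is smaller,
// so FileInputFormat does not produce splits that cover only a fraction of a row group
// (which yields tasks that read no data at all).
import org.apache.hadoop.conf.Configuration

def overrideMinSplitSize(parquetBlockSize: Long, conf: Configuration): Unit = {
  val minSplitSize = math.max(
    conf.getLong("mapred.min.split.size", 0L),
    conf.getLong("mapreduce.input.fileinputformat.split.minsize", 0L))
  if (parquetBlockSize > minSplitSize) {
    // Set both the deprecated and the current Hadoop property names.
    conf.set("mapred.min.split.size", parquetBlockSize.toString)
    conf.set("mapreduce.input.fileinputformat.split.minsize", parquetBlockSize.toString)
  }
}
```

With the minimum split size at least one row group, splits that contain no complete
row group go away, which is what created the useless tasks described above.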

Tested it on a cluster using
```
val count = sqlContext.table("store_sales").groupBy().count().queryExecution.executedPlan(3).execute().count
```
Basically, it reads 0 columns of table `store_sales`. My table has 1824 parquet 
files with size from 80MB to 280MB (1 to 3 row group sizes). Without this 
patch, in a 16 worker cluster, the job had 5023 tasks and spent 102s. With this 
patch, the job had 2893 tasks and spent 64s. It is still not as good as using 
one mapper per file (1824 tasks and 42s), but it is much better than our master.

Author: Yin Huai yh...@databricks.com

Closes #8346 from yhuai/parquetMinSplit.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e3355090
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e3355090
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e3355090

Branch: refs/heads/master
Commit: e3355090d4030daffed5efb0959bf1d724c13c13
Parents: f5b028e
Author: Yin Huai yh...@databricks.com
Authored: Fri Aug 21 14:30:00 2015 -0700
Committer: Yin Huai yh...@databricks.com
Committed: Fri Aug 21 14:30:00 2015 -0700

--
 .../datasources/parquet/ParquetRelation.scala   | 41 +++-
 1 file changed, 39 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e3355090/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
index 68169d4..bbf682a 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
@@ -26,6 +26,7 @@ import scala.collection.mutable
 import scala.util.{Failure, Try}
 
 import com.google.common.base.Objects
+import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileStatus, Path}
 import org.apache.hadoop.io.Writable
 import org.apache.hadoop.mapreduce._
@@ -281,12 +282,18 @@ private[sql] class ParquetRelation(
     val assumeInt96IsTimestamp = sqlContext.conf.isParquetINT96AsTimestamp
     val followParquetFormatSpec = sqlContext.conf.followParquetFormatSpec
 
+    // Parquet row group size. We will use this value as the value for
+    // mapreduce.input.fileinputformat.split.minsize and mapred.min.split.size if the value
+    // of these flags are smaller than the parquet row group size.
+    val parquetBlockSize = ParquetOutputFormat.getLongBlockSize(broadcastedConf.value.value)
+
     // Create the function to set variable Parquet confs at both driver and executor side.
     val initLocalJobFuncOpt =
       ParquetRelation.initializeLocalJobFunc(
         requiredColumns,
         filters,
         dataSchema,
+        parquetBlockSize,
         useMetadataCache,
         parquetFilterPushDown,
         assumeBinaryIsString,
@@ -294,7 +301,8 @@ private[sql] class ParquetRelation(
         followParquetFormatSpec) _
 
     // Create the function to set input paths at the driver side.
-    val setInputPaths = ParquetRelation.initializeDriverSideJobFunc(inputFiles) _
+    val setInputPaths =
+      ParquetRelation.initializeDriverSideJobFunc(inputFiles, parquetBlockSize) _
 
     Utils.withDummyCallSite(sqlContext.sparkContext) {
       new SqlNewHadoopRDD(
@@ -482,11 +490,35 @@ private[sql] 

spark git commit: [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the min split size if necessary.

2015-08-21 Thread yhuai
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 e7db8761b -> 14c8c0c0d


[SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the 
min split size if necessary.

https://issues.apache.org/jira/browse/SPARK-10143

With this PR, we will set the min split size to parquet's block size (row group 
size) set in the conf if the min split size is smaller. This way, we can avoid having 
too many tasks, and even useless tasks, for reading parquet data.

I tested it locally. The table I have is 343MB and it is in my local FS. 
Because I did not set any min/max split size, the default split size was 32MB 
and the map stage had 11 tasks, but there were only three tasks that actually 
read data. With my PR, there were only three tasks in the map stage. Here is 
the difference.

Without this PR:
![image](https://cloud.githubusercontent.com/assets/2072857/9399179/8587dba6-4765-11e5-9189-7ebba52a2b6d.png)

With this PR:
![image](https://cloud.githubusercontent.com/assets/2072857/9399185/a4735d74-4765-11e5-8848-1f1e361a6b4b.png)

Even if the block size setting does not match the actual block size of the parquet 
file, I think it is still generally good to use parquet's block size setting if the 
min split size is smaller than this block size.

Tested it on a cluster using
```
val count = sqlContext.table("store_sales").groupBy().count().queryExecution.executedPlan(3).execute().count
```
Basically, it reads 0 columns of table `store_sales`. My table has 1824 parquet 
files with size from 80MB to 280MB (1 to 3 row group sizes). Without this 
patch, in a 16 worker cluster, the job had 5023 tasks and spent 102s. With this 
patch, the job had 2893 tasks and spent 64s. It is still not as good as using 
one mapper per file (1824 tasks and 42s), but it is much better than our master.

Author: Yin Huai yh...@databricks.com

Closes #8346 from yhuai/parquetMinSplit.

(cherry picked from commit e3355090d4030daffed5efb0959bf1d724c13c13)
Signed-off-by: Yin Huai yh...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/14c8c0c0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/14c8c0c0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/14c8c0c0

Branch: refs/heads/branch-1.5
Commit: 14c8c0c0da1184c587f0d5ab60f1d56feaa588e4
Parents: e7db876
Author: Yin Huai yh...@databricks.com
Authored: Fri Aug 21 14:30:00 2015 -0700
Committer: Yin Huai yh...@databricks.com
Committed: Fri Aug 21 14:30:12 2015 -0700

--
 .../datasources/parquet/ParquetRelation.scala   | 41 +++-
 1 file changed, 39 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/14c8c0c0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
index 68169d4..bbf682a 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
@@ -26,6 +26,7 @@ import scala.collection.mutable
 import scala.util.{Failure, Try}
 
 import com.google.common.base.Objects
+import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileStatus, Path}
 import org.apache.hadoop.io.Writable
 import org.apache.hadoop.mapreduce._
@@ -281,12 +282,18 @@ private[sql] class ParquetRelation(
     val assumeInt96IsTimestamp = sqlContext.conf.isParquetINT96AsTimestamp
     val followParquetFormatSpec = sqlContext.conf.followParquetFormatSpec
 
+    // Parquet row group size. We will use this value as the value for
+    // mapreduce.input.fileinputformat.split.minsize and mapred.min.split.size if the value
+    // of these flags are smaller than the parquet row group size.
+    val parquetBlockSize = ParquetOutputFormat.getLongBlockSize(broadcastedConf.value.value)
+
    // Create the function to set variable Parquet confs at both driver and executor side.
     val initLocalJobFuncOpt =
       ParquetRelation.initializeLocalJobFunc(
         requiredColumns,
         filters,
         dataSchema,
+        parquetBlockSize,
         useMetadataCache,
         parquetFilterPushDown,
         assumeBinaryIsString,
@@ -294,7 +301,8 @@ private[sql] class ParquetRelation(
         followParquetFormatSpec) _
 
     // Create the function to set input paths at the driver side.
-    val setInputPaths = ParquetRelation.initializeDriverSideJobFunc(inputFiles) _
+    val setInputPaths =
+      ParquetRelation.initializeDriverSideJobFunc(inputFiles, parquetBlockSize) _
 

[1/2] spark git commit: Preparing Spark release v1.5.0-rc2

2015-08-21 Thread pwendell
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 f65759e3a -> 914da3593


Preparing Spark release v1.5.0-rc2


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e2569282
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e2569282
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e2569282

Branch: refs/heads/branch-1.5
Commit: e2569282a80570b25959e968898a53d5fad67bbb
Parents: f65759e
Author: Patrick Wendell pwend...@gmail.com
Authored: Fri Aug 21 14:56:43 2015 -0700
Committer: Patrick Wendell pwend...@gmail.com
Committed: Fri Aug 21 14:56:43 2015 -0700

--
 assembly/pom.xml| 2 +-
 bagel/pom.xml   | 2 +-
 core/pom.xml| 2 +-
 examples/pom.xml| 2 +-
 external/flume-assembly/pom.xml | 2 +-
 external/flume-sink/pom.xml | 2 +-
 external/flume/pom.xml  | 2 +-
 external/kafka-assembly/pom.xml | 2 +-
 external/kafka/pom.xml  | 2 +-
 external/mqtt-assembly/pom.xml  | 2 +-
 external/mqtt/pom.xml   | 2 +-
 external/twitter/pom.xml| 2 +-
 external/zeromq/pom.xml | 2 +-
 extras/java8-tests/pom.xml  | 2 +-
 extras/kinesis-asl-assembly/pom.xml | 2 +-
 extras/kinesis-asl/pom.xml  | 2 +-
 extras/spark-ganglia-lgpl/pom.xml   | 2 +-
 graphx/pom.xml  | 2 +-
 launcher/pom.xml| 2 +-
 mllib/pom.xml   | 2 +-
 network/common/pom.xml  | 2 +-
 network/shuffle/pom.xml | 2 +-
 network/yarn/pom.xml| 2 +-
 pom.xml | 2 +-
 repl/pom.xml| 2 +-
 sql/catalyst/pom.xml| 2 +-
 sql/core/pom.xml| 2 +-
 sql/hive-thriftserver/pom.xml   | 2 +-
 sql/hive/pom.xml| 2 +-
 streaming/pom.xml   | 2 +-
 tools/pom.xml   | 2 +-
 unsafe/pom.xml  | 2 +-
 yarn/pom.xml| 2 +-
 33 files changed, 33 insertions(+), 33 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e2569282/assembly/pom.xml
--
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 7b41ebb..3ef7d6f 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.1-SNAPSHOT</version>
+    <version>1.5.0</version>
     <relativePath>../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/e2569282/bagel/pom.xml
--
diff --git a/bagel/pom.xml b/bagel/pom.xml
index 16bf17c..684e07b 100644
--- a/bagel/pom.xml
+++ b/bagel/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.1-SNAPSHOT</version>
+    <version>1.5.0</version>
     <relativePath>../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/e2569282/core/pom.xml
--
diff --git a/core/pom.xml b/core/pom.xml
index beb547f..6b082ad 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.1-SNAPSHOT</version>
+    <version>1.5.0</version>
     <relativePath>../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/e2569282/examples/pom.xml
--
diff --git a/examples/pom.xml b/examples/pom.xml
index 3926b79..9ef1eda 100644
--- a/examples/pom.xml
+++ b/examples/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.1-SNAPSHOT</version>
+    <version>1.5.0</version>
     <relativePath>../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/e2569282/external/flume-assembly/pom.xml
--
diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml
index bdd5037..fe878e6 100644
--- a/external/flume-assembly/pom.xml
+++ b/external/flume-assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.1-SNAPSHOT</version>
+    <version>1.5.0</version>
     <relativePath>../../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/e2569282/external/flume-sink/pom.xml
--
diff --git 

Git Push Summary

2015-08-21 Thread pwendell
Repository: spark
Updated Tags:  refs/tags/v1.5.0-rc2 [created] e2569282a




[2/2] spark git commit: Preparing development version 1.5.0-SNAPSHOT

2015-08-21 Thread pwendell
Preparing development version 1.5.0-SNAPSHOT


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/914da359
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/914da359
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/914da359

Branch: refs/heads/branch-1.5
Commit: 914da3593e288543b7666423d6612cb1da792668
Parents: e256928
Author: Patrick Wendell pwend...@gmail.com
Authored: Fri Aug 21 14:56:50 2015 -0700
Committer: Patrick Wendell pwend...@gmail.com
Committed: Fri Aug 21 14:56:50 2015 -0700

--
 assembly/pom.xml| 2 +-
 bagel/pom.xml   | 2 +-
 core/pom.xml| 2 +-
 examples/pom.xml| 2 +-
 external/flume-assembly/pom.xml | 2 +-
 external/flume-sink/pom.xml | 2 +-
 external/flume/pom.xml  | 2 +-
 external/kafka-assembly/pom.xml | 2 +-
 external/kafka/pom.xml  | 2 +-
 external/mqtt-assembly/pom.xml  | 2 +-
 external/mqtt/pom.xml   | 2 +-
 external/twitter/pom.xml| 2 +-
 external/zeromq/pom.xml | 2 +-
 extras/java8-tests/pom.xml  | 2 +-
 extras/kinesis-asl-assembly/pom.xml | 2 +-
 extras/kinesis-asl/pom.xml  | 2 +-
 extras/spark-ganglia-lgpl/pom.xml   | 2 +-
 graphx/pom.xml  | 2 +-
 launcher/pom.xml| 2 +-
 mllib/pom.xml   | 2 +-
 network/common/pom.xml  | 2 +-
 network/shuffle/pom.xml | 2 +-
 network/yarn/pom.xml| 2 +-
 pom.xml | 2 +-
 repl/pom.xml| 2 +-
 sql/catalyst/pom.xml| 2 +-
 sql/core/pom.xml| 2 +-
 sql/hive-thriftserver/pom.xml   | 2 +-
 sql/hive/pom.xml| 2 +-
 streaming/pom.xml   | 2 +-
 tools/pom.xml   | 2 +-
 unsafe/pom.xml  | 2 +-
 yarn/pom.xml| 2 +-
 33 files changed, 33 insertions(+), 33 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/914da359/assembly/pom.xml
--
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 3ef7d6f..e9c6d26 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/914da359/bagel/pom.xml
--
diff --git a/bagel/pom.xml b/bagel/pom.xml
index 684e07b..ed5c37e 100644
--- a/bagel/pom.xml
+++ b/bagel/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/914da359/core/pom.xml
--
diff --git a/core/pom.xml b/core/pom.xml
index 6b082ad..4f79d71 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/914da359/examples/pom.xml
--
diff --git a/examples/pom.xml b/examples/pom.xml
index 9ef1eda..e6884b0 100644
--- a/examples/pom.xml
+++ b/examples/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/914da359/external/flume-assembly/pom.xml
--
diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml
index fe878e6..e05e431 100644
--- a/external/flume-assembly/pom.xml
+++ b/external/flume-assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../../pom.xml</relativePath>
   </parent>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/914da359/external/flume-sink/pom.xml
--
diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml
index 

[1/3] spark git commit: [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation

2015-08-21 Thread meng
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 4e72839b7 -> e7db8761b


http://git-wip-us.apache.org/repos/asf/spark/blob/e7db8761/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala
 
b/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala
index a2ab95c..cd3ed8a 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala
@@ -20,7 +20,7 @@ package org.apache.spark.mllib.regression
 import scala.reflect.ClassTag
 
 import org.apache.spark.Logging
-import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.annotation.{DeveloperApi, Since}
 import org.apache.spark.api.java.JavaSparkContext.fakeClassTag
 import org.apache.spark.mllib.linalg.{Vector, Vectors}
 import org.apache.spark.streaming.api.java.{JavaDStream, JavaPairDStream}
@@ -54,8 +54,8 @@ import org.apache.spark.streaming.dstream.DStream
  * the model using each of the different sources, in sequence.
  *
  *
- * @since 1.1.0
  */
+@Since("1.1.0")
 @DeveloperApi
 abstract class StreamingLinearAlgorithm[
     M <: GeneralizedLinearModel,
@@ -70,8 +70,8 @@ abstract class StreamingLinearAlgorithm[
   /**
    * Return the latest model.
    *
-   * @since 1.1.0
    */
+  @Since("1.1.0")
   def latestModel(): M = {
     model.get
   }
@@ -84,8 +84,8 @@ abstract class StreamingLinearAlgorithm[
    *
    * @param data DStream containing labeled data
    *
-   * @since 1.3.0
    */
+  @Since("1.3.0")
   def trainOn(data: DStream[LabeledPoint]): Unit = {
     if (model.isEmpty) {
       throw new IllegalArgumentException("Model must be initialized before starting training.")
@@ -106,8 +106,8 @@ abstract class StreamingLinearAlgorithm[
   /**
    * Java-friendly version of `trainOn`.
    *
-   * @since 1.3.0
    */
+  @Since("1.3.0")
   def trainOn(data: JavaDStream[LabeledPoint]): Unit = trainOn(data.dstream)
 
   /**
@@ -116,8 +116,8 @@ abstract class StreamingLinearAlgorithm[
    * @param data DStream containing feature vectors
    * @return DStream containing predictions
    *
-   * @since 1.1.0
    */
+  @Since("1.1.0")
   def predictOn(data: DStream[Vector]): DStream[Double] = {
     if (model.isEmpty) {
       throw new IllegalArgumentException("Model must be initialized before starting prediction.")
@@ -128,8 +128,8 @@ abstract class StreamingLinearAlgorithm[
   /**
    * Java-friendly version of `predictOn`.
    *
-   * @since 1.1.0
    */
+  @Since("1.1.0")
   def predictOn(data: JavaDStream[Vector]): JavaDStream[java.lang.Double] = {
     JavaDStream.fromDStream(predictOn(data.dstream).asInstanceOf[DStream[java.lang.Double]])
   }
@@ -140,8 +140,8 @@ abstract class StreamingLinearAlgorithm[
    * @tparam K key type
    * @return DStream containing the input keys and the predictions as values
    *
-   * @since 1.1.0
    */
+  @Since("1.1.0")
   def predictOnValues[K: ClassTag](data: DStream[(K, Vector)]): DStream[(K, Double)] = {
     if (model.isEmpty) {
       throw new IllegalArgumentException("Model must be initialized before starting prediction")
@@ -153,8 +153,8 @@ abstract class StreamingLinearAlgorithm[
   /**
    * Java-friendly version of `predictOnValues`.
    *
-   * @since 1.3.0
    */
+  @Since("1.3.0")
   def predictOnValues[K](data: JavaPairDStream[K, Vector]): JavaPairDStream[K, java.lang.Double] = {
     implicit val tag = fakeClassTag[K]
     JavaPairDStream.fromPairDStream(

http://git-wip-us.apache.org/repos/asf/spark/blob/e7db8761/mllib/src/main/scala/org/apache/spark/mllib/stat/KernelDensity.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/stat/KernelDensity.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/stat/KernelDensity.scala
index 93a6753..4a856f7 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/stat/KernelDensity.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/stat/KernelDensity.scala
@@ -19,7 +19,7 @@ package org.apache.spark.mllib.stat
 
 import com.github.fommil.netlib.BLAS.{getInstance = blas}
 
-import org.apache.spark.annotation.Experimental
+import org.apache.spark.annotation.{Experimental, Since}
 import org.apache.spark.api.java.JavaRDD
 import org.apache.spark.rdd.RDD
 
@@ -37,8 +37,8 @@ import org.apache.spark.rdd.RDD
  *   .setBandwidth(3.0)
  * val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
  * }}}
- * @since 1.4.0
  */
+@Since(1.4.0)
 @Experimental
 class KernelDensity extends Serializable {
 
@@ -52,8 +52,8 @@ class KernelDensity extends Serializable {
 
   /**
* Sets the bandwidth (standard deviation) of the Gaussian kernel (default: 
`1.0`).
-   * @since 1.4.0
*/
+  @Since(1.4.0)
   def setBandwidth(bandwidth: 

[2/3] spark git commit: [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation

2015-08-21 Thread meng
http://git-wip-us.apache.org/repos/asf/spark/blob/f5b028ed/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala
index 7f4de77..ba3b447 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala
@@ -20,7 +20,7 @@ import scala.collection.JavaConverters._
 import scala.reflect.ClassTag
 
 import org.apache.spark.Logging
-import org.apache.spark.annotation.Experimental
+import org.apache.spark.annotation.{Experimental, Since}
 import org.apache.spark.api.java.JavaRDD
 import org.apache.spark.api.java.JavaSparkContext.fakeClassTag
 import org.apache.spark.mllib.fpm.AssociationRules.Rule
@@ -33,24 +33,22 @@ import org.apache.spark.rdd.RDD
  * Generates association rules from a [[RDD[FreqItemset[Item]]]. This method only generates
  * association rules which have a single item as the consequent.
  *
- * @since 1.5.0
  */
+@Since("1.5.0")
 @Experimental
 class AssociationRules private[fpm] (
     private var minConfidence: Double) extends Logging with Serializable {
 
   /**
    * Constructs a default instance with default parameters {minConfidence = 0.8}.
-   *
-   * @since 1.5.0
    */
+  @Since("1.5.0")
   def this() = this(0.8)
 
   /**
    * Sets the minimal confidence (default: `0.8`).
-   *
-   * @since 1.5.0
    */
+  @Since("1.5.0")
   def setMinConfidence(minConfidence: Double): this.type = {
     require(minConfidence >= 0.0 && minConfidence <= 1.0)
     this.minConfidence = minConfidence
@@ -62,8 +60,8 @@ class AssociationRules private[fpm] (
    * @param freqItemsets frequent itemset model obtained from [[FPGrowth]]
    * @return a [[Set[Rule[Item]]] containing the assocation rules.
    *
-   * @since 1.5.0
    */
+  @Since("1.5.0")
   def run[Item: ClassTag](freqItemsets: RDD[FreqItemset[Item]]): RDD[Rule[Item]] = {
     // For candidate rule X => Y, generate (X, (Y, freq(X union Y)))
     val candidates = freqItemsets.flatMap { itemset =>
@@ -102,8 +100,8 @@ object AssociationRules {
    *   instead.
    * @tparam Item item type
    *
-   * @since 1.5.0
    */
+  @Since("1.5.0")
   @Experimental
   class Rule[Item] private[fpm] (
       val antecedent: Array[Item],
@@ -114,8 +112,8 @@ object AssociationRules {
     /**
      * Returns the confidence of the rule.
      *
-     * @since 1.5.0
      */
+    @Since("1.5.0")
     def confidence: Double = freqUnion.toDouble / freqAntecedent
 
     require(antecedent.toSet.intersect(consequent.toSet).isEmpty, {
@@ -127,8 +125,8 @@ object AssociationRules {
     /**
      * Returns antecedent in a Java List.
      *
-     * @since 1.5.0
      */
+    @Since("1.5.0")
     def javaAntecedent: java.util.List[Item] = {
       antecedent.toList.asJava
     }
@@ -136,8 +134,8 @@ object AssociationRules {
     /**
      * Returns consequent in a Java List.
      *
-     * @since 1.5.0
      */
+    @Since("1.5.0")
     def javaConsequent: java.util.List[Item] = {
       consequent.toList.asJava
     }

http://git-wip-us.apache.org/repos/asf/spark/blob/f5b028ed/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
index e2370a5..e37f806 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
@@ -25,7 +25,7 @@ import scala.collection.JavaConverters._
 import scala.reflect.ClassTag
 
 import org.apache.spark.{HashPartitioner, Logging, Partitioner, SparkException}
-import org.apache.spark.annotation.Experimental
+import org.apache.spark.annotation.{Experimental, Since}
 import org.apache.spark.api.java.JavaRDD
 import org.apache.spark.api.java.JavaSparkContext.fakeClassTag
 import org.apache.spark.mllib.fpm.FPGrowth._
@@ -39,15 +39,15 @@ import org.apache.spark.storage.StorageLevel
  * @param freqItemsets frequent itemset, which is an RDD of [[FreqItemset]]
  * @tparam Item item type
  *
- * @since 1.3.0
  */
+@Since("1.3.0")
 @Experimental
 class FPGrowthModel[Item: ClassTag](val freqItemsets: RDD[FreqItemset[Item]]) extends Serializable {
   /**
    * Generates association rules for the [[Item]]s in [[freqItemsets]].
    * @param confidence minimal confidence of the rules produced
-   * @since 1.5.0
    */
+  @Since("1.5.0")
   def generateAssociationRules(confidence: Double): RDD[AssociationRules.Rule[Item]] = {
     val associationRules = new AssociationRules(confidence)
     associationRules.run(freqItemsets)
@@ -71,8 +71,8 @@ class FPGrowthModel[Item: ClassTag](val freqItemsets: RDD[FreqItemset[Item]]) ex
  * @see 

spark git commit: [SPARK-9893] User guide with Java test suite for VectorSlicer

2015-08-21 Thread meng
Repository: spark
Updated Branches:
  refs/heads/master f01c4220d -> 630a994e6


[SPARK-9893] User guide with Java test suite for VectorSlicer

Add user guide for `VectorSlicer`, with Java test suite and Python version 
VectorSlicer.

Note that Python version does not support selecting by names now.

Author: Xusen Yin yinxu...@gmail.com

Closes #8267 from yinxusen/SPARK-9893.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/630a994e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/630a994e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/630a994e

Branch: refs/heads/master
Commit: 630a994e6a9785d1704f8e7fb604f32f5dea24f8
Parents: f01c422
Author: Xusen Yin yinxu...@gmail.com
Authored: Fri Aug 21 16:30:12 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Fri Aug 21 16:30:12 2015 -0700

--
 docs/ml-features.md | 133 +++
 .../spark/ml/feature/JavaVectorSlicerSuite.java |  85 
 2 files changed, 218 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/630a994e/docs/ml-features.md
--
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 6309db9..642a4b4 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1477,6 +1477,139 @@ print(output.select("features", "clicked").first())
 </div>
 </div>
 
+# Feature Selectors
+
+## VectorSlicer
+
+`VectorSlicer` is a transformer that takes a feature vector and outputs a new feature vector with a
+sub-array of the original features. It is useful for extracting features from a vector column.
+
+`VectorSlicer` accepts a vector column with specified indices, then outputs a new vector column
+whose values are selected via those indices. There are two types of indices:
+
+ 1. Integer indices that represent positions in the vector, `setIndices()`;
+
+ 2. String indices that represent the names of features in the vector, `setNames()`.
+ *This requires the vector column to have an `AttributeGroup` since the implementation matches on
+ the name field of an `Attribute`.*
+
+Specification by integer and by string are both acceptable. Moreover, you can use integer indices
+and string names simultaneously. At least one feature must be selected. Duplicate features are not
+allowed, so there can be no overlap between selected indices and names. Note that if features are
+selected by name, an exception will be thrown when the input attributes are empty.
+
+The output vector will order features with the selected indices first (in the order given),
+followed by the selected names (in the order given).
+
+**Examples**
+
+Suppose that we have a DataFrame with the column `userFeatures`:
+
+~~~
+ userFeatures 
+--
+ [0.0, 10.0, 0.5] 
+~~~
+
+`userFeatures` is a vector column that contains three user features. Since the first entry of
+`userFeatures` is zero for all rows, we want to remove it and select only the last two entries.
+`VectorSlicer` selects the last two elements with `setIndices(1, 2)`, then produces a new vector
+column named `features`:
+
+~~~
+ userFeatures | features
+--|-
+ [0.0, 10.0, 0.5] | [10.0, 0.5]
+~~~
+
+Suppose also that we have potential input attributes for `userFeatures`, i.e.
+`["f1", "f2", "f3"]`; then we can use `setNames("f2", "f3")` to select them.
+
+~~~
+ userFeatures | features
+--|-
+ [0.0, 10.0, 0.5] | [10.0, 0.5]
+ ["f1", "f2", "f3"] | ["f2", "f3"]
+~~~
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`VectorSlicer`](api/scala/index.html#org.apache.spark.ml.feature.VectorSlicer)
 takes an input
+column name with specified indices or names and an output column name.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, 
NumericAttribute}
+import org.apache.spark.ml.feature.VectorSlicer
+import org.apache.spark.sql.types.StructType
+import org.apache.spark.sql.{DataFrame, Row, SQLContext}
+
+val data = Array(
+  Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))),
+  Vectors.dense(-2.0, 2.3, 0.0)
+)
+
+val defaultAttr = NumericAttribute.defaultAttr
+val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
+val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])
+
+val dataRDD = sc.parallelize(data).map(Row.apply)
+val dataset = sqlContext.createDataFrame(dataRDD, StructType(Array(attrGroup.toStructField())))
+
+val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
+
+slicer.setIndices(Array(1)).setNames(Array("f3"))
+// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))
+
+val output = 
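// A minimal sketch of how this example likely continues, assuming the standard
// spark.ml Transformer API (an assumption, not part of the original diff):
//   val output = slicer.transform(dataset)
//   println(output.select("userFeatures", "features").first())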

spark git commit: [SPARK-10163] [ML] Allow single-category features for GBT models

2015-08-21 Thread meng
Repository: spark
Updated Branches:
  refs/heads/master e3355090d -> f01c4220d


[SPARK-10163] [ML] Allow single-category features for GBT models

Removed categorical feature info validation since no longer needed

This is needed to make the ML user guide examples work (in another current PR).

CC: mengxr

Author: Joseph K. Bradley jos...@databricks.com

Closes #8367 from jkbradley/gbt-single-cat.
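For context, once this check is removed a tree strategy accepts a categorical feature with a single category (arity 1). A minimal sketch, assuming an existing `RDD[LabeledPoint]` named `trainingData`; the data and parameter values here are illustrative, not part of this commit:

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 10
// Feature 0 is categorical with a single category (arity 1); previously this
// tripped the removed require(arity >= 2) check in Strategy.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int](0 -> 1)

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)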


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f01c4220
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f01c4220
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f01c4220

Branch: refs/heads/master
Commit: f01c4220d2b791f470fa6596ffe11baa51517fbe
Parents: e335509
Author: Joseph K. Bradley jos...@databricks.com
Authored: Fri Aug 21 16:28:00 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Fri Aug 21 16:28:00 2015 -0700

--
 .../org/apache/spark/mllib/tree/configuration/Strategy.scala| 5 -
 1 file changed, 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f01c4220/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
index a58f01b..b74e3f1 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
@@ -158,11 +158,6 @@ class Strategy (
       s"  Valid values are integers >= 0.")
     require(maxBins >= 2, s"DecisionTree Strategy given invalid maxBins parameter: $maxBins." +
       s"  Valid values are integers >= 2.")
-    categoricalFeaturesInfo.foreach { case (feature, arity) =>
-      require(arity >= 2,
-        s"DecisionTree Strategy given invalid categoricalFeaturesInfo setting:" +
-        s" feature $feature has $arity categories.  The number of categories should be >= 2.")
-    }
     require(minInstancesPerNode >= 1,
       s"DecisionTree Strategy requires minInstancesPerNode >= 1 but was given $minInstancesPerNode")
     require(maxMemoryInMB <= 10240,


-



spark git commit: [SPARK-10163] [ML] Allow single-category features for GBT models

2015-08-21 Thread meng
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 914da3593 -> cb61c7b4e


[SPARK-10163] [ML] Allow single-category features for GBT models

Removed categorical feature info validation since no longer needed

This is needed to make the ML user guide examples work (in another current PR).

CC: mengxr

Author: Joseph K. Bradley jos...@databricks.com

Closes #8367 from jkbradley/gbt-single-cat.

(cherry picked from commit f01c4220d2b791f470fa6596ffe11baa51517fbe)
Signed-off-by: Xiangrui Meng m...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cb61c7b4
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cb61c7b4
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cb61c7b4

Branch: refs/heads/branch-1.5
Commit: cb61c7b4e3d06efdbb61795603c71b3a989d6df1
Parents: 914da35
Author: Joseph K. Bradley jos...@databricks.com
Authored: Fri Aug 21 16:28:00 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Fri Aug 21 16:28:07 2015 -0700

--
 .../org/apache/spark/mllib/tree/configuration/Strategy.scala| 5 -
 1 file changed, 5 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/cb61c7b4/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
index a58f01b..b74e3f1 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala
@@ -158,11 +158,6 @@ class Strategy (
       s"  Valid values are integers >= 0.")
     require(maxBins >= 2, s"DecisionTree Strategy given invalid maxBins parameter: $maxBins." +
       s"  Valid values are integers >= 2.")
-    categoricalFeaturesInfo.foreach { case (feature, arity) =>
-      require(arity >= 2,
-        s"DecisionTree Strategy given invalid categoricalFeaturesInfo setting:" +
-        s" feature $feature has $arity categories.  The number of categories should be >= 2.")
-    }
     require(minInstancesPerNode >= 1,
       s"DecisionTree Strategy requires minInstancesPerNode >= 1 but was given $minInstancesPerNode")
     require(maxMemoryInMB <= 10240,


-



spark git commit: [SPARK-10040] [SQL] Use batch insert for JDBC writing

2015-08-21 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master dcfe0c5cd -> bb220f657


[SPARK-10040] [SQL] Use batch insert for JDBC writing

JIRA: https://issues.apache.org/jira/browse/SPARK-10040

We should use batch inserts instead of inserting one row at a time when writing over JDBC.

Author: Liang-Chi Hsieh vii...@appier.com

Closes #8273 from viirya/jdbc-insert-batch.
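For reference, the batch size is read from the JDBC connection properties under the key `batchsize` (defaulting to 1000), so callers can tune it when writing a DataFrame. A minimal sketch, assuming an existing DataFrame `df` and a reachable database; the URL, table name, and credentials below are made up:

import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")        // hypothetical credentials
props.setProperty("password", "secret")
props.setProperty("batchsize", "5000")    // rows buffered before each executeBatch()

// Each partition opens one connection and flushes a batch every 5000 rows.
df.write.jdbc("jdbc:postgresql://dbhost:5432/testdb", "people", props)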


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb220f65
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb220f65
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb220f65

Branch: refs/heads/master
Commit: bb220f6570aa0b95598b30524224a3e82c1effbc
Parents: dcfe0c5
Author: Liang-Chi Hsieh vii...@appier.com
Authored: Fri Aug 21 01:43:49 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Fri Aug 21 01:43:49 2015 -0700

--
 .../sql/execution/datasources/jdbc/JdbcUtils.scala | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bb220f65/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
index 2d0e736..26788b2 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
@@ -88,13 +88,15 @@ object JdbcUtils extends Logging {
   table: String,
   iterator: Iterator[Row],
   rddSchema: StructType,
-  nullTypes: Array[Int]): Iterator[Byte] = {
+  nullTypes: Array[Int],
+  batchSize: Int): Iterator[Byte] = {
 val conn = getConnection()
 var committed = false
 try {
   conn.setAutoCommit(false) // Everything in the same db transaction.
   val stmt = insertStatement(conn, table, rddSchema)
   try {
+var rowCount = 0
 while (iterator.hasNext) {
   val row = iterator.next()
   val numFields = rddSchema.fields.length
@@ -122,7 +124,15 @@ object JdbcUtils extends Logging {
 }
 i = i + 1
   }
-  stmt.executeUpdate()
+  stmt.addBatch()
+  rowCount += 1
+  if (rowCount % batchSize == 0) {
+stmt.executeBatch()
+rowCount = 0
+  }
+}
+if (rowCount > 0) {
+  stmt.executeBatch()
 }
   } finally {
 stmt.close()
@@ -211,8 +221,9 @@ object JdbcUtils extends Logging {
 val rddSchema = df.schema
 val driver: String = DriverRegistry.getDriverClassName(url)
 val getConnection: () => Connection = JDBCRDD.getConnector(driver, url, properties)
+val batchSize = properties.getProperty("batchsize", "1000").toInt
 df.foreachPartition { iterator =>
-  savePartition(getConnection, table, iterator, rddSchema, nullTypes)
+  savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
batchSize)
 }
   }
 


-