spark git commit: [MINOR] fix the comments in IndexShuffleBlockResolver

2015-08-18 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master dd0614fd6 -> c34e9ff0e


[MINOR] fix the comments in IndexShuffleBlockResolver

it might be a typo  introduced at the first moment or some leftover after some 
renaming..

the name of the method accessing the index file is called `getBlockData` now 
(not `getBlockLocation` as indicated in the comments)
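
For context, block i of the shuffle output file spans the bytes between offsets i and 
i+1 of that index file. A minimal sketch of reading it (illustrative only, not the 
actual Spark code; names are made up):

    import java.io.{DataInputStream, File, FileInputStream}

    // The index file written by writeIndexFile holds (numBlocks + 1) longs.
    def blockOffsetAndLength(indexFile: File, reduceId: Int): (Long, Long) = {
      val in = new DataInputStream(new FileInputStream(indexFile))
      try {
        in.skipBytes(reduceId * 8)   // each offset is an 8-byte long
        val start = in.readLong()    // where block `reduceId` begins
        val end = in.readLong()      // the next offset marks where it ends
        (start, end - start)
      } finally {
        in.close()
      }
    }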

Author: CodingCat zhunans...@gmail.com

Closes #8238 from CodingCat/minor_1.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c34e9ff0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c34e9ff0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c34e9ff0

Branch: refs/heads/master
Commit: c34e9ff0eac2032283b959fe63b47cc30f28d21c
Parents: dd0614f
Author: CodingCat zhunans...@gmail.com
Authored: Tue Aug 18 10:31:11 2015 +0100
Committer: Sean Owen so...@cloudera.com
Committed: Tue Aug 18 10:31:11 2015 +0100

--
 .../scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c34e9ff0/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala 
b/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
index fae6955..d0163d3 100644
--- 
a/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
+++ 
b/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
@@ -71,7 +71,7 @@ private[spark] class IndexShuffleBlockResolver(conf: 
SparkConf) extends ShuffleB
 
   /**
* Write an index file with the offsets of each block, plus a final offset 
at the end for the
-   * end of the output file. This will be used by getBlockLocation to figure 
out where each block
+   * end of the output file. This will be used by getBlockData to figure out 
where each block
* begins and ends.
* */
   def writeIndexFile(shuffleId: Int, mapId: Int, lengths: Array[Long]): Unit = 
{


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR] fix the comments in IndexShuffleBlockResolver

2015-08-18 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 40b89c38a -> 42a0b4890


[MINOR] fix the comments in IndexShuffleBlockResolver

it might be a typo  introduced at the first moment or some leftover after some 
renaming..

the name of the method accessing the index file is called `getBlockData` now 
(not `getBlockLocation` as indicated in the comments)

Author: CodingCat zhunans...@gmail.com

Closes #8238 from CodingCat/minor_1.

(cherry picked from commit c34e9ff0eac2032283b959fe63b47cc30f28d21c)
Signed-off-by: Sean Owen so...@cloudera.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/42a0b489
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/42a0b489
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/42a0b489

Branch: refs/heads/branch-1.5
Commit: 42a0b4890a2b91a9104541b10bab7652ceeb3bd2
Parents: 40b89c3
Author: CodingCat zhunans...@gmail.com
Authored: Tue Aug 18 10:31:11 2015 +0100
Committer: Sean Owen so...@cloudera.com
Committed: Tue Aug 18 10:31:25 2015 +0100

--
 .../scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/42a0b489/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
--
diff --git 
a/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala 
b/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
index fae6955..d0163d3 100644
--- 
a/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
+++ 
b/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala
@@ -71,7 +71,7 @@ private[spark] class IndexShuffleBlockResolver(conf: 
SparkConf) extends ShuffleB
 
   /**
* Write an index file with the offsets of each block, plus a final offset 
at the end for the
-   * end of the output file. This will be used by getBlockLocation to figure 
out where each block
+   * end of the output file. This will be used by getBlockData to figure out 
where each block
* begins and ends.
* */
   def writeIndexFile(shuffleId: Int, mapId: Int, lengths: Array[Long]): Unit = 
{


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary in ArrayData

2015-08-18 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 2803e8b2e -> e5fbe4f24


[SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary 
in ArrayData

The type for array of array in Java is slightly different than array of others.

cc cloud-fan
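
To illustrate the distinction (a hedged sketch of the allocation choice, not the exact 
generator code): when the element type is BinaryType, each element is itself a byte 
array, so the generated Java allocation needs the extra bracket pair.

    // Sketch: pick the Java `new` expression emitted by the code generator.
    def allocation(elementIsBinary: Boolean, unsafeType: String, numElements: String): String =
      if (elementIsBinary) {
        s"new byte[$numElements][]"      // array of byte arrays: byte[][]
      } else {
        s"new $unsafeType[$numElements]" // e.g. new UnsafeRow[n]
      }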

Author: Davies Liu dav...@databricks.com

Closes #8250 from davies/array_binary.

(cherry picked from commit 5af3838d2e59ed83766f85634e26918baa53819f)
Signed-off-by: Reynold Xin r...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e5fbe4f2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e5fbe4f2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e5fbe4f2

Branch: refs/heads/branch-1.5
Commit: e5fbe4f24aa805d78546e5a11122aa60dde83709
Parents: 2803e8b
Author: Davies Liu dav...@databricks.com
Authored: Mon Aug 17 23:27:55 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Mon Aug 17 23:28:02 2015 -0700

--
 .../codegen/GenerateUnsafeProjection.scala  | 12 ---
 .../codegen/GeneratedProjectionSuite.scala  | 21 +++-
 2 files changed, 29 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e5fbe4f2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
index b2fb913..b570fe8 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
@@ -224,7 +224,7 @@ object GenerateUnsafeProjection extends 
CodeGenerator[Seq[Expression], UnsafePro
 
 // go through the input array to calculate how many bytes we need.
 val calculateNumBytes = elementType match {
-  case _ if (ctx.isPrimitiveType(elementType)) =>
+  case _ if ctx.isPrimitiveType(elementType) =>
 // Should we do word align?
 val elementSize = elementType.defaultSize
 s
@@ -237,6 +237,7 @@ object GenerateUnsafeProjection extends 
CodeGenerator[Seq[Expression], UnsafePro
   case _ =>
 val writer = getWriter(elementType)
 val elementSize = s"$writer.getSize($elements[$index])"
+// TODO(davies): avoid the copy
 val unsafeType = elementType match {
   case _: StructType => "UnsafeRow"
   case _: ArrayType => "UnsafeArrayData"
@@ -249,8 +250,13 @@ object GenerateUnsafeProjection extends 
CodeGenerator[Seq[Expression], UnsafePro
   case _ =>
 }
 
+val newElements = if (elementType == BinaryType) {
+  s"new byte[$numElements][]"
+} else {
+  s"new $unsafeType[$numElements]"
+}
 s
-  final $unsafeType[] $elements = new $unsafeType[$numElements];
+  final $unsafeType[] $elements = $newElements;
   for (int $index = 0; $index < $numElements; $index++) {
 ${convertedElement.code}
 if (!${convertedElement.isNull}) {
@@ -262,7 +268,7 @@ object GenerateUnsafeProjection extends 
CodeGenerator[Seq[Expression], UnsafePro
 }
 
 val writeElement = elementType match {
-  case _ if (ctx.isPrimitiveType(elementType)) =>
+  case _ if ctx.isPrimitiveType(elementType) =>
 // Should we do word align?
 val elementSize = elementType.defaultSize
 s

http://git-wip-us.apache.org/repos/asf/spark/blob/e5fbe4f2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
index 8c7ee87..098944a 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
@@ -20,7 +20,7 @@ package org.apache.spark.sql.catalyst.expressions.codegen
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.types.{StringType, IntegerType, StructField, 
StructType}
+import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 
 /**
@@ 

spark git commit: [MINOR] Format the comment of `translate` at `functions.scala`

2015-08-18 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 35542504c -> 2803e8b2e


[MINOR] Format the comment of `translate` at `functions.scala`

Author: Yu ISHIKAWA yuu.ishik...@gmail.com

Closes #8265 from yu-iskw/minor-translate-comment.

(cherry picked from commit a0910315dae88b033e38a1de07f39ca21f6552ad)
Signed-off-by: Reynold Xin r...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2803e8b2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2803e8b2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2803e8b2

Branch: refs/heads/branch-1.5
Commit: 2803e8b2ee21d74ff9e29f2256d401d63f2b5fef
Parents: 3554250
Author: Yu ISHIKAWA yuu.ishik...@gmail.com
Authored: Mon Aug 17 23:27:11 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Mon Aug 17 23:27:19 2015 -0700

--
 .../scala/org/apache/spark/sql/functions.scala | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2803e8b2/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
index 79c5f59..435e631 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
@@ -1863,14 +1863,15 @@ object functions {
   def substring_index(str: Column, delim: String, count: Int): Column =
 SubstringIndex(str.expr, lit(delim).expr, lit(count).expr)
 
-  /* Translate any character in the src by a character in replaceString.
-  * The characters in replaceString is corresponding to the characters in 
matchingString.
-  * The translate will happen when any character in the string matching with 
the character
-  * in the matchingString.
-  *
-  * @group string_funcs
-  * @since 1.5.0
-  */
+  /**
+   * Translate any character in the src by a character in replaceString.
+   * The characters in replaceString is corresponding to the characters in 
matchingString.
+   * The translate will happen when any character in the string matching with 
the character
+   * in the matchingString.
+   *
+   * @group string_funcs
+   * @since 1.5.0
+   */
   def translate(src: Column, matchingString: String, replaceString: String): 
Column =
 StringTranslate(src.expr, lit(matchingString).expr, 
lit(replaceString).expr)
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [MINOR] Format the comment of `translate` at `functions.scala`

2015-08-18 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master e290029a3 -> a0910315d


[MINOR] Format the comment of `translate` at `functions.scala`

Author: Yu ISHIKAWA yuu.ishik...@gmail.com

Closes #8265 from yu-iskw/minor-translate-comment.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a0910315
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a0910315
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a0910315

Branch: refs/heads/master
Commit: a0910315dae88b033e38a1de07f39ca21f6552ad
Parents: e290029
Author: Yu ISHIKAWA yuu.ishik...@gmail.com
Authored: Mon Aug 17 23:27:11 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Mon Aug 17 23:27:11 2015 -0700

--
 .../scala/org/apache/spark/sql/functions.scala | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a0910315/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
index 79c5f59..435e631 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
@@ -1863,14 +1863,15 @@ object functions {
   def substring_index(str: Column, delim: String, count: Int): Column =
 SubstringIndex(str.expr, lit(delim).expr, lit(count).expr)
 
-  /* Translate any character in the src by a character in replaceString.
-  * The characters in replaceString is corresponding to the characters in 
matchingString.
-  * The translate will happen when any character in the string matching with 
the character
-  * in the matchingString.
-  *
-  * @group string_funcs
-  * @since 1.5.0
-  */
+  /**
+   * Translate any character in the src by a character in replaceString.
+   * The characters in replaceString is corresponding to the characters in 
matchingString.
+   * The translate will happen when any character in the string matching with 
the character
+   * in the matchingString.
+   *
+   * @group string_funcs
+   * @since 1.5.0
+   */
   def translate(src: Column, matchingString: String, replaceString: String): 
Column =
 StringTranslate(src.expr, lit(matchingString).expr, 
lit(replaceString).expr)
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary in ArrayData

2015-08-18 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master a0910315d -> 5af3838d2


[SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary 
in ArrayData

The type for array of array in Java is slightly different than array of others.

cc cloud-fan

Author: Davies Liu dav...@databricks.com

Closes #8250 from davies/array_binary.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5af3838d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5af3838d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5af3838d

Branch: refs/heads/master
Commit: 5af3838d2e59ed83766f85634e26918baa53819f
Parents: a091031
Author: Davies Liu dav...@databricks.com
Authored: Mon Aug 17 23:27:55 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Mon Aug 17 23:27:55 2015 -0700

--
 .../codegen/GenerateUnsafeProjection.scala  | 12 ---
 .../codegen/GeneratedProjectionSuite.scala  | 21 +++-
 2 files changed, 29 insertions(+), 4 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5af3838d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
index b2fb913..b570fe8 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
@@ -224,7 +224,7 @@ object GenerateUnsafeProjection extends 
CodeGenerator[Seq[Expression], UnsafePro
 
 // go through the input array to calculate how many bytes we need.
 val calculateNumBytes = elementType match {
-  case _ if (ctx.isPrimitiveType(elementType)) =>
+  case _ if ctx.isPrimitiveType(elementType) =>
 // Should we do word align?
 val elementSize = elementType.defaultSize
 s
@@ -237,6 +237,7 @@ object GenerateUnsafeProjection extends 
CodeGenerator[Seq[Expression], UnsafePro
   case _ =>
 val writer = getWriter(elementType)
 val elementSize = s"$writer.getSize($elements[$index])"
+// TODO(davies): avoid the copy
 val unsafeType = elementType match {
   case _: StructType => "UnsafeRow"
   case _: ArrayType => "UnsafeArrayData"
@@ -249,8 +250,13 @@ object GenerateUnsafeProjection extends 
CodeGenerator[Seq[Expression], UnsafePro
   case _ =>
 }
 
+val newElements = if (elementType == BinaryType) {
+  s"new byte[$numElements][]"
+} else {
+  s"new $unsafeType[$numElements]"
+}
 s
-  final $unsafeType[] $elements = new $unsafeType[$numElements];
+  final $unsafeType[] $elements = $newElements;
   for (int $index = 0; $index < $numElements; $index++) {
 ${convertedElement.code}
 if (!${convertedElement.isNull}) {
@@ -262,7 +268,7 @@ object GenerateUnsafeProjection extends 
CodeGenerator[Seq[Expression], UnsafePro
 }
 
 val writeElement = elementType match {
-  case _ if (ctx.isPrimitiveType(elementType)) =>
+  case _ if ctx.isPrimitiveType(elementType) =>
 // Should we do word align?
 val elementSize = elementType.defaultSize
 s

http://git-wip-us.apache.org/repos/asf/spark/blob/5af3838d/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
--
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
index 8c7ee87..098944a 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala
@@ -20,7 +20,7 @@ package org.apache.spark.sql.catalyst.expressions.codegen
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.types.{StringType, IntegerType, StructField, 
StructType}
+import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.UTF8String
 
 /**
@@ -79,4 +79,23 @@ class GeneratedProjectionSuite extends SparkFunSuite {
 val row2 = mutableProj(result)
 

spark git commit: [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public

2015-08-18 Thread meng
Repository: spark
Updated Branches:
  refs/heads/master 5af3838d2 -> dd0614fd6


[SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public

Fix the issue that ```layers``` and ```weights``` should be public variables of 
```MultilayerPerceptronClassificationModel```. Users can not get ```layers``` 
and ```weights``` from a ```MultilayerPerceptronClassificationModel``` 
currently.
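
With this change, a fitted model's topology and parameters can be read directly, e.g. 
(a sketch; the training DataFrame `train` is assumed to have the usual label/features 
columns):

    import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

    val trainer = new MultilayerPerceptronClassifier()
      .setLayers(Array(4, 5, 4, 3))   // input size, two hidden layers, output size
      .setMaxIter(100)
    val model = trainer.fit(train)
    println(model.layers.mkString(","))  // now accessible: e.g. 4,5,4,3
    println(model.weights.size)          // total number of connection weights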

Author: Yanbo Liang yblia...@gmail.com

Closes #8263 from yanboliang/mlp-public.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dd0614fd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dd0614fd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dd0614fd

Branch: refs/heads/master
Commit: dd0614fd618ad28cb77aecfbd49bb319b98fdba0
Parents: 5af3838
Author: Yanbo Liang yblia...@gmail.com
Authored: Mon Aug 17 23:57:02 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Mon Aug 17 23:57:02 2015 -0700

--
 .../spark/ml/classification/MultilayerPerceptronClassifier.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/dd0614fd/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
index c154561..ccca4ec 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
@@ -172,8 +172,8 @@ class MultilayerPerceptronClassifier(override val uid: 
String)
 @Experimental
 class MultilayerPerceptronClassificationModel private[ml] (
 override val uid: String,
-layers: Array[Int],
-weights: Vector)
+val layers: Array[Int],
+val weights: Vector)
   extends PredictionModel[Vector, MultilayerPerceptronClassificationModel]
   with Serializable {
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public

2015-08-18 Thread meng
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 e5fbe4f24 -> 40b89c38a


[SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public

Fix the issue that ```layers``` and ```weights``` should be public variables of 
```MultilayerPerceptronClassificationModel```. Users can not get ```layers``` 
and ```weights``` from a ```MultilayerPerceptronClassificationModel``` 
currently.

Author: Yanbo Liang yblia...@gmail.com

Closes #8263 from yanboliang/mlp-public.

(cherry picked from commit dd0614fd618ad28cb77aecfbd49bb319b98fdba0)
Signed-off-by: Xiangrui Meng m...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/40b89c38
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/40b89c38
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/40b89c38

Branch: refs/heads/branch-1.5
Commit: 40b89c38ada5edfdd1478dc8f3c983ebcbcc56d5
Parents: e5fbe4f
Author: Yanbo Liang yblia...@gmail.com
Authored: Mon Aug 17 23:57:02 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Mon Aug 17 23:57:14 2015 -0700

--
 .../spark/ml/classification/MultilayerPerceptronClassifier.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/40b89c38/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
 
b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
index c154561..ccca4ec 100644
--- 
a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
@@ -172,8 +172,8 @@ class MultilayerPerceptronClassifier(override val uid: 
String)
 @Experimental
 class MultilayerPerceptronClassificationModel private[ml] (
 override val uid: String,
-layers: Array[Int],
-weights: Vector)
+val layers: Array[Int],
+val weights: Vector)
   extends PredictionModel[Vector, MultilayerPerceptronClassificationModel]
   with Serializable {
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4J

2015-08-18 Thread lian
Repository: spark
Updated Branches:
  refs/heads/master c34e9ff0e -> 5723d26d7


[SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4J

Parquet hard coded a JUL logger which always writes to stdout. This PR 
redirects it via SLF4j JUL bridge handler, so that we can control Parquet logs 
via `log4j.properties`.

This solution is inspired by 
https://github.com/Parquet/parquet-mr/issues/390#issuecomment-46064909.
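
For reference, the standard JUL-to-SLF4J redirection pattern looks roughly like this 
(a sketch of the slf4j bridge usage the description refers to, not necessarily the 
exact code added here):

    import org.slf4j.bridge.SLF4JBridgeHandler

    // Drop JUL's default handlers from the root logger and install the SLF4J bridge,
    // so java.util.logging output from Parquet flows through SLF4J (and thus log4j).
    def redirectJulToSlf4j(): Unit = {
      SLF4JBridgeHandler.removeHandlersForRootLogger()
      SLF4JBridgeHandler.install()
    }

After that, entries such as log4j.logger.org.apache.parquet=ERROR in log4j.properties 
take effect for Parquet's loggers.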

Author: Cheng Lian l...@databricks.com

Closes #8196 from liancheng/spark-8118/redirect-parquet-jul.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5723d26d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5723d26d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5723d26d

Branch: refs/heads/master
Commit: 5723d26d7e677b89383de3fcf2c9a821b68a65b7
Parents: c34e9ff
Author: Cheng Lian l...@databricks.com
Authored: Tue Aug 18 20:15:33 2015 +0800
Committer: Cheng Lian l...@databricks.com
Committed: Tue Aug 18 20:15:33 2015 +0800

--
 conf/log4j.properties.template  |  2 +
 .../datasources/parquet/ParquetRelation.scala   | 77 ++--
 .../parquet/ParquetTableSupport.scala   |  1 -
 .../parquet/ParquetTypesConverter.scala |  3 -
 sql/hive/src/test/resources/log4j.properties|  7 +-
 5 files changed, 47 insertions(+), 43 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/5723d26d/conf/log4j.properties.template
--
diff --git a/conf/log4j.properties.template b/conf/log4j.properties.template
index 27006e4..74c5cea 100644
--- a/conf/log4j.properties.template
+++ b/conf/log4j.properties.template
@@ -10,6 +10,8 @@ log4j.logger.org.spark-project.jetty=WARN
 log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
 log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
 log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
+log4j.logger.org.apache.parquet=ERROR
+log4j.logger.parquet=ERROR
 
 # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent 
UDFs in SparkSQL with Hive support
 log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL

http://git-wip-us.apache.org/repos/asf/spark/blob/5723d26d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
index 52fac18..68169d4 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
@@ -18,7 +18,7 @@
 package org.apache.spark.sql.execution.datasources.parquet
 
 import java.net.URI
-import java.util.logging.{Level, Logger => JLogger}
+import java.util.logging.{Logger => JLogger}
 import java.util.{List => JList}
 
 import scala.collection.JavaConversions._
@@ -31,22 +31,22 @@ import org.apache.hadoop.io.Writable
 import org.apache.hadoop.mapreduce._
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
 import org.apache.parquet.filter2.predicate.FilterApi
+import org.apache.parquet.hadoop._
 import org.apache.parquet.hadoop.metadata.CompressionCodecName
 import org.apache.parquet.hadoop.util.ContextUtil
-import org.apache.parquet.hadoop.{ParquetOutputCommitter, ParquetRecordReader, 
_}
 import org.apache.parquet.schema.MessageType
-import org.apache.parquet.{Log => ParquetLog}
+import org.apache.parquet.{Log => ApacheParquetLog}
+import org.slf4j.bridge.SLF4JBridgeHandler
 
-import org.apache.spark.{Logging, Partition => SparkPartition, SparkException}
 import org.apache.spark.broadcast.Broadcast
-import org.apache.spark.rdd.{SqlNewHadoopPartition, SqlNewHadoopRDD, RDD}
-import org.apache.spark.rdd.RDD._
+import org.apache.spark.rdd.{RDD, SqlNewHadoopPartition, SqlNewHadoopRDD}
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.execution.datasources.PartitionSpec
 import org.apache.spark.sql.sources._
 import org.apache.spark.sql.types.{DataType, StructType}
 import org.apache.spark.util.{SerializableConfiguration, Utils}
+import org.apache.spark.{Logging, Partition => SparkPartition, SparkException}
 
 
 private[sql] class DefaultSource extends HadoopFsRelationProvider with 
DataSourceRegister {
@@ -759,38 +759,39 @@ private[sql] object ParquetRelation extends Logging {
 }.toOption
   }
 
-  def enableLogForwarding() {
-// Note: the org.apache.parquet.Log class has a static initializer that
- 

spark git commit: [SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4J

2015-08-18 Thread lian
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 42a0b4890 -> a512250cd


[SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4J

Parquet hard coded a JUL logger which always writes to stdout. This PR 
redirects it via SLF4j JUL bridge handler, so that we can control Parquet logs 
via `log4j.properties`.

This solution is inspired by 
https://github.com/Parquet/parquet-mr/issues/390#issuecomment-46064909.

Author: Cheng Lian l...@databricks.com

Closes #8196 from liancheng/spark-8118/redirect-parquet-jul.

(cherry picked from commit 5723d26d7e677b89383de3fcf2c9a821b68a65b7)
Signed-off-by: Cheng Lian l...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a512250c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a512250c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a512250c

Branch: refs/heads/branch-1.5
Commit: a512250cd19288a7ad8fb600d06544f8728b2dd1
Parents: 42a0b48
Author: Cheng Lian l...@databricks.com
Authored: Tue Aug 18 20:15:33 2015 +0800
Committer: Cheng Lian l...@databricks.com
Committed: Tue Aug 18 20:16:13 2015 +0800

--
 conf/log4j.properties.template  |  2 +
 .../datasources/parquet/ParquetRelation.scala   | 77 ++--
 .../parquet/ParquetTableSupport.scala   |  1 -
 .../parquet/ParquetTypesConverter.scala |  3 -
 sql/hive/src/test/resources/log4j.properties|  7 +-
 5 files changed, 47 insertions(+), 43 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a512250c/conf/log4j.properties.template
--
diff --git a/conf/log4j.properties.template b/conf/log4j.properties.template
index 27006e4..74c5cea 100644
--- a/conf/log4j.properties.template
+++ b/conf/log4j.properties.template
@@ -10,6 +10,8 @@ log4j.logger.org.spark-project.jetty=WARN
 log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
 log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
 log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
+log4j.logger.org.apache.parquet=ERROR
+log4j.logger.parquet=ERROR
 
 # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent 
UDFs in SparkSQL with Hive support
 log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL

http://git-wip-us.apache.org/repos/asf/spark/blob/a512250c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
index 52fac18..68169d4 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala
@@ -18,7 +18,7 @@
 package org.apache.spark.sql.execution.datasources.parquet
 
 import java.net.URI
-import java.util.logging.{Level, Logger => JLogger}
+import java.util.logging.{Logger => JLogger}
 import java.util.{List => JList}
 
 import scala.collection.JavaConversions._
@@ -31,22 +31,22 @@ import org.apache.hadoop.io.Writable
 import org.apache.hadoop.mapreduce._
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
 import org.apache.parquet.filter2.predicate.FilterApi
+import org.apache.parquet.hadoop._
 import org.apache.parquet.hadoop.metadata.CompressionCodecName
 import org.apache.parquet.hadoop.util.ContextUtil
-import org.apache.parquet.hadoop.{ParquetOutputCommitter, ParquetRecordReader, 
_}
 import org.apache.parquet.schema.MessageType
-import org.apache.parquet.{Log => ParquetLog}
+import org.apache.parquet.{Log => ApacheParquetLog}
+import org.slf4j.bridge.SLF4JBridgeHandler
 
-import org.apache.spark.{Logging, Partition => SparkPartition, SparkException}
 import org.apache.spark.broadcast.Broadcast
-import org.apache.spark.rdd.{SqlNewHadoopPartition, SqlNewHadoopRDD, RDD}
-import org.apache.spark.rdd.RDD._
+import org.apache.spark.rdd.{RDD, SqlNewHadoopPartition, SqlNewHadoopRDD}
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.execution.datasources.PartitionSpec
 import org.apache.spark.sql.sources._
 import org.apache.spark.sql.types.{DataType, StructType}
 import org.apache.spark.util.{SerializableConfiguration, Utils}
+import org.apache.spark.{Logging, Partition => SparkPartition, SparkException}
 
 
 private[sql] class DefaultSource extends HadoopFsRelationProvider with 
DataSourceRegister {
@@ -759,38 +759,39 @@ private[sql] object ParquetRelation extends Logging {
 

spark git commit: [SPARK-7736] [CORE] Fix a race introduced in PythonRunner.

2015-08-18 Thread vanzin
Repository: spark
Updated Branches:
  refs/heads/master 354f4582b -> c1840a862


[SPARK-7736] [CORE] Fix a race introduced in PythonRunner.

The fix for SPARK-7736 introduced a race where a port value of -1
could be passed down to the pyspark process, causing it to fail to
connect back to the JVM. This change adds code to fix that race.
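
The idea of the fix, as a minimal sketch using py4j's GatewayServer API (port 0 lets 
py4j pick a free port; the real code lives in PythonRunner):

    import py4j.GatewayServer

    val gatewayServer = new GatewayServer(null, 0)
    val thread = new Thread(new Runnable {
      override def run(): Unit = gatewayServer.start()
    })
    thread.setDaemon(true)
    thread.start()
    // Join the init thread: start() returns only after the socket is bound, so the
    // port read below can no longer be the unbound placeholder -1.
    thread.join()
    val boundPort = gatewayServer.getListeningPort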

Author: Marcelo Vanzin van...@cloudera.com

Closes #8258 from vanzin/SPARK-7736.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c1840a86
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c1840a86
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c1840a86

Branch: refs/heads/master
Commit: c1840a862eb548bc4306e53ee7e9f26986b31832
Parents: 354f458
Author: Marcelo Vanzin van...@cloudera.com
Authored: Tue Aug 18 11:36:36 2015 -0700
Committer: Marcelo Vanzin van...@cloudera.com
Committed: Tue Aug 18 11:36:36 2015 -0700

--
 .../main/scala/org/apache/spark/deploy/PythonRunner.scala| 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c1840a86/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala 
b/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala
index 4277ac2..23d01e9 100644
--- a/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala
@@ -52,10 +52,16 @@ object PythonRunner {
 gatewayServer.start()
   }
 })
-thread.setName("py4j-gateway")
+thread.setName("py4j-gateway-init")
 thread.setDaemon(true)
 thread.start()
 
+// Wait until the gateway server has started, so that we know which port 
is it bound to.
+// `gatewayServer.start()` will start a new thread and run the server code 
there, after
+// initializing the socket, so the thread started above will end as soon 
as the server is
+// ready to serve connections.
+thread.join()
+
 // Build up a PYTHONPATH that includes the Spark assembly JAR (where this 
class is), the
 // python directories in SPARK_HOME (if set), and any files in the pyFiles 
argument
 val pathElements = new ArrayBuffer[String]


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel

2015-08-18 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 20a760a00 -> b86378cf2


[SPARK-9028] [ML] Add CountVectorizer as an estimator to generate 
CountVectorizerModel

jira: https://issues.apache.org/jira/browse/SPARK-9028

Add an estimator for CountVectorizerModel. The estimator will extract a 
vocabulary from document collections according to the term frequency.

I changed the meaning of minCount as a filter across the corpus. This aligns 
with Word2Vec and the similar parameter in SKlearn.
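
A brief usage sketch of the new estimator (assuming a SQLContext named `sqlContext`; 
the data and column names are made up):

    import org.apache.spark.ml.feature.CountVectorizer

    val df = sqlContext.createDataFrame(Seq(
      (0, Array("a", "b", "c")),
      (1, Array("a", "b", "b", "c", "a"))
    )).toDF("id", "words")

    // Learn a vocabulary of at most 3 terms, keeping only terms that
    // appear in at least 2 documents (minDF as a corpus-wide filter).
    val cvModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .setMinDF(2)
      .fit(df)

    cvModel.transform(df).select("features").show()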

Author: Yuhao Yang hhb...@gmail.com
Author: Joseph K. Bradley jos...@databricks.com

Closes #7388 from hhbyyh/cvEstimator.

(cherry picked from commit 354f4582b637fa25d3892ec2b12869db50ed83c9)
Signed-off-by: Joseph K. Bradley jos...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b86378cf
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b86378cf
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b86378cf

Branch: refs/heads/branch-1.5
Commit: b86378cf29f8fdb70e41b2f04d831b8a15c1c859
Parents: 20a760a
Author: Yuhao Yang hhb...@gmail.com
Authored: Tue Aug 18 11:00:09 2015 -0700
Committer: Joseph K. Bradley jos...@databricks.com
Committed: Tue Aug 18 11:00:22 2015 -0700

--
 .../spark/ml/feature/CountVectorizer.scala  | 235 +++
 .../spark/ml/feature/CountVectorizerModel.scala |  82 ---
 .../spark/ml/feature/CountVectorizerSuite.scala | 167 +
 .../spark/ml/feature/CountVectorizorSuite.scala |  73 --
 4 files changed, 402 insertions(+), 155 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b86378cf/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala
new file mode 100644
index 000..49028e4
--- /dev/null
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.util.collection.OpenHashMap
+
+/**
+ * Params for [[CountVectorizer]] and [[CountVectorizerModel]].
+ */
+private[feature] trait CountVectorizerParams extends Params with HasInputCol 
with HasOutputCol {
+
+  /**
+   * Max size of the vocabulary.
+   * CountVectorizer will build a vocabulary that only considers the top
+   * vocabSize terms ordered by term frequency across the corpus.
+   *
+   * Default: 2^18^
+   * @group param
+   */
+  val vocabSize: IntParam =
+new IntParam(this, "vocabSize", "max size of the vocabulary", 
ParamValidators.gt(0))
+
+  /** @group getParam */
+  def getVocabSize: Int = $(vocabSize)
+
+  /**
+   * Specifies the minimum number of different documents a term must appear in 
to be included
+   * in the vocabulary.
+   * If this is an integer >= 1, this specifies the number of documents the 
term must appear in;
+   * if this is a double in [0,1), then this specifies the fraction of 
documents.
+   *
+   * Default: 1
+   * @group param
+   */
+  val minDF: DoubleParam = new DoubleParam(this, "minDF", "Specifies the 
minimum number of" +
+ " different documents a term must appear in to be included in the 
vocabulary." +
+ " If this is an integer >= 1, this specifies the number of documents the 
term must" +
+ " appear in; if this is a double in [0,1), then this specifies the 
fraction of 

spark git commit: [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel

2015-08-18 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 1968276af -> 354f4582b


[SPARK-9028] [ML] Add CountVectorizer as an estimator to generate 
CountVectorizerModel

jira: https://issues.apache.org/jira/browse/SPARK-9028

Add an estimator for CountVectorizerModel. The estimator will extract a 
vocabulary from document collections according to the term frequency.

I changed the meaning of minCount as a filter across the corpus. This aligns 
with Word2Vec and the similar parameter in SKlearn.

Author: Yuhao Yang hhb...@gmail.com
Author: Joseph K. Bradley jos...@databricks.com

Closes #7388 from hhbyyh/cvEstimator.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/354f4582
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/354f4582
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/354f4582

Branch: refs/heads/master
Commit: 354f4582b637fa25d3892ec2b12869db50ed83c9
Parents: 1968276
Author: Yuhao Yang hhb...@gmail.com
Authored: Tue Aug 18 11:00:09 2015 -0700
Committer: Joseph K. Bradley jos...@databricks.com
Committed: Tue Aug 18 11:00:09 2015 -0700

--
 .../spark/ml/feature/CountVectorizer.scala  | 235 +++
 .../spark/ml/feature/CountVectorizerModel.scala |  82 ---
 .../spark/ml/feature/CountVectorizerSuite.scala | 167 +
 .../spark/ml/feature/CountVectorizorSuite.scala |  73 --
 4 files changed, 402 insertions(+), 155 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/354f4582/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala
new file mode 100644
index 000..49028e4
--- /dev/null
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala
@@ -0,0 +1,235 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.DataFrame
+import org.apache.spark.util.collection.OpenHashMap
+
+/**
+ * Params for [[CountVectorizer]] and [[CountVectorizerModel]].
+ */
+private[feature] trait CountVectorizerParams extends Params with HasInputCol 
with HasOutputCol {
+
+  /**
+   * Max size of the vocabulary.
+   * CountVectorizer will build a vocabulary that only considers the top
+   * vocabSize terms ordered by term frequency across the corpus.
+   *
+   * Default: 2^18^
+   * @group param
+   */
+  val vocabSize: IntParam =
+new IntParam(this, "vocabSize", "max size of the vocabulary", 
ParamValidators.gt(0))
+
+  /** @group getParam */
+  def getVocabSize: Int = $(vocabSize)
+
+  /**
+   * Specifies the minimum number of different documents a term must appear in 
to be included
+   * in the vocabulary.
+   * If this is an integer >= 1, this specifies the number of documents the 
term must appear in;
+   * if this is a double in [0,1), then this specifies the fraction of 
documents.
+   *
+   * Default: 1
+   * @group param
+   */
+  val minDF: DoubleParam = new DoubleParam(this, "minDF", "Specifies the 
minimum number of" +
+ " different documents a term must appear in to be included in the 
vocabulary." +
+ " If this is an integer >= 1, this specifies the number of documents the 
term must" +
+ " appear in; if this is a double in [0,1), then this specifies the 
fraction of documents.",
+ParamValidators.gtEq(0.0))
+
+  /** @group getParam */
+  def getMinDF: Double = $(minDF)
+
+  /** Validates and transforms 

spark git commit: [SPARK-9900] [MLLIB] User guide for Association Rules

2015-08-18 Thread meng
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 b86378cf2 -> 7ff0e5d2f


[SPARK-9900] [MLLIB] User guide for Association Rules

Updates FPM user guide to include Association Rules.

Author: Feynman Liang fli...@databricks.com

Closes #8207 from feynmanliang/SPARK-9900-arules.

(cherry picked from commit f5ea3912900ccdf23e2eb419a342bfe3c0c0b61b)
Signed-off-by: Xiangrui Meng m...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ff0e5d2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ff0e5d2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ff0e5d2

Branch: refs/heads/branch-1.5
Commit: 7ff0e5d2fe07d4a9518ade26b09bcc32f418ca1b
Parents: b86378c
Author: Feynman Liang fli...@databricks.com
Authored: Tue Aug 18 12:53:57 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Tue Aug 18 12:54:05 2015 -0700

--
 docs/mllib-frequent-pattern-mining.md   | 130 +--
 docs/mllib-guide.md |   1 +
 .../mllib/fpm/JavaAssociationRulesSuite.java|   2 +-
 3 files changed, 118 insertions(+), 15 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/7ff0e5d2/docs/mllib-frequent-pattern-mining.md
--
diff --git a/docs/mllib-frequent-pattern-mining.md 
b/docs/mllib-frequent-pattern-mining.md
index 8ea4389..6c06550 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -39,18 +39,30 @@ MLlib's FP-growth implementation takes the following 
(hyper-)parameters:
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) 
implements the
-FP-growth algorithm.
-It take a `JavaRDD` of transactions, where each transaction is an `Iterable` 
of items of a generic type.
+[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth)
+implements the FP-growth algorithm.  It take an `RDD` of transactions,
+where each transaction is an `Iterable` of items of a generic type.
 Calling `FPGrowth.run` with transactions returns an
 
[`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel)
-that stores the frequent itemsets with their frequencies.
+that stores the frequent itemsets with their frequencies.  The following
+example illustrates how to mine frequent itemsets and association rules
+(see [Association
+Rules](mllib-frequent-pattern-mining.html#association-rules) for
+details) from `transactions`.
+
 
 {% highlight scala %}
 import org.apache.spark.rdd.RDD
 import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}
 
-val transactions: RDD[Array[String]] = ...
+val transactions: RDD[Array[String]] = sc.parallelize(Seq(
+  "r z h k p",
+  "z y x w v u t s",
+  "s x o n r",
+  "x z y m t s q e",
+  "z",
+  "x z y r q t p")
+  .map(_.split(" ")))
 
 val fpg = new FPGrowth()
   .setMinSupport(0.2)
@@ -60,29 +72,48 @@ val model = fpg.run(transactions)
 model.freqItemsets.collect().foreach { itemset =>
   println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
 }
+
+val minConfidence = 0.8
+model.generateAssociationRules(minConfidence).collect().foreach { rule =>
+  println(
+rule.antecedent.mkString("[", ",", "]")
+  + " => " + rule.consequent .mkString("[", ",", "]")
+  + ", " + rule.confidence)
+}
 {% endhighlight %}
 
 </div>
 
 <div data-lang="java" markdown="1">
 
-[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
-FP-growth algorithm.
-It take an `RDD` of transactions, where each transaction is an `Array` of 
items of a generic type.
-Calling `FPGrowth.run` with transactions returns an
+[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html)
+implements the FP-growth algorithm.  It take a `JavaRDD` of
+transactions, where each transaction is an `Array` of items of a generic
+type.  Calling `FPGrowth.run` with transactions returns an
 [`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html)
-that stores the frequent itemsets with their frequencies.
+that stores the frequent itemsets with their frequencies.  The following
+example illustrates how to mine frequent itemsets and association rules
+(see [Association
+Rules](mllib-frequent-pattern-mining.html#association-rules) for
+details) from `transactions`.
 
 {% highlight java %}
+import java.util.Arrays;
 import java.util.List;
 
-import com.google.common.base.Joiner;
-
 import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.fpm.AssociationRules;
 import org.apache.spark.mllib.fpm.FPGrowth;
 import org.apache.spark.mllib.fpm.FPGrowthModel;
 
-JavaRDD<List<String>> transactions = ...
+JavaRDD<List<String>> transactions = sc.parallelize(Arrays.asList(
+  Arrays.asList("r z h 

spark git commit: [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import

2015-08-18 Thread meng
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 ec7079f9c -> 9bd2e6f7c


[SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import

See https://issues.apache.org/jira/browse/SPARK-10085

Author: Piotr Migdal pmig...@gmail.com

Closes #8284 from stared/spark-10085.

(cherry picked from commit 8bae9015b7e7b4528ca2bc5180771cb95d2aac13)
Signed-off-by: Xiangrui Meng m...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9bd2e6f7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9bd2e6f7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9bd2e6f7

Branch: refs/heads/branch-1.5
Commit: 9bd2e6f7cbff1835f9abefe26dbe445eaa5b004b
Parents: ec7079f
Author: Piotr Migdal pmig...@gmail.com
Authored: Tue Aug 18 12:59:28 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Tue Aug 18 12:59:36 2015 -0700

--
 docs/mllib-linear-methods.md | 2 --
 1 file changed, 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9bd2e6f7/docs/mllib-linear-methods.md
--
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 07655ba..e9b2d27 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -504,7 +504,6 @@ will in the future.
 {% highlight python %}
 from pyspark.mllib.classification import LogisticRegressionWithLBFGS, 
LogisticRegressionModel
 from pyspark.mllib.regression import LabeledPoint
-from numpy import array
 
 # Load and parse the data
 def parsePoint(line):
@@ -676,7 +675,6 @@ Note that the Python API does not yet support model 
save/load but will in the fu
 
 {% highlight python %}
 from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, 
LinearRegressionModel
-from numpy import array
 
 # Load and parse the data
 def parsePoint(line):


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import

2015-08-18 Thread meng
Repository: spark
Updated Branches:
  refs/heads/master 747c2ba80 -> 8bae9015b


[SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import

See https://issues.apache.org/jira/browse/SPARK-10085

Author: Piotr Migdal pmig...@gmail.com

Closes #8284 from stared/spark-10085.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8bae9015
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8bae9015
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8bae9015

Branch: refs/heads/master
Commit: 8bae9015b7e7b4528ca2bc5180771cb95d2aac13
Parents: 747c2ba
Author: Piotr Migdal pmig...@gmail.com
Authored: Tue Aug 18 12:59:28 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Tue Aug 18 12:59:28 2015 -0700

--
 docs/mllib-linear-methods.md | 2 --
 1 file changed, 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8bae9015/docs/mllib-linear-methods.md
--
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 07655ba..e9b2d27 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -504,7 +504,6 @@ will in the future.
 {% highlight python %}
 from pyspark.mllib.classification import LogisticRegressionWithLBFGS, 
LogisticRegressionModel
 from pyspark.mllib.regression import LabeledPoint
-from numpy import array
 
 # Load and parse the data
 def parsePoint(line):
@@ -676,7 +675,6 @@ Note that the Python API does not yet support model 
save/load but will in the fu
 
 {% highlight python %}
 from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, 
LinearRegressionModel
-from numpy import array
 
 # Load and parse the data
 def parsePoint(line):


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide

2015-08-18 Thread meng
Repository: spark
Updated Branches:
  refs/heads/master f5ea39129 -> f4fa61eff


[SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression 
user guide

Add Python examples for mllib IsotonicRegression user guide

Author: Yanbo Liang yblia...@gmail.com

Closes #8225 from yanboliang/spark-10029.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f4fa61ef
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f4fa61ef
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f4fa61ef

Branch: refs/heads/master
Commit: f4fa61effe34dae2f0eab0bef57b2dee220cf92f
Parents: f5ea391
Author: Yanbo Liang yblia...@gmail.com
Authored: Tue Aug 18 12:55:36 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Tue Aug 18 12:55:36 2015 -0700

--
 docs/mllib-isotonic-regression.md | 35 ++
 1 file changed, 35 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f4fa61ef/docs/mllib-isotonic-regression.md
--
diff --git a/docs/mllib-isotonic-regression.md 
b/docs/mllib-isotonic-regression.md
index 5732bc4..6aa881f 100644
--- a/docs/mllib-isotonic-regression.md
+++ b/docs/mllib-isotonic-regression.md
 model.save(sc.sc(), "myModelPath");
 IsotonicRegressionModel sameModel = IsotonicRegressionModel.load(sc.sc(), 
"myModelPath");
 {% endhighlight %}
 </div>
+
+<div data-lang="python" markdown="1">
+Data are read from a file where each line has a format label,feature
+i.e. 4710.28,500.00. The data are split to training and testing set.
+Model is created using the training set and a mean squared error is calculated 
from the predicted
+labels and real labels in the test set.
+
+{% highlight python %}
+import math
+from pyspark.mllib.regression import IsotonicRegression, 
IsotonicRegressionModel
+
+data = sc.textFile("data/mllib/sample_isotonic_regression_data.txt")
+
+# Create label, feature, weight tuples from input data with weight set to 
default value 1.0.
+parsedData = data.map(lambda line: tuple([float(x) for x in line.split(',')]) 
+ (1.0,))
+
+# Split data into training (60%) and test (40%) sets.
+training, test = parsedData.randomSplit([0.6, 0.4], 11)
+
+# Create isotonic regression model from training data.
+# Isotonic parameter defaults to true so it is only shown for demonstration
+model = IsotonicRegression.train(training)
+
+# Create tuples of predicted and real labels.
+predictionAndLabel = test.map(lambda p: (model.predict(p[1]), p[0]))
+
+# Calculate mean squared error between predicted and real labels.
+meanSquaredError = predictionAndLabel.map(lambda pl: math.pow((pl[0] - pl[1]), 
2)).mean()
+print("Mean Squared Error = " + str(meanSquaredError))
+
+# Save and load model
+model.save(sc, "myModelPath")
+sameModel = IsotonicRegressionModel.load(sc, "myModelPath")
+{% endhighlight %}
+</div>
 </div>


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation

2015-08-18 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/master bf1d6614d - 80cb25b22


[SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation

It turns out that inner classes of inner objects are referenced directly by compiled client code, so moving them breaks binary compatibility.
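
For readers following along, here is a minimal sketch of why the move matters (the names below are illustrative only, not taken from the patch): the compiled name of an implicit class includes every enclosing class and object, and client bytecode links against that exact name, so relocating the class breaks already-compiled callers even though their source would still compile.

// Illustrative sketch only; Outer/inner/Helper are hypothetical names.
class Outer {
  object inner {
    // Compiles to a class named roughly Outer$inner$Helper.
    implicit class Helper(val s: String) {
      def shout: String = s.toUpperCase
    }
  }
}

object Client extends App {
  val o = new Outer
  import o.inner._
  // This call site is compiled against Outer$inner$Helper; if Helper is later
  // moved to a different enclosing type, previously compiled clients fail to
  // link at runtime even though recompiling from source would succeed.
  println("spark".shout)
}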

Author: Michael Armbrust mich...@databricks.com

Closes #8281 from marmbrus/binaryCompat.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80cb25b2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80cb25b2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80cb25b2

Branch: refs/heads/master
Commit: 80cb25b228e821a80256546a2f03f73a45cf7645
Parents: bf1d661
Author: Michael Armbrust mich...@databricks.com
Authored: Tue Aug 18 13:50:51 2015 -0700
Committer: Michael Armbrust mich...@databricks.com
Committed: Tue Aug 18 13:50:51 2015 -0700

--
 .../main/scala/org/apache/spark/sql/SQLContext.scala| 11 +++
 .../main/scala/org/apache/spark/sql/SQLImplicits.scala  | 10 --
 .../org/apache/spark/sql/test/SharedSQLContext.scala| 12 +++-
 3 files changed, 22 insertions(+), 11 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/80cb25b2/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
index 53de10d..58fe75b 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
@@ -334,6 +334,17 @@ class SQLContext(@transient val sparkContext: SparkContext)
   @Experimental
   object implicits extends SQLImplicits with Serializable {
 protected override def _sqlContext: SQLContext = self
+
+/**
+ * Converts $"col name" into an [[Column]].
+ * @since 1.3.0
+ */
+// This must live here to preserve binary compatibility with Spark < 1.5.
+implicit class StringToColumn(val sc: StringContext) {
+  def $(args: Any*): ColumnName = {
+new ColumnName(sc.s(args: _*))
+  }
+}
   }
   // scalastyle:on
 

http://git-wip-us.apache.org/repos/asf/spark/blob/80cb25b2/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
index 5f82372..47b6f80 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
@@ -34,16 +34,6 @@ private[sql] abstract class SQLImplicits {
   protected def _sqlContext: SQLContext
 
   /**
-   * Converts $"col name" into an [[Column]].
-   * @since 1.3.0
-   */
-  implicit class StringToColumn(val sc: StringContext) {
-def $(args: Any*): ColumnName = {
-  new ColumnName(sc.s(args: _*))
-}
-  }
-
-  /**
* An implicit conversion that turns a Scala `Symbol` into a [[Column]].
* @since 1.3.0
*/

http://git-wip-us.apache.org/repos/asf/spark/blob/80cb25b2/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala
index 3cfd822..8a061b6 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala
@@ -17,7 +17,7 @@
 
 package org.apache.spark.sql.test
 
-import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.{ColumnName, SQLContext}
 
 
 /**
@@ -65,4 +65,14 @@ private[sql] trait SharedSQLContext extends SQLTestUtils {
 }
   }
 
+  /**
+   * Converts $"col name" into an [[Column]].
+   * @since 1.3.0
+   */
+  // This must be duplicated here to preserve binary compatibility with Spark 
< 1.5.
+  implicit class StringToColumn(val sc: StringContext) {
+def $(args: Any*): ColumnName = {
+  new ColumnName(sc.s(args: _*))
+}
+  }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation

2015-08-18 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 2bccd918f - 80a6fb521


[SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation

It turns out that inner classes of inner objects are referenced directly by compiled client code, so moving them breaks binary compatibility.

Author: Michael Armbrust mich...@databricks.com

Closes #8281 from marmbrus/binaryCompat.

(cherry picked from commit 80cb25b228e821a80256546a2f03f73a45cf7645)
Signed-off-by: Michael Armbrust mich...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80a6fb52
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80a6fb52
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80a6fb52

Branch: refs/heads/branch-1.5
Commit: 80a6fb521a2eb9eadc3529f81a74585f7aaf5e26
Parents: 2bccd91
Author: Michael Armbrust mich...@databricks.com
Authored: Tue Aug 18 13:50:51 2015 -0700
Committer: Michael Armbrust mich...@databricks.com
Committed: Tue Aug 18 13:51:03 2015 -0700

--
 .../main/scala/org/apache/spark/sql/SQLContext.scala| 11 +++
 .../main/scala/org/apache/spark/sql/SQLImplicits.scala  | 10 --
 .../org/apache/spark/sql/test/SharedSQLContext.scala| 12 +++-
 3 files changed, 22 insertions(+), 11 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/80a6fb52/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
index 53de10d..58fe75b 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala
@@ -334,6 +334,17 @@ class SQLContext(@transient val sparkContext: SparkContext)
   @Experimental
   object implicits extends SQLImplicits with Serializable {
 protected override def _sqlContext: SQLContext = self
+
+/**
+ * Converts $"col name" into an [[Column]].
+ * @since 1.3.0
+ */
+// This must live here to preserve binary compatibility with Spark < 1.5.
+implicit class StringToColumn(val sc: StringContext) {
+  def $(args: Any*): ColumnName = {
+new ColumnName(sc.s(args: _*))
+  }
+}
   }
   // scalastyle:on
 

http://git-wip-us.apache.org/repos/asf/spark/blob/80a6fb52/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
index 5f82372..47b6f80 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala
@@ -34,16 +34,6 @@ private[sql] abstract class SQLImplicits {
   protected def _sqlContext: SQLContext
 
   /**
-   * Converts $"col name" into an [[Column]].
-   * @since 1.3.0
-   */
-  implicit class StringToColumn(val sc: StringContext) {
-def $(args: Any*): ColumnName = {
-  new ColumnName(sc.s(args: _*))
-}
-  }
-
-  /**
* An implicit conversion that turns a Scala `Symbol` into a [[Column]].
* @since 1.3.0
*/

http://git-wip-us.apache.org/repos/asf/spark/blob/80a6fb52/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala
index 3cfd822..8a061b6 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala
@@ -17,7 +17,7 @@
 
 package org.apache.spark.sql.test
 
-import org.apache.spark.sql.SQLContext
+import org.apache.spark.sql.{ColumnName, SQLContext}
 
 
 /**
@@ -65,4 +65,14 @@ private[sql] trait SharedSQLContext extends SQLTestUtils {
 }
   }
 
+  /**
+   * Converts $"col name" into an [[Column]].
+   * @since 1.3.0
+   */
+  // This must be duplicated here to preserve binary compatibility with Spark 
< 1.5.
+  implicit class StringToColumn(val sc: StringContext) {
+def $(args: Any*): ColumnName = {
+  new ColumnName(sc.s(args: _*))
+}
+  }
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10012] [ML] Missing test case for Params#arrayLengthGt

2015-08-18 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 56f4da263 - fb207b245


[SPARK-10012] [ML] Missing test case for Params#arrayLengthGt

Currently there is no test case for `Params#arrayLengthGt`.

Author: lewuathe lewua...@me.com

Closes #8223 from Lewuathe/SPARK-10012.

(cherry picked from commit c635a16f64c939182196b46725ef2d00ed107cca)
Signed-off-by: Joseph K. Bradley jos...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fb207b24
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fb207b24
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fb207b24

Branch: refs/heads/branch-1.5
Commit: fb207b245305b30b4fe47e08f98f2571a2d05249
Parents: 56f4da2
Author: lewuathe lewua...@me.com
Authored: Tue Aug 18 15:30:23 2015 -0700
Committer: Joseph K. Bradley jos...@databricks.com
Committed: Tue Aug 18 15:30:34 2015 -0700

--
 mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala | 3 +++
 1 file changed, 3 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fb207b24/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala
index be95638..2c878f8 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala
@@ -199,6 +199,9 @@ class ParamsSuite extends SparkFunSuite {
 
 val inArray = ParamValidators.inArray[Int](Array(1, 2))
 assert(inArray(1) && inArray(2) && !inArray(0))
+
+val arrayLengthGt = ParamValidators.arrayLengthGt[Int](2.0)
+assert(arrayLengthGt(Array(0, 1, 2)) && !arrayLengthGt(Array(0, 1)))
   }
 
   test("Params.copyValues") {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-9900] [MLLIB] User guide for Association Rules

2015-08-18 Thread meng
Repository: spark
Updated Branches:
  refs/heads/master c1840a862 - f5ea39129


[SPARK-9900] [MLLIB] User guide for Association Rules

Updates FPM user guide to include Association Rules.

Author: Feynman Liang fli...@databricks.com

Closes #8207 from feynmanliang/SPARK-9900-arules.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5ea3912
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5ea3912
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5ea3912

Branch: refs/heads/master
Commit: f5ea3912900ccdf23e2eb419a342bfe3c0c0b61b
Parents: c1840a8
Author: Feynman Liang fli...@databricks.com
Authored: Tue Aug 18 12:53:57 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Tue Aug 18 12:53:57 2015 -0700

--
 docs/mllib-frequent-pattern-mining.md   | 130 +--
 docs/mllib-guide.md |   1 +
 .../mllib/fpm/JavaAssociationRulesSuite.java|   2 +-
 3 files changed, 118 insertions(+), 15 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea3912/docs/mllib-frequent-pattern-mining.md
--
diff --git a/docs/mllib-frequent-pattern-mining.md 
b/docs/mllib-frequent-pattern-mining.md
index 8ea4389..6c06550 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -39,18 +39,30 @@ MLlib's FP-growth implementation takes the following 
(hyper-)parameters:
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) 
implements the
-FP-growth algorithm.
-It take a `JavaRDD` of transactions, where each transaction is an `Iterable` 
of items of a generic type.
+[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth)
+implements the FP-growth algorithm.  It take an `RDD` of transactions,
+where each transaction is an `Iterable` of items of a generic type.
 Calling `FPGrowth.run` with transactions returns an
 
[`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel)
-that stores the frequent itemsets with their frequencies.
+that stores the frequent itemsets with their frequencies.  The following
+example illustrates how to mine frequent itemsets and association rules
+(see [Association
+Rules](mllib-frequent-pattern-mining.html#association-rules) for
+details) from `transactions`.
+
 
 {% highlight scala %}
 import org.apache.spark.rdd.RDD
 import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}
 
-val transactions: RDD[Array[String]] = ...
+val transactions: RDD[Array[String]] = sc.parallelize(Seq(
+  "r z h k p",
+  "z y x w v u t s",
+  "s x o n r",
+  "x z y m t s q e",
+  "z",
+  "x z y r q t p")
+  .map(_.split(" ")))
 
 val fpg = new FPGrowth()
   .setMinSupport(0.2)
@@ -60,29 +72,48 @@ val model = fpg.run(transactions)
 model.freqItemsets.collect().foreach { itemset =>
   println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
 }
+
+val minConfidence = 0.8
+model.generateAssociationRules(minConfidence).collect().foreach { rule =>
+  println(
+rule.antecedent.mkString("[", ",", "]")
+  + " => " + rule.consequent.mkString("[", ",", "]")
+  + ", " + rule.confidence)
+}
 {% endhighlight %}
 
 </div>
 
 <div data-lang="java" markdown="1">
 
-[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
-FP-growth algorithm.
-It take an `RDD` of transactions, where each transaction is an `Array` of 
items of a generic type.
-Calling `FPGrowth.run` with transactions returns an
+[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html)
+implements the FP-growth algorithm.  It take a `JavaRDD` of
+transactions, where each transaction is an `Array` of items of a generic
+type.  Calling `FPGrowth.run` with transactions returns an
 [`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html)
-that stores the frequent itemsets with their frequencies.
+that stores the frequent itemsets with their frequencies.  The following
+example illustrates how to mine frequent itemsets and association rules
+(see [Association
+Rules](mllib-frequent-pattern-mining.html#association-rules) for
+details) from `transactions`.
 
 {% highlight java %}
+import java.util.Arrays;
 import java.util.List;
 
-import com.google.common.base.Joiner;
-
 import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.fpm.AssociationRules;
 import org.apache.spark.mllib.fpm.FPGrowth;
 import org.apache.spark.mllib.fpm.FPGrowthModel;
 
-JavaRDD<List<String>> transactions = ...
+JavaRDD<List<String>> transactions = sc.parallelize(Arrays.asList(
+  Arrays.asList("r z h k p".split(" ")),
+  Arrays.asList("z y x w v u t s".split(" ")),
+  Arrays.asList("s x o n r".split(" ")),
+  Arrays.asList("x z y m t s 

spark git commit: [SPARK-10089] [SQL] Add missing golden files.

2015-08-18 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 80a6fb521 - 74a6b1a13


[SPARK-10089] [SQL] Add missing golden files.

Author: Marcelo Vanzin van...@cloudera.com

Closes #8283 from vanzin/SPARK-10089.

(cherry picked from commit fa41e0242f075843beff7dc600d1a6bac004bdc7)
Signed-off-by: Michael Armbrust mich...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/74a6b1a1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/74a6b1a1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/74a6b1a1

Branch: refs/heads/branch-1.5
Commit: 74a6b1a131a53a69527d56680c03db9caeefbba7
Parents: 80a6fb5
Author: Marcelo Vanzin van...@cloudera.com
Authored: Tue Aug 18 14:43:05 2015 -0700
Committer: Michael Armbrust mich...@databricks.com
Committed: Tue Aug 18 14:43:19 2015 -0700

--
 ...uery test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 |   3 +
 ...uery test-0-eabbebd5c1d127b1605bfec52d7b7f3f | 500 +++
 2 files changed, 503 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/74a6b1a1/sql/hive/src/test/resources/golden/Column
 pruning - non-trivial top project with aliases - query 
test-0-515e406ffb23f6fd0d8cd34c2b25fbe6
--
diff --git a/sql/hive/src/test/resources/golden/Column pruning - non-trivial 
top project with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 
b/sql/hive/src/test/resources/golden/Column pruning - non-trivial top project 
with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6
new file mode 100644
index 000..9a276bc
--- /dev/null
+++ b/sql/hive/src/test/resources/golden/Column pruning - non-trivial top 
project with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6  
@@ -0,0 +1,3 @@
+476
+172
+622

http://git-wip-us.apache.org/repos/asf/spark/blob/74a6b1a1/sql/hive/src/test/resources/golden/Partition
 pruning - non-partitioned, non-trivial project - query 
test-0-eabbebd5c1d127b1605bfec52d7b7f3f
--
diff --git a/sql/hive/src/test/resources/golden/Partition pruning - 
non-partitioned, non-trivial project - query 
test-0-eabbebd5c1d127b1605bfec52d7b7f3f 
b/sql/hive/src/test/resources/golden/Partition pruning - non-partitioned, 
non-trivial project - query test-0-eabbebd5c1d127b1605bfec52d7b7f3f
new file mode 100644
index 000..444039e
--- /dev/null
+++ b/sql/hive/src/test/resources/golden/Partition pruning - non-partitioned, 
non-trivial project - query test-0-eabbebd5c1d127b1605bfec52d7b7f3f   
@@ -0,0 +1,500 @@
+476
+172
+622
+54
+330
+818
+510
+556
+196
+968
+530
+386
+802
+300
+546
+448
+738
+132
+256
+426
+292
+812
+858
+748
+304
+938
+290
+990
+74
+654
+562
+554
+418
+30
+164
+806
+332
+834
+860
+504
+584
+438
+574
+306
+386
+676
+892
+918
+788
+474
+964
+348
+826
+988
+414
+398
+932
+416
+348
+798
+792
+494
+834
+978
+324
+754
+794
+618
+730
+532
+878
+684
+734
+650
+334
+390
+950
+34
+226
+310
+406
+678
+0
+910
+256
+622
+632
+114
+604
+410
+298
+876
+690
+258
+340
+40
+978
+314
+756
+442
+184
+222
+94
+144
+8
+560
+70
+854
+554
+416
+712
+798
+338
+764
+996
+250
+772
+874
+938
+384
+572
+374
+352
+108
+918
+102
+276
+206
+478
+426
+432
+860
+556
+352
+578
+442
+130
+636
+664
+622
+550
+274
+482
+166
+666
+360
+568
+24
+460
+362
+134
+520
+808
+768
+978
+706
+746
+544
+276
+434
+168
+696
+932
+116
+16
+822
+460
+416
+696
+48
+926
+862
+358
+344
+84
+258
+316
+238
+992
+0
+644
+394
+936
+786
+908
+200
+596
+398
+382
+836
+192
+52
+330
+654
+460
+410
+240
+262
+102
+808
+86
+872
+312
+938
+936
+616
+190
+392
+576
+962
+914
+196
+564
+394
+374
+636
+636
+818
+940
+274
+738
+632
+338
+826
+170
+154
+0
+980
+174
+728
+358
+236
+268
+790
+564
+276
+476
+838
+30
+236
+144
+180
+614
+38
+870
+20
+554
+546
+612
+448
+618
+778
+654
+484
+738
+784
+544
+662
+802
+484
+904
+354
+452
+10
+994
+804
+792
+634
+790
+116
+70
+672
+190
+22
+336
+68
+458
+466
+286
+944
+644
+996
+320
+390
+84
+642
+860
+238
+978
+916
+156
+152
+82
+446
+984
+298
+898
+436
+456
+276
+906
+60
+418
+128
+936
+152
+148
+684
+138
+460
+66
+736
+206
+592
+226
+432
+734
+688
+334
+548
+438
+478
+970
+232
+446
+512
+526
+140
+974
+960
+802
+576
+382
+10
+488
+876
+256
+934
+864
+404
+632
+458
+938
+926
+560
+4
+70
+566
+662
+470
+160
+88
+386
+642
+670
+208
+932
+732
+350
+806
+966
+106
+210
+514
+812
+818
+380
+812
+802
+228
+516
+180
+406
+524
+696
+848
+24
+792
+402
+434
+328
+862
+908
+956
+596
+250
+862
+328
+848
+374
+764
+10
+140
+794
+960
+582
+48
+702
+510
+208
+140
+326
+876
+238
+828
+400
+982
+474
+878
+720
+496
+958
+610
+834
+398
+888
+240
+858
+338
+886
+646
+650
+554
+460
+956
+356
+936
+620
+634
+666
+986
+920
+414
+498
+530
+960
+166
+272
+706
+344
+428
+924
+466
+812
+266
+350
+378
+908
+750
+802
+842

spark git commit: [SPARK-10089] [SQL] Add missing golden files.

2015-08-18 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/master 9b731fad2 - fa41e0242


[SPARK-10089] [SQL] Add missing golden files.

Author: Marcelo Vanzin van...@cloudera.com

Closes #8283 from vanzin/SPARK-10089.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fa41e024
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fa41e024
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fa41e024

Branch: refs/heads/master
Commit: fa41e0242f075843beff7dc600d1a6bac004bdc7
Parents: 9b731fa
Author: Marcelo Vanzin van...@cloudera.com
Authored: Tue Aug 18 14:43:05 2015 -0700
Committer: Michael Armbrust mich...@databricks.com
Committed: Tue Aug 18 14:43:05 2015 -0700

--
 ...uery test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 |   3 +
 ...uery test-0-eabbebd5c1d127b1605bfec52d7b7f3f | 500 +++
 2 files changed, 503 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/fa41e024/sql/hive/src/test/resources/golden/Column
 pruning - non-trivial top project with aliases - query 
test-0-515e406ffb23f6fd0d8cd34c2b25fbe6
--
diff --git a/sql/hive/src/test/resources/golden/Column pruning - non-trivial 
top project with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 
b/sql/hive/src/test/resources/golden/Column pruning - non-trivial top project 
with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6
new file mode 100644
index 000..9a276bc
--- /dev/null
+++ b/sql/hive/src/test/resources/golden/Column pruning - non-trivial top 
project with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6  
@@ -0,0 +1,3 @@
+476
+172
+622

http://git-wip-us.apache.org/repos/asf/spark/blob/fa41e024/sql/hive/src/test/resources/golden/Partition
 pruning - non-partitioned, non-trivial project - query 
test-0-eabbebd5c1d127b1605bfec52d7b7f3f
--
diff --git a/sql/hive/src/test/resources/golden/Partition pruning - 
non-partitioned, non-trivial project - query 
test-0-eabbebd5c1d127b1605bfec52d7b7f3f 
b/sql/hive/src/test/resources/golden/Partition pruning - non-partitioned, 
non-trivial project - query test-0-eabbebd5c1d127b1605bfec52d7b7f3f
new file mode 100644
index 000..444039e
--- /dev/null
+++ b/sql/hive/src/test/resources/golden/Partition pruning - non-partitioned, 
non-trivial project - query test-0-eabbebd5c1d127b1605bfec52d7b7f3f   
@@ -0,0 +1,500 @@
+476
+172
+622
+54
+330
+818
+510
+556
+196
+968
+530
+386
+802
+300
+546
+448
+738
+132
+256
+426
+292
+812
+858
+748
+304
+938
+290
+990
+74
+654
+562
+554
+418
+30
+164
+806
+332
+834
+860
+504
+584
+438
+574
+306
+386
+676
+892
+918
+788
+474
+964
+348
+826
+988
+414
+398
+932
+416
+348
+798
+792
+494
+834
+978
+324
+754
+794
+618
+730
+532
+878
+684
+734
+650
+334
+390
+950
+34
+226
+310
+406
+678
+0
+910
+256
+622
+632
+114
+604
+410
+298
+876
+690
+258
+340
+40
+978
+314
+756
+442
+184
+222
+94
+144
+8
+560
+70
+854
+554
+416
+712
+798
+338
+764
+996
+250
+772
+874
+938
+384
+572
+374
+352
+108
+918
+102
+276
+206
+478
+426
+432
+860
+556
+352
+578
+442
+130
+636
+664
+622
+550
+274
+482
+166
+666
+360
+568
+24
+460
+362
+134
+520
+808
+768
+978
+706
+746
+544
+276
+434
+168
+696
+932
+116
+16
+822
+460
+416
+696
+48
+926
+862
+358
+344
+84
+258
+316
+238
+992
+0
+644
+394
+936
+786
+908
+200
+596
+398
+382
+836
+192
+52
+330
+654
+460
+410
+240
+262
+102
+808
+86
+872
+312
+938
+936
+616
+190
+392
+576
+962
+914
+196
+564
+394
+374
+636
+636
+818
+940
+274
+738
+632
+338
+826
+170
+154
+0
+980
+174
+728
+358
+236
+268
+790
+564
+276
+476
+838
+30
+236
+144
+180
+614
+38
+870
+20
+554
+546
+612
+448
+618
+778
+654
+484
+738
+784
+544
+662
+802
+484
+904
+354
+452
+10
+994
+804
+792
+634
+790
+116
+70
+672
+190
+22
+336
+68
+458
+466
+286
+944
+644
+996
+320
+390
+84
+642
+860
+238
+978
+916
+156
+152
+82
+446
+984
+298
+898
+436
+456
+276
+906
+60
+418
+128
+936
+152
+148
+684
+138
+460
+66
+736
+206
+592
+226
+432
+734
+688
+334
+548
+438
+478
+970
+232
+446
+512
+526
+140
+974
+960
+802
+576
+382
+10
+488
+876
+256
+934
+864
+404
+632
+458
+938
+926
+560
+4
+70
+566
+662
+470
+160
+88
+386
+642
+670
+208
+932
+732
+350
+806
+966
+106
+210
+514
+812
+818
+380
+812
+802
+228
+516
+180
+406
+524
+696
+848
+24
+792
+402
+434
+328
+862
+908
+956
+596
+250
+862
+328
+848
+374
+764
+10
+140
+794
+960
+582
+48
+702
+510
+208
+140
+326
+876
+238
+828
+400
+982
+474
+878
+720
+496
+958
+610
+834
+398
+888
+240
+858
+338
+886
+646
+650
+554
+460
+956
+356
+936
+620
+634
+666
+986
+920
+414
+498
+530
+960
+166
+272
+706
+344
+428
+924
+466
+812
+266
+350
+378
+908
+750
+802
+842
+814
+768
+512
+52
+268
+134
+768
+758
+36
+924
+984
+200
+596
+18
+682
+996
+292
+916
+724
+372
+570
+696
+334
+36
+546
+366
+562

spark git commit: [SPARK-10012] [ML] Missing test case for Params#arrayLengthGt

2015-08-18 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 1dbffba37 - c635a16f6


[SPARK-10012] [ML] Missing test case for Params#arrayLengthGt

Currently there is no test case for `Params#arrayLengthGt`.

Author: lewuathe lewua...@me.com

Closes #8223 from Lewuathe/SPARK-10012.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c635a16f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c635a16f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c635a16f

Branch: refs/heads/master
Commit: c635a16f64c939182196b46725ef2d00ed107cca
Parents: 1dbffba
Author: lewuathe lewua...@me.com
Authored: Tue Aug 18 15:30:23 2015 -0700
Committer: Joseph K. Bradley jos...@databricks.com
Committed: Tue Aug 18 15:30:23 2015 -0700

--
 mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala | 3 +++
 1 file changed, 3 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/c635a16f/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala
--
diff --git a/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala
index be95638..2c878f8 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala
@@ -199,6 +199,9 @@ class ParamsSuite extends SparkFunSuite {
 
 val inArray = ParamValidators.inArray[Int](Array(1, 2))
 assert(inArray(1) && inArray(2) && !inArray(0))
+
+val arrayLengthGt = ParamValidators.arrayLengthGt[Int](2.0)
+assert(arrayLengthGt(Array(0, 1, 2)) && !arrayLengthGt(Array(0, 1)))
   }
 
   test("Params.copyValues") {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARKR] [MINOR] Get rid of a long line warning

2015-08-18 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 9b42e2404 - 0a1385e31


[SPARKR] [MINOR] Get rid of a long line warning

```
R/functions.R:74:1: style: lines should not be more than 100 characters.
jc <- callJStatic("org.apache.spark.sql.functions", "lit", 
ifelse(class(x) == "Column", x@jc, x))
^
```

Author: Yu ISHIKAWA yuu.ishik...@gmail.com

Closes #8297 from yu-iskw/minor-lint-r.

(cherry picked from commit b4b35f133aecaf84f04e8e444b660a33c6b7894a)
Signed-off-by: Shivaram Venkataraman shiva...@cs.berkeley.edu


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0a1385e3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0a1385e3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0a1385e3

Branch: refs/heads/branch-1.5
Commit: 0a1385e319a2bca115b6bfefe7820b78ce5fb753
Parents: 9b42e24
Author: Yu ISHIKAWA yuu.ishik...@gmail.com
Authored: Tue Aug 18 19:18:05 2015 -0700
Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu
Committed: Tue Aug 18 19:18:13 2015 -0700

--
 R/pkg/R/functions.R | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0a1385e3/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index 6eef4d6..e606b20 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -71,7 +71,9 @@ createFunctions()
 #' @return Creates a Column class of literal value.
 setMethod("lit", signature("ANY"),
   function(x) {
-jc <- callJStatic("org.apache.spark.sql.functions", "lit", 
ifelse(class(x) == "Column", x@jc, x))
+jc <- callJStatic("org.apache.spark.sql.functions",
+  "lit",
+  ifelse(class(x) == "Column", x@jc, x))
 column(jc)
   })
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10095] [SQL] use public API of BigInteger

2015-08-18 Thread davies
Repository: spark
Updated Branches:
  refs/heads/master bf32c1f7f - 270ee6777


[SPARK-10095] [SQL] use public API of BigInteger

In UnsafeRow, we accessed a private field of BigInteger for better performance, 
but that contributed little (about 3% in one benchmark) to end-to-end runtime 
and made the code non-portable (it may fail on other JVM implementations).

So we should use the public API instead.
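
As a rough sketch of the portable round trip this relies on (a standalone example, not the UnsafeRow code itself), the unscaled value of a decimal can be written out and restored using only public BigInteger/BigDecimal constructors and methods:

import java.math.{BigDecimal => JBigDecimal, BigInteger}

object DecimalRoundTrip extends App {
  val original = new JBigDecimal("123456789012345678901234.5678")
  val scale = original.scale()

  // Public API: two's-complement byte encoding of the unscaled value.
  val bytes: Array[Byte] = original.unscaledValue().toByteArray

  // Restore without touching any BigInteger internals.
  val restored = new JBigDecimal(new BigInteger(bytes), scale)
  assert(restored == original)
  println(restored)
}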

cc rxin

Author: Davies Liu dav...@databricks.com

Closes #8286 from davies/portable_decimal.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/270ee677
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/270ee677
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/270ee677

Branch: refs/heads/master
Commit: 270ee677750a1f2adaf24b5816857194e61782ff
Parents: bf32c1f
Author: Davies Liu dav...@databricks.com
Authored: Tue Aug 18 20:39:59 2015 -0700
Committer: Davies Liu davies@gmail.com
Committed: Tue Aug 18 20:39:59 2015 -0700

--
 .../sql/catalyst/expressions/UnsafeRow.java | 29 ++--
 .../catalyst/expressions/UnsafeRowWriters.java  |  9 ++
 .../java/org/apache/spark/unsafe/Platform.java  | 18 
 3 files changed, 11 insertions(+), 45 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/270ee677/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
--
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
index 7fd9477..6c02004 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
@@ -273,14 +273,13 @@ public final class UnsafeRow extends MutableRow {
   } else {
 
 final BigInteger integer = value.toJavaBigDecimal().unscaledValue();
-final int[] mag = (int[]) Platform.getObjectVolatile(integer,
-  Platform.BIG_INTEGER_MAG_OFFSET);
-assert(mag.length <= 4);
+byte[] bytes = integer.toByteArray();
+assert(bytes.length <= 16);
 
 // Write the bytes to the variable length portion.
 Platform.copyMemory(
-  mag, Platform.INT_ARRAY_OFFSET, baseObject, baseOffset + cursor, 
mag.length * 4);
-setLong(ordinal, (cursor << 32) | ((long) (((integer.signum() + 1) << 
8) + mag.length)));
+  bytes, Platform.BYTE_ARRAY_OFFSET, baseObject, baseOffset + cursor, 
bytes.length);
+setLong(ordinal, (cursor << 32) | ((long) bytes.length));
   }
 }
   }
@@ -375,8 +374,6 @@ public final class UnsafeRow extends MutableRow {
 return Platform.getDouble(baseObject, getFieldOffset(ordinal));
   }
 
-  private static byte[] EMPTY = new byte[0];
-
   @Override
   public Decimal getDecimal(int ordinal, int precision, int scale) {
 if (isNullAt(ordinal)) {
@@ -385,20 +382,10 @@ public final class UnsafeRow extends MutableRow {
 if (precision <= Decimal.MAX_LONG_DIGITS()) {
   return Decimal.apply(getLong(ordinal), precision, scale);
 } else {
-  long offsetAndSize = getLong(ordinal);
-  long offset = offsetAndSize >> 32;
-  int signum = ((int) (offsetAndSize & 0xfff) >> 8);
-  assert signum >= 0 && signum <= 2 : "invalid signum " + signum;
-  int size = (int) (offsetAndSize & 0xff);
-  int[] mag = new int[size];
-  Platform.copyMemory(
-baseObject, baseOffset + offset, mag, Platform.INT_ARRAY_OFFSET, size 
* 4);
-
-  // create a BigInteger using signum and mag
-  BigInteger v = new BigInteger(0, EMPTY);  // create the initial object
-  Platform.putInt(v, Platform.BIG_INTEGER_SIGNUM_OFFSET, signum - 1);
-  Platform.putObjectVolatile(v, Platform.BIG_INTEGER_MAG_OFFSET, mag);
-  return Decimal.apply(new BigDecimal(v, scale), precision, scale);
+  byte[] bytes = getBinary(ordinal);
+  BigInteger bigInteger = new BigInteger(bytes);
+  BigDecimal javaDecimal = new BigDecimal(bigInteger, scale);
+  return Decimal.apply(javaDecimal, precision, scale);
 }
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/270ee677/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java
--
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java
index 005351f..2f43db6 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java
+++ 

spark git commit: [SPARK-10072] [STREAMING] BlockGenerator can deadlock when the queue of generate blocks fills up to capacity

2015-08-18 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/master b4b35f133 - 1aeae05bb


[SPARK-10072] [STREAMING] BlockGenerator can deadlock when the queue of 
generate blocks fills up to capacity

Generated blocks are inserted into an ArrayBlockingQueue, and another thread 
pulls them from that queue and pushes them into the BlockManager. If the queue 
fills up to capacity (the default is 10 blocks), inserting into it (done in 
updateCurrentBuffer) blocks inside a synchronized block. However, the thread 
that pulls blocks from the queue uses the same lock to check the current state 
(active or stopped) while draining. Since the block-generating thread is 
blocked on the lock (because the queue is full), the thread that is supposed to 
drain the queue is blocked as well. Ergo, deadlock.

Solution: Move the blocking call to the ArrayBlockingQueue outside the 
synchronized block to prevent the deadlock.
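
The locking discipline the fix adopts, sketched independently of the actual BlockGenerator (the class and method names below are made up for illustration): swap the buffer while holding the lock, but perform the potentially blocking put on the bounded queue after releasing it.

import java.util.concurrent.ArrayBlockingQueue
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch of the pattern, not the Spark class itself.
class BufferRoller[T](queueCapacity: Int) {
  private val queue = new ArrayBlockingQueue[ArrayBuffer[T]](queueCapacity)
  private var buffer = new ArrayBuffer[T]

  def add(item: T): Unit = synchronized { buffer += item }

  /** Called periodically by the producing thread. */
  def rollOver(): Unit = {
    // Swap the buffer under the lock; this never blocks.
    val snapshot: Option[ArrayBuffer[T]] = synchronized {
      if (buffer.nonEmpty) {
        val full = buffer
        buffer = new ArrayBuffer[T]
        Some(full)
      } else None
    }
    // put() may block when the queue is full; doing it outside synchronized
    // keeps this object's lock free for the draining thread, avoiding deadlock.
    snapshot.foreach(queue.put)
  }

  /** Called by the draining thread, which may also need this object's lock. */
  def drainOne(): Option[ArrayBuffer[T]] = Option(queue.poll())
}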

Author: Tathagata Das tathagata.das1...@gmail.com

Closes #8257 from tdas/SPARK-10072.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1aeae05b
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1aeae05b
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1aeae05b

Branch: refs/heads/master
Commit: 1aeae05bb20f01ab7ccaa62fe905a63e020074b5
Parents: b4b35f1
Author: Tathagata Das tathagata.das1...@gmail.com
Authored: Tue Aug 18 19:26:38 2015 -0700
Committer: Tathagata Das tathagata.das1...@gmail.com
Committed: Tue Aug 18 19:26:38 2015 -0700

--
 .../streaming/receiver/BlockGenerator.scala | 29 +---
 1 file changed, 19 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1aeae05b/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala
 
b/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala
index 300e820..421d60a 100644
--- 
a/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala
+++ 
b/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala
@@ -227,16 +227,21 @@ private[streaming] class BlockGenerator(
   def isStopped(): Boolean = state == StoppedAll
 
   /** Change the buffer to which single records are added to. */
-  private def updateCurrentBuffer(time: Long): Unit = synchronized {
+  private def updateCurrentBuffer(time: Long): Unit = {
 try {
-  val newBlockBuffer = currentBuffer
-  currentBuffer = new ArrayBuffer[Any]
-  if (newBlockBuffer.size > 0) {
-val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
-val newBlock = new Block(blockId, newBlockBuffer)
-listener.onGenerateBlock(blockId)
+  var newBlock: Block = null
+  synchronized {
+if (currentBuffer.nonEmpty) {
+  val newBlockBuffer = currentBuffer
+  currentBuffer = new ArrayBuffer[Any]
+  val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
+  listener.onGenerateBlock(blockId)
+  newBlock = new Block(blockId, newBlockBuffer)
+}
+  }
+
+  if (newBlock != null) {
 blocksForPushing.put(newBlock)  // put is blocking when queue is full
-logDebug("Last element in " + blockId + " is " + newBlockBuffer.last)
   }
 } catch {
   case ie: InterruptedException =>
@@ -250,9 +255,13 @@ private[streaming] class BlockGenerator(
   private def keepPushingBlocks() {
 logInfo("Started block pushing thread")
 
-def isGeneratingBlocks = synchronized { state == Active || state == 
StoppedAddingData }
+def areBlocksBeingGenerated: Boolean = synchronized {
+  state != StoppedGeneratingBlocks
+}
+
 try {
-  while (isGeneratingBlocks) {
+  // While blocks are being generated, keep polling for to-be-pushed 
blocks and push them.
+  while (areBlocksBeingGenerated) {
 Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
   case Some(block) => pushBlock(block)
   case None =>


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10072] [STREAMING] BlockGenerator can deadlock when the queue of generate blocks fills up to capacity

2015-08-18 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 0a1385e31 - 08c5962a2


[SPARK-10072] [STREAMING] BlockGenerator can deadlock when the queue of 
generate blocks fills up to capacity

Generated blocks are inserted into an ArrayBlockingQueue, and another thread 
pulls them from that queue and pushes them into the BlockManager. If the queue 
fills up to capacity (the default is 10 blocks), inserting into it (done in 
updateCurrentBuffer) blocks inside a synchronized block. However, the thread 
that pulls blocks from the queue uses the same lock to check the current state 
(active or stopped) while draining. Since the block-generating thread is 
blocked on the lock (because the queue is full), the thread that is supposed to 
drain the queue is blocked as well. Ergo, deadlock.

Solution: Move the blocking call to the ArrayBlockingQueue outside the 
synchronized block to prevent the deadlock.

Author: Tathagata Das tathagata.das1...@gmail.com

Closes #8257 from tdas/SPARK-10072.

(cherry picked from commit 1aeae05bb20f01ab7ccaa62fe905a63e020074b5)
Signed-off-by: Tathagata Das tathagata.das1...@gmail.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/08c5962a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/08c5962a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/08c5962a

Branch: refs/heads/branch-1.5
Commit: 08c5962a251555e7d34460135ab6c32cce584704
Parents: 0a1385e
Author: Tathagata Das tathagata.das1...@gmail.com
Authored: Tue Aug 18 19:26:38 2015 -0700
Committer: Tathagata Das tathagata.das1...@gmail.com
Committed: Tue Aug 18 19:26:51 2015 -0700

--
 .../streaming/receiver/BlockGenerator.scala | 29 +---
 1 file changed, 19 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/08c5962a/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala
 
b/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala
index 300e820..421d60a 100644
--- 
a/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala
+++ 
b/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala
@@ -227,16 +227,21 @@ private[streaming] class BlockGenerator(
   def isStopped(): Boolean = state == StoppedAll
 
   /** Change the buffer to which single records are added to. */
-  private def updateCurrentBuffer(time: Long): Unit = synchronized {
+  private def updateCurrentBuffer(time: Long): Unit = {
 try {
-  val newBlockBuffer = currentBuffer
-  currentBuffer = new ArrayBuffer[Any]
-  if (newBlockBuffer.size > 0) {
-val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
-val newBlock = new Block(blockId, newBlockBuffer)
-listener.onGenerateBlock(blockId)
+  var newBlock: Block = null
+  synchronized {
+if (currentBuffer.nonEmpty) {
+  val newBlockBuffer = currentBuffer
+  currentBuffer = new ArrayBuffer[Any]
+  val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
+  listener.onGenerateBlock(blockId)
+  newBlock = new Block(blockId, newBlockBuffer)
+}
+  }
+
+  if (newBlock != null) {
 blocksForPushing.put(newBlock)  // put is blocking when queue is full
-logDebug("Last element in " + blockId + " is " + newBlockBuffer.last)
   }
 } catch {
   case ie: InterruptedException =>
@@ -250,9 +255,13 @@ private[streaming] class BlockGenerator(
   private def keepPushingBlocks() {
 logInfo("Started block pushing thread")
 
-def isGeneratingBlocks = synchronized { state == Active || state == 
StoppedAddingData }
+def areBlocksBeingGenerated: Boolean = synchronized {
+  state != StoppedGeneratingBlocks
+}
+
 try {
-  while (isGeneratingBlocks) {
+  // While blocks are being generated, keep polling for to-be-pushed 
blocks and push them.
+  while (areBlocksBeingGenerated) {
 Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
   case Some(block) => pushBlock(block)
   case None =>


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-9939] [SQL] Resorts to Java process API in CliSuite, HiveSparkSubmitSuite and HiveThriftServer2 test suites

2015-08-18 Thread lian
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 a6f8979c8 - bb2fb59f9


[SPARK-9939] [SQL] Resorts to Java process API in CliSuite, 
HiveSparkSubmitSuite and HiveThriftServer2 test suites

The Scala process API has a known bug ([SI-8768] [1]), which may be the reason 
why several test suites that fork sub-processes are flaky.

This PR replaces the Scala process API with the Java process API in `CliSuite`, 
`HiveSparkSubmitSuite`, and the `HiveThriftServer2`-related test suites to see 
whether it fixes these flaky tests.

[1]: https://issues.scala-lang.org/browse/SI-8768
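
For context, a minimal sketch of the Java-process-API approach (a hypothetical standalone example, not the actual test utility added by this patch): start the child with java.lang.ProcessBuilder and drain its output on a daemon thread so that neither side blocks on a full pipe.

import java.io.{BufferedReader, InputStreamReader}

// Hypothetical example command; the real suites launch spark-submit, the
// SQL CLI, etc. rather than a plain echo.
object ForkSketch extends App {
  val process = new ProcessBuilder("echo", "hello from child")
    .redirectErrorStream(true)
    .start()

  val drainer = new Thread(new Runnable {
    override def run(): Unit = {
      val reader = new BufferedReader(new InputStreamReader(process.getInputStream))
      Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(println)
    }
  })
  drainer.setDaemon(true)
  drainer.start()

  val exitCode = process.waitFor()
  println(s"child exited with $exitCode")
}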

Author: Cheng Lian l...@databricks.com

Closes #8168 from liancheng/spark-9939/use-java-process-api.

(cherry picked from commit a5b5b936596ceb45f5f5b68bf1d6368534fb9470)
Signed-off-by: Cheng Lian l...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb2fb59f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb2fb59f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb2fb59f

Branch: refs/heads/branch-1.5
Commit: bb2fb59f9d94f2bf175e06eae87ccefbdbbbf724
Parents: a6f8979
Author: Cheng Lian l...@databricks.com
Authored: Wed Aug 19 11:21:46 2015 +0800
Committer: Cheng Lian l...@databricks.com
Committed: Wed Aug 19 11:22:31 2015 +0800

--
 .../spark/sql/test/ProcessTestUtils.scala   | 37 +
 sql/hive-thriftserver/pom.xml   |  7 ++
 .../spark/sql/hive/thriftserver/CliSuite.scala  | 64 ++-
 .../thriftserver/HiveThriftServer2Suites.scala  | 83 +++-
 .../spark/sql/hive/HiveSparkSubmitSuite.scala   | 49 
 5 files changed, 149 insertions(+), 91 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bb2fb59f/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala
new file mode 100644
index 000..152c9c8
--- /dev/null
+++ b/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.test
+
+import java.io.{IOException, InputStream}
+
+import scala.sys.process.BasicIO
+
+object ProcessTestUtils {
+  class ProcessOutputCapturer(stream: InputStream, capture: String => Unit) 
extends Thread {
+this.setDaemon(true)
+
+override def run(): Unit = {
+  try {
+BasicIO.processFully(capture)(stream)
+  } catch { case _: IOException =>
+// Ignores the IOException thrown when the process termination, which 
closes the input
+// stream abruptly.
+  }
+}
+  }
+}

http://git-wip-us.apache.org/repos/asf/spark/blob/bb2fb59f/sql/hive-thriftserver/pom.xml
--
diff --git a/sql/hive-thriftserver/pom.xml b/sql/hive-thriftserver/pom.xml
index 2dfbcb2..3566c87 100644
--- a/sql/hive-thriftserver/pom.xml
+++ b/sql/hive-thriftserver/pom.xml
@@ -86,6 +86,13 @@
   <artifactId>selenium-java</artifactId>
   <scope>test</scope>
 </dependency>
+<dependency>
+  <groupId>org.apache.spark</groupId>
+  <artifactId>spark-sql_${scala.binary.version}</artifactId>
+  <type>test-jar</type>
+  <version>${project.version}</version>
+  <scope>test</scope>
+</dependency>
   </dependencies>
   <build>
 
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>

http://git-wip-us.apache.org/repos/asf/spark/blob/bb2fb59f/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
--
diff --git 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
index 121b3e0..e59a14e 100644
--- 

spark git commit: [SPARK-9939] [SQL] Resorts to Java process API in CliSuite, HiveSparkSubmitSuite and HiveThriftServer2 test suites

2015-08-18 Thread lian
Repository: spark
Updated Branches:
  refs/heads/master 90273eff9 - a5b5b9365


[SPARK-9939] [SQL] Resorts to Java process API in CliSuite, 
HiveSparkSubmitSuite and HiveThriftServer2 test suites

The Scala process API has a known bug ([SI-8768] [1]), which may be the reason 
why several test suites that fork sub-processes are flaky.

This PR replaces the Scala process API with the Java process API in `CliSuite`, 
`HiveSparkSubmitSuite`, and the `HiveThriftServer2`-related test suites to see 
whether it fixes these flaky tests.

[1]: https://issues.scala-lang.org/browse/SI-8768

Author: Cheng Lian l...@databricks.com

Closes #8168 from liancheng/spark-9939/use-java-process-api.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a5b5b936
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a5b5b936
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a5b5b936

Branch: refs/heads/master
Commit: a5b5b936596ceb45f5f5b68bf1d6368534fb9470
Parents: 90273ef
Author: Cheng Lian l...@databricks.com
Authored: Wed Aug 19 11:21:46 2015 +0800
Committer: Cheng Lian l...@databricks.com
Committed: Wed Aug 19 11:21:46 2015 +0800

--
 .../spark/sql/test/ProcessTestUtils.scala   | 37 +
 sql/hive-thriftserver/pom.xml   |  7 ++
 .../spark/sql/hive/thriftserver/CliSuite.scala  | 64 ++-
 .../thriftserver/HiveThriftServer2Suites.scala  | 83 +++-
 .../spark/sql/hive/HiveSparkSubmitSuite.scala   | 49 
 5 files changed, 149 insertions(+), 91 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a5b5b936/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala
--
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala
new file mode 100644
index 000..152c9c8
--- /dev/null
+++ b/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.test
+
+import java.io.{IOException, InputStream}
+
+import scala.sys.process.BasicIO
+
+object ProcessTestUtils {
+  class ProcessOutputCapturer(stream: InputStream, capture: String => Unit) 
extends Thread {
+this.setDaemon(true)
+
+override def run(): Unit = {
+  try {
+BasicIO.processFully(capture)(stream)
+  } catch { case _: IOException =>
+// Ignores the IOException thrown when the process termination, which 
closes the input
+// stream abruptly.
+  }
+}
+  }
+}

http://git-wip-us.apache.org/repos/asf/spark/blob/a5b5b936/sql/hive-thriftserver/pom.xml
--
diff --git a/sql/hive-thriftserver/pom.xml b/sql/hive-thriftserver/pom.xml
index 2dfbcb2..3566c87 100644
--- a/sql/hive-thriftserver/pom.xml
+++ b/sql/hive-thriftserver/pom.xml
@@ -86,6 +86,13 @@
   <artifactId>selenium-java</artifactId>
   <scope>test</scope>
 </dependency>
+<dependency>
+  <groupId>org.apache.spark</groupId>
+  <artifactId>spark-sql_${scala.binary.version}</artifactId>
+  <type>test-jar</type>
+  <version>${project.version}</version>
+  <scope>test</scope>
+</dependency>
   </dependencies>
   <build>
 
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>

http://git-wip-us.apache.org/repos/asf/spark/blob/a5b5b936/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
--
diff --git 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
index 121b3e0..e59a14e 100644
--- 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala
+++ 

spark git commit: [SPARKR] [MINOR] Get rid of a long line warning

2015-08-18 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master 1f8902964 - b4b35f133


[SPARKR] [MINOR] Get rid of a long line warning

```
R/functions.R:74:1: style: lines should not be more than 100 characters.
jc <- callJStatic("org.apache.spark.sql.functions", "lit", 
ifelse(class(x) == "Column", x@jc, x))
^
```

Author: Yu ISHIKAWA yuu.ishik...@gmail.com

Closes #8297 from yu-iskw/minor-lint-r.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b4b35f13
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b4b35f13
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b4b35f13

Branch: refs/heads/master
Commit: b4b35f133aecaf84f04e8e444b660a33c6b7894a
Parents: 1f89029
Author: Yu ISHIKAWA yuu.ishik...@gmail.com
Authored: Tue Aug 18 19:18:05 2015 -0700
Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu
Committed: Tue Aug 18 19:18:05 2015 -0700

--
 R/pkg/R/functions.R | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/b4b35f13/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index 6eef4d6..e606b20 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -71,7 +71,9 @@ createFunctions()
 #' @return Creates a Column class of literal value.
 setMethod("lit", signature("ANY"),
   function(x) {
-jc <- callJStatic("org.apache.spark.sql.functions", "lit", 
ifelse(class(x) == "Column", x@jc, x))
+jc <- callJStatic("org.apache.spark.sql.functions",
+  "lit",
+  ifelse(class(x) == "Column", x@jc, x))
 column(jc)
   })
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10102] [STREAMING] Fix a race condition that startReceiver may happen before setting trackerState to Started

2015-08-18 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 08c5962a2 - a6f8979c8


[SPARK-10102] [STREAMING] Fix a race condition that startReceiver may happen 
before setting trackerState to Started

Test failure: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3305/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/stop_gracefully/

There is a race condition in which setting `trackerState` to `Started` can 
happen after `startReceiver` is called. In that case `startReceiver` won't 
start the receivers, because it uses `!isTrackerStarted` to check whether 
ReceiverTracker is stopping or stopped; in reality `trackerState` is still 
`Initialized` and will be changed to `Started` shortly.

Therefore, we should check `isTrackerStopping || isTrackerStopped` instead.
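
Sketched abstractly (the state names follow the description above; everything else is illustrative, not the ReceiverTracker code), the difference between the two guards is that negating the terminal states also admits work scheduled while the tracker is still initializing:

// Illustrative sketch of the guard change.
object TrackerState extends Enumeration {
  val Initialized, Started, Stopping, Stopped = Value
}

object GuardSketch extends App {
  import TrackerState._

  def oldGuard(state: Value): Boolean = state == Started
  def newGuard(state: Value): Boolean = !(state == Stopping || state == Stopped)

  // A receiver scheduled just before the state flips to Started:
  assert(!oldGuard(Initialized)) // the old check wrongly skips starting it
  assert(newGuard(Initialized))  // the new check lets it start
  assert(newGuard(Started) && !newGuard(Stopping) && !newGuard(Stopped))
}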

Author: zsxwing zsxw...@gmail.com

Closes #8294 from zsxwing/SPARK-9504.

(cherry picked from commit 90273eff9604439a5a5853077e232d34555c67d7)
Signed-off-by: Tathagata Das tathagata.das1...@gmail.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a6f8979c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a6f8979c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a6f8979c

Branch: refs/heads/branch-1.5
Commit: a6f8979c81c5355759f74e8b3c9eb3cafb6a9c7f
Parents: 08c5962
Author: zsxwing zsxw...@gmail.com
Authored: Tue Aug 18 20:15:54 2015 -0700
Committer: Tathagata Das tathagata.das1...@gmail.com
Committed: Tue Aug 18 20:16:18 2015 -0700

--
 .../spark/streaming/scheduler/ReceiverTracker.scala  | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a6f8979c/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
 
b/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
index e076fb5..aae3acf 100644
--- 
a/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
+++ 
b/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
@@ -468,8 +468,13 @@ class ReceiverTracker(ssc: StreamingContext, 
skipReceiverLaunch: Boolean = false
  * Start a receiver along with its scheduled executors
  */
 private def startReceiver(receiver: Receiver[_], scheduledExecutors: 
Seq[String]): Unit = {
+  def shouldStartReceiver: Boolean = {
+// It's okay to start when trackerState is Initialized or Started
+!(isTrackerStopping || isTrackerStopped)
+  }
+
   val receiverId = receiver.streamId
-  if (!isTrackerStarted) {
+  if (!shouldStartReceiver) {
 onReceiverJobFinish(receiverId)
 return
   }
@@ -494,14 +499,14 @@ class ReceiverTracker(ssc: StreamingContext, 
skipReceiverLaunch: Boolean = false
   // We will keep restarting the receiver job until ReceiverTracker is 
stopped
   future.onComplete {
       case Success(_) =>
-        if (!isTrackerStarted) {
+        if (!shouldStartReceiver) {
           onReceiverJobFinish(receiverId)
         } else {
           logInfo(s"Restarting Receiver $receiverId")
           self.send(RestartReceiver(receiver))
         }
       case Failure(e) =>
-        if (!isTrackerStarted) {
+        if (!shouldStartReceiver) {
           onReceiverJobFinish(receiverId)
         } else {
           logError("Receiver has been stopped. Try to restart it.", e)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10102] [STREAMING] Fix a race condition that startReceiver may happen before setting trackerState to Started

2015-08-18 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/master 1aeae05bb - 90273eff9


[SPARK-10102] [STREAMING] Fix a race condition that startReceiver may happen 
before setting trackerState to Started

Test failure: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3305/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/stop_gracefully/

There is a race condition in which `trackerState` may be set to `Started` only
after `startReceiver` has been called. In that case `startReceiver` refuses to
start the receivers, because it uses `!isTrackerStarted` to check whether
ReceiverTracker is stopping or stopped, even though `trackerState` is actually
still `Initialized` and is about to become `Started`.

Therefore, we should use `isTrackerStopping || isTrackerStopped`.

Author: zsxwing zsxw...@gmail.com

Closes #8294 from zsxwing/SPARK-9504.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/90273eff
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/90273eff
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/90273eff

Branch: refs/heads/master
Commit: 90273eff9604439a5a5853077e232d34555c67d7
Parents: 1aeae05
Author: zsxwing zsxw...@gmail.com
Authored: Tue Aug 18 20:15:54 2015 -0700
Committer: Tathagata Das tathagata.das1...@gmail.com
Committed: Tue Aug 18 20:15:54 2015 -0700

--
 .../spark/streaming/scheduler/ReceiverTracker.scala  | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/90273eff/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
--
diff --git 
a/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
 
b/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
index e076fb5..aae3acf 100644
--- 
a/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
+++ 
b/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala
@@ -468,8 +468,13 @@ class ReceiverTracker(ssc: StreamingContext, 
skipReceiverLaunch: Boolean = false
  * Start a receiver along with its scheduled executors
  */
 private def startReceiver(receiver: Receiver[_], scheduledExecutors: 
Seq[String]): Unit = {
+  def shouldStartReceiver: Boolean = {
+// It's okay to start when trackerState is Initialized or Started
+!(isTrackerStopping || isTrackerStopped)
+  }
+
   val receiverId = receiver.streamId
-  if (!isTrackerStarted) {
+  if (!shouldStartReceiver) {
 onReceiverJobFinish(receiverId)
 return
   }
@@ -494,14 +499,14 @@ class ReceiverTracker(ssc: StreamingContext, 
skipReceiverLaunch: Boolean = false
   // We will keep restarting the receiver job until ReceiverTracker is 
stopped
   future.onComplete {
       case Success(_) =>
-        if (!isTrackerStarted) {
+        if (!shouldStartReceiver) {
           onReceiverJobFinish(receiverId)
         } else {
           logInfo(s"Restarting Receiver $receiverId")
           self.send(RestartReceiver(receiver))
         }
       case Failure(e) =>
-        if (!isTrackerStarted) {
+        if (!shouldStartReceiver) {
           onReceiverJobFinish(receiverId)
         } else {
           logError("Receiver has been stopped. Try to restart it.", e)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10075] [SPARKR] Add `when` expression function in SparkR

2015-08-18 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master a5b5b9365 - bf32c1f7f


[SPARK-10075] [SPARKR] Add `when` expression function in SparkR

- Add `when` and `otherwise` as `Column` methods
- Add `When` as an expression function
- Add `%otherwise%` infix as an alias of `otherwise`

Since R doesn't support a feature like method chaining, 
`otherwise(when(condition, value), value)` style is a little annoying for me. 
If `%otherwise%` looks strange for shivaram, I can remove it. What do you think?
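
(Illustration, not part of the patch: the new R methods delegate to the JVM-side column
functions, so the equivalent Scala expression looks like the sketch below. The DataFrame
and column names, `waiting` and `bucket`, are invented, and the sketch assumes a local
Spark 1.5-era build.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.when

object WhenOtherwiseSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("when-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Toy single-column DataFrame; the column name "waiting" is invented.
    val df = sqlContext.createDataFrame(Seq(Tuple1(55), Tuple1(82), Tuple1(79))).toDF("waiting")

    // JVM-side equivalent of otherwise(when(df$waiting > 80, "long"), "short") in SparkR.
    df.withColumn("bucket", when($"waiting" > 80, "long").otherwise("short")).show()

    sc.stop()
  }
}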

### JIRA
[[SPARK-10075] Add `when` expression function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075)

Author: Yu ISHIKAWA yuu.ishik...@gmail.com

Closes #8266 from yu-iskw/SPARK-10075.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bf32c1f7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bf32c1f7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bf32c1f7

Branch: refs/heads/master
Commit: bf32c1f7f47dd907d787469f979c5859e02ce5e6
Parents: a5b5b93
Author: Yu ISHIKAWA yuu.ishik...@gmail.com
Authored: Tue Aug 18 20:27:36 2015 -0700
Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu
Committed: Tue Aug 18 20:27:36 2015 -0700

--
 R/pkg/NAMESPACE  |  2 ++
 R/pkg/R/column.R | 14 ++
 R/pkg/R/functions.R  | 14 ++
 R/pkg/R/generics.R   |  8 
 R/pkg/inst/tests/test_sparkSQL.R |  7 +++
 5 files changed, 45 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bf32c1f7/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 607aef2..8fa12d5 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -152,6 +152,7 @@ exportMethods("abs",
               "n_distinct",
               "nanvl",
               "negate",
+              "otherwise",
               "pmod",
               "quarter",
               "reverse",
@@ -182,6 +183,7 @@ exportMethods("abs",
               "unhex",
               "upper",
               "weekofyear",
+              "when",
               "year")
 
 exportClasses("GroupedData")

http://git-wip-us.apache.org/repos/asf/spark/blob/bf32c1f7/R/pkg/R/column.R
--
diff --git a/R/pkg/R/column.R b/R/pkg/R/column.R
index 328f595..5a07ebd 100644
--- a/R/pkg/R/column.R
+++ b/R/pkg/R/column.R
@@ -203,3 +203,17 @@ setMethod("%in%",
             jc <- callJMethod(x@jc, "in", table)
             return(column(jc))
           })
+
+#' otherwise
+#'
+#' If values in the specified column are null, returns the value.
+#' Can be used in conjunction with `when` to specify a default value for expressions.
+#'
+#' @rdname column
+setMethod("otherwise",
+          signature(x = "Column", value = "ANY"),
+          function(x, value) {
+            value <- ifelse(class(value) == "Column", value@jc, value)
+            jc <- callJMethod(x@jc, "otherwise", value)
+            column(jc)
+          })

http://git-wip-us.apache.org/repos/asf/spark/blob/bf32c1f7/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index e606b20..366c230 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -165,3 +165,17 @@ setMethod("n", signature(x = "Column"),
           function(x) {
             count(x)
           })
+
+#' when
+#'
+#' Evaluates a list of conditions and returns one of multiple possible result expressions.
+#' For unmatched expressions null is returned.
+#'
+#' @rdname column
+setMethod("when", signature(condition = "Column", value = "ANY"),
+          function(condition, value) {
+              condition <- condition@jc
+              value <- ifelse(class(value) == "Column", value@jc, value)
+              jc <- callJStatic("org.apache.spark.sql.functions", "when", condition, value)
+              column(jc)
+          })

http://git-wip-us.apache.org/repos/asf/spark/blob/bf32c1f7/R/pkg/R/generics.R
--
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 5c1cc98..338b32e 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -651,6 +651,14 @@ setGeneric("rlike", function(x, ...) { standardGeneric("rlike") })
 #' @export
 setGeneric("startsWith", function(x, ...) { standardGeneric("startsWith") })
 
+#' @rdname column
+#' @export
+setGeneric("when", function(condition, value) { standardGeneric("when") })
+
+#' @rdname column
+#' @export
+setGeneric("otherwise", function(x, value) { standardGeneric("otherwise") })
+
 
 ## Expression Function Methods ##
 

http://git-wip-us.apache.org/repos/asf/spark/blob/bf32c1f7/R/pkg/inst/tests/test_sparkSQL.R

spark git commit: [SPARK-10095] [SQL] use public API of BigInteger

2015-08-18 Thread davies
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 ebaeb1892 - 11c933505


[SPARK-10095] [SQL] use public API of BigInteger

In UnsafeRow, we used BigInteger's private fields for better performance, but
that actually contributed little (about 3% in one benchmark) to end-to-end
runtime and made the code non-portable (it may fail on other JVM
implementations).

So we should use the public API instead.
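
(Illustration, not part of the patch: the portable round trip the change relies on,
serializing the unscaled value with BigInteger.toByteArray and rebuilding it with the
public BigInteger(byte[]) constructor, shown as a standalone Scala sketch; the value and
names are invented.)

import java.math.{BigDecimal => JBigDecimal, BigInteger}

object PortableDecimalSketch {
  def main(args: Array[String]): Unit = {
    val original = new JBigDecimal("123456789012345678901234.5678")
    val scale = original.scale()

    // Write path: public API only, so it works on any JVM implementation.
    val bytes: Array[Byte] = original.unscaledValue().toByteArray

    // Read path: rebuild the unscaled BigInteger and reattach the scale.
    val restored = new JBigDecimal(new BigInteger(bytes), scale)

    assert(restored == original)
    println(s"restored $restored from ${bytes.length} bytes")
  }
}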

cc rxin

Author: Davies Liu dav...@databricks.com

Closes #8286 from davies/portable_decimal.

(cherry picked from commit 270ee677750a1f2adaf24b5816857194e61782ff)
Signed-off-by: Davies Liu davies@gmail.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/11c93350
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/11c93350
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/11c93350

Branch: refs/heads/branch-1.5
Commit: 11c93350589f682f5713049b959a51549acc9aca
Parents: ebaeb18
Author: Davies Liu dav...@databricks.com
Authored: Tue Aug 18 20:39:59 2015 -0700
Committer: Davies Liu davies@gmail.com
Committed: Tue Aug 18 20:40:12 2015 -0700

--
 .../sql/catalyst/expressions/UnsafeRow.java | 29 ++--
 .../catalyst/expressions/UnsafeRowWriters.java  |  9 ++
 .../java/org/apache/spark/unsafe/Platform.java  | 18 
 3 files changed, 11 insertions(+), 45 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/11c93350/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
--
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
index 7fd9477..6c02004 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java
@@ -273,14 +273,13 @@ public final class UnsafeRow extends MutableRow {
   } else {
 
 final BigInteger integer = value.toJavaBigDecimal().unscaledValue();
-        final int[] mag = (int[]) Platform.getObjectVolatile(integer,
-          Platform.BIG_INTEGER_MAG_OFFSET);
-        assert(mag.length <= 4);
+        byte[] bytes = integer.toByteArray();
+        assert(bytes.length <= 16);
 
         // Write the bytes to the variable length portion.
         Platform.copyMemory(
-          mag, Platform.INT_ARRAY_OFFSET, baseObject, baseOffset + cursor, mag.length * 4);
-        setLong(ordinal, (cursor << 32) | ((long) (((integer.signum() + 1) << 8) + mag.length)));
+          bytes, Platform.BYTE_ARRAY_OFFSET, baseObject, baseOffset + cursor, bytes.length);
+        setLong(ordinal, (cursor << 32) | ((long) bytes.length));
       }
     }
   }
@@ -375,8 +374,6 @@ public final class UnsafeRow extends MutableRow {
     return Platform.getDouble(baseObject, getFieldOffset(ordinal));
   }
 
-  private static byte[] EMPTY = new byte[0];
-
   @Override
   public Decimal getDecimal(int ordinal, int precision, int scale) {
     if (isNullAt(ordinal)) {
@@ -385,20 +382,10 @@ public final class UnsafeRow extends MutableRow {
     if (precision <= Decimal.MAX_LONG_DIGITS()) {
       return Decimal.apply(getLong(ordinal), precision, scale);
     } else {
-      long offsetAndSize = getLong(ordinal);
-      long offset = offsetAndSize >>> 32;
-      int signum = ((int) (offsetAndSize & 0xfff) >> 8);
-      assert signum >= 0 && signum <= 2 : "invalid signum " + signum;
-      int size = (int) (offsetAndSize & 0xff);
-      int[] mag = new int[size];
-      Platform.copyMemory(
-        baseObject, baseOffset + offset, mag, Platform.INT_ARRAY_OFFSET, size * 4);
-
-  // create a BigInteger using signum and mag
-  BigInteger v = new BigInteger(0, EMPTY);  // create the initial object
-  Platform.putInt(v, Platform.BIG_INTEGER_SIGNUM_OFFSET, signum - 1);
-  Platform.putObjectVolatile(v, Platform.BIG_INTEGER_MAG_OFFSET, mag);
-  return Decimal.apply(new BigDecimal(v, scale), precision, scale);
+  byte[] bytes = getBinary(ordinal);
+  BigInteger bigInteger = new BigInteger(bytes);
+  BigDecimal javaDecimal = new BigDecimal(bigInteger, scale);
+  return Decimal.apply(javaDecimal, precision, scale);
 }
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/11c93350/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java
--
diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java
index 005351f..2f43db6 100644
--- 

spark git commit: [SPARK-10075] [SPARKR] Add `when` expression function in SparkR

2015-08-18 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 bb2fb59f9 - ebaeb1892


[SPARK-10075] [SPARKR] Add `when` expression function in SparkR

- Add `when` and `otherwise` as `Column` methods
- Add `When` as an expression function
- Add `%otherwise%` infix as an alias of `otherwise`

Since R doesn't support a feature like method chaining, 
`otherwise(when(condition, value), value)` style is a little annoying for me. 
If `%otherwise%` looks strange for shivaram, I can remove it. What do you think?

### JIRA
[[SPARK-10075] Add `when` expression function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075)

Author: Yu ISHIKAWA yuu.ishik...@gmail.com

Closes #8266 from yu-iskw/SPARK-10075.

(cherry picked from commit bf32c1f7f47dd907d787469f979c5859e02ce5e6)
Signed-off-by: Shivaram Venkataraman shiva...@cs.berkeley.edu


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ebaeb189
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ebaeb189
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ebaeb189

Branch: refs/heads/branch-1.5
Commit: ebaeb189260dd338fc5a91d8ec3ff6d45989991a
Parents: bb2fb59
Author: Yu ISHIKAWA yuu.ishik...@gmail.com
Authored: Tue Aug 18 20:27:36 2015 -0700
Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu
Committed: Tue Aug 18 20:29:34 2015 -0700

--
 R/pkg/NAMESPACE  |  2 ++
 R/pkg/R/column.R | 14 ++
 R/pkg/R/functions.R  | 14 ++
 R/pkg/R/generics.R   |  8 
 R/pkg/inst/tests/test_sparkSQL.R |  7 +++
 5 files changed, 45 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ebaeb189/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 607aef2..8fa12d5 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -152,6 +152,7 @@ exportMethods("abs",
               "n_distinct",
               "nanvl",
               "negate",
+              "otherwise",
               "pmod",
               "quarter",
               "reverse",
@@ -182,6 +183,7 @@ exportMethods("abs",
               "unhex",
               "upper",
               "weekofyear",
+              "when",
               "year")
 
 exportClasses("GroupedData")

http://git-wip-us.apache.org/repos/asf/spark/blob/ebaeb189/R/pkg/R/column.R
--
diff --git a/R/pkg/R/column.R b/R/pkg/R/column.R
index 328f595..5a07ebd 100644
--- a/R/pkg/R/column.R
+++ b/R/pkg/R/column.R
@@ -203,3 +203,17 @@ setMethod("%in%",
             jc <- callJMethod(x@jc, "in", table)
             return(column(jc))
           })
+
+#' otherwise
+#'
+#' If values in the specified column are null, returns the value.
+#' Can be used in conjunction with `when` to specify a default value for expressions.
+#'
+#' @rdname column
+setMethod("otherwise",
+          signature(x = "Column", value = "ANY"),
+          function(x, value) {
+            value <- ifelse(class(value) == "Column", value@jc, value)
+            jc <- callJMethod(x@jc, "otherwise", value)
+            column(jc)
+          })

http://git-wip-us.apache.org/repos/asf/spark/blob/ebaeb189/R/pkg/R/functions.R
--
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index e606b20..366c230 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -165,3 +165,17 @@ setMethod("n", signature(x = "Column"),
           function(x) {
             count(x)
           })
+
+#' when
+#'
+#' Evaluates a list of conditions and returns one of multiple possible result expressions.
+#' For unmatched expressions null is returned.
+#'
+#' @rdname column
+setMethod("when", signature(condition = "Column", value = "ANY"),
+          function(condition, value) {
+              condition <- condition@jc
+              value <- ifelse(class(value) == "Column", value@jc, value)
+              jc <- callJStatic("org.apache.spark.sql.functions", "when", condition, value)
+              column(jc)
+          })

http://git-wip-us.apache.org/repos/asf/spark/blob/ebaeb189/R/pkg/R/generics.R
--
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 5c1cc98..338b32e 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -651,6 +651,14 @@ setGeneric("rlike", function(x, ...) { standardGeneric("rlike") })
 #' @export
 setGeneric("startsWith", function(x, ...) { standardGeneric("startsWith") })
 
+#' @rdname column
+#' @export
+setGeneric("when", function(condition, value) { standardGeneric("when") })
+
+#' @rdname column
+#' @export
+setGeneric("otherwise", function(x, value) { standardGeneric("otherwise") })
+
 
 ## Expression Function Methods 

spark git commit: [SPARK-9574] [STREAMING] Remove unnecessary contents of spark-streaming-XXX-assembly jars

2015-08-18 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/master 8bae9015b - bf1d6614d


[SPARK-9574] [STREAMING] Remove unnecessary contents of 
spark-streaming-XXX-assembly jars

Removed contents already included in Spark assembly jar from 
spark-streaming-XXX-assembly jars.

Author: zsxwing zsxw...@gmail.com

Closes #8069 from zsxwing/SPARK-9574.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bf1d6614
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bf1d6614
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bf1d6614

Branch: refs/heads/master
Commit: bf1d6614dcb8f5974e62e406d9c0f8aac52556d3
Parents: 8bae901
Author: zsxwing zsxw...@gmail.com
Authored: Tue Aug 18 13:35:45 2015 -0700
Committer: Tathagata Das tathagata.das1...@gmail.com
Committed: Tue Aug 18 13:35:45 2015 -0700

--
 external/flume-assembly/pom.xml | 11 +
 external/kafka-assembly/pom.xml | 84 
 external/mqtt-assembly/pom.xml  | 74 
 extras/kinesis-asl-assembly/pom.xml | 79 ++
 pom.xml |  2 +-
 5 files changed, 249 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/bf1d6614/external/flume-assembly/pom.xml
--
diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml
index 1318959..e05e431 100644
--- a/external/flume-assembly/pom.xml
+++ b/external/flume-assembly/pom.xml
@@ -69,6 +69,11 @@
       <scope>provided</scope>
     </dependency>
     <dependency>
+      <groupId>commons-lang</groupId>
+      <artifactId>commons-lang</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
       <groupId>commons-net</groupId>
       <artifactId>commons-net</artifactId>
       <scope>provided</scope>
@@ -89,6 +94,12 @@
       <scope>provided</scope>
     </dependency>
     <dependency>
+      <groupId>org.apache.avro</groupId>
+      <artifactId>avro-mapred</artifactId>
+      <classifier>${avro.mapred.classifier}</classifier>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
       <groupId>org.scala-lang</groupId>
       <artifactId>scala-library</artifactId>
       <scope>provided</scope>

http://git-wip-us.apache.org/repos/asf/spark/blob/bf1d6614/external/kafka-assembly/pom.xml
--
diff --git a/external/kafka-assembly/pom.xml b/external/kafka-assembly/pom.xml
index 977514f..36342f3 100644
--- a/external/kafka-assembly/pom.xml
+++ b/external/kafka-assembly/pom.xml
@@ -47,6 +47,90 @@
       <version>${project.version}</version>
       <scope>provided</scope>
     </dependency>
+    <!--
+      Demote already included in the Spark assembly.
+    -->
+    <dependency>
+      <groupId>commons-codec</groupId>
+      <artifactId>commons-codec</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>commons-lang</groupId>
+      <artifactId>commons-lang</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.google.protobuf</groupId>
+      <artifactId>protobuf-java</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.sun.jersey</groupId>
+      <artifactId>jersey-server</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.sun.jersey</groupId>
+      <artifactId>jersey-core</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>net.jpountz.lz4</groupId>
+      <artifactId>lz4</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-client</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.avro</groupId>
+      <artifactId>avro-mapred</artifactId>
+      <classifier>${avro.mapred.classifier}</classifier>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.curator</groupId>
+      <artifactId>curator-recipes</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.zookeeper</groupId>
+      <artifactId>zookeeper</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>log4j</groupId>
+      <artifactId>log4j</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>net.java.dev.jets3t</groupId>
+      <artifactId>jets3t</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.scala-lang</groupId>
+      <artifactId>scala-library</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.slf4j</groupId>
+      <artifactId>slf4j-api</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.slf4j</groupId>
+      <artifactId>slf4j-log4j12</artifactId>

spark git commit: [SPARK-9782] [YARN] Support YARN application tags via SparkConf

2015-08-18 Thread sandy
Repository: spark
Updated Branches:
  refs/heads/master 80cb25b22 - 9b731fad2


[SPARK-9782] [YARN] Support YARN application tags via SparkConf

Add a new test case in yarn/ClientSuite which checks how the various SparkConf
and ClientArguments propagate into the ApplicationSubmissionContext.

Author: Dennis Huo d...@google.com

Closes #8072 from dennishuo/dhuo-yarn-application-tags.
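
(Illustration, not part of the patch: the Client change invokes `setApplicationTags`
reflectively because the method only exists on Hadoop 2.4+. Below is a generic, standalone
Scala sketch of that call-if-present pattern; the class and method names used in the demo
are invented and deliberately do not exist on the target object.)

object OptionalMethodSketch {
  // Call `methodName(arg)` on `target` if the method exists at runtime,
  // otherwise log and carry on (mirrors the "ignore tags on old YARN" behaviour).
  def invokeIfPresent(target: AnyRef, methodName: String, arg: java.util.Set[String]): Unit = {
    try {
      val method = target.getClass.getMethod(methodName, classOf[java.util.Set[String]])
      method.invoke(target, arg)
    } catch {
      case _: NoSuchMethodException =>
        println(s"Ignoring $methodName: not supported by this version of the target class")
    }
  }

  def main(args: Array[String]): Unit = {
    val tags = new java.util.HashSet[String](java.util.Arrays.asList("team:data", "env:prod"))
    // java.util.HashSet has no setApplicationTags, so this exercises the warning path.
    invokeIfPresent(new java.util.HashSet[String](), "setApplicationTags", tags)
  }
}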


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b731fad
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b731fad
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b731fad

Branch: refs/heads/master
Commit: 9b731fad2b43ca18f3c5274062d4c7bc2622ab72
Parents: 80cb25b
Author: Dennis Huo d...@google.com
Authored: Tue Aug 18 14:34:20 2015 -0700
Committer: Sandy Ryza sa...@cloudera.com
Committed: Tue Aug 18 14:34:20 2015 -0700

--
 docs/running-on-yarn.md |  8 +
 .../org/apache/spark/deploy/yarn/Client.scala   | 21 
 .../apache/spark/deploy/yarn/ClientSuite.scala  | 36 
 3 files changed, 65 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9b731fad/docs/running-on-yarn.md
--
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index ec32c41..8ac26e9 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -320,6 +320,14 @@ If you need a reference to the proper location to put log files in the YARN so t
   </td>
 </tr>
 <tr>
+  <td><code>spark.yarn.tags</code></td>
+  <td>(none)</td>
+  <td>
+  Comma-separated list of strings to pass through as YARN application tags appearing
+  in YARN ApplicationReports, which can be used for filtering when querying YARN apps.
+  </td>
+</tr>
+<tr>
   <td><code>spark.yarn.keytab</code></td>
   <td>(none)</td>
   <td>

http://git-wip-us.apache.org/repos/asf/spark/blob/9b731fad/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
--
diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala 
b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
index 6d63dda..5c6a716 100644
--- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
+++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
@@ -163,6 +163,23 @@ private[spark] class Client(
 appContext.setQueue(args.amQueue)
 appContext.setAMContainerSpec(containerContext)
     appContext.setApplicationType("SPARK")
+    sparkConf.getOption(CONF_SPARK_YARN_APPLICATION_TAGS)
+      .map(StringUtils.getTrimmedStringCollection(_))
+      .filter(!_.isEmpty())
+      .foreach { tagCollection =>
+        try {
+          // The setApplicationTags method was only introduced in Hadoop 2.4+, so we need to use
+          // reflection to set it, printing a warning if a tag was specified but the YARN version
+          // doesn't support it.
+          val method = appContext.getClass().getMethod(
+            "setApplicationTags", classOf[java.util.Set[String]])
+          method.invoke(appContext, new java.util.HashSet[String](tagCollection))
+        } catch {
+          case e: NoSuchMethodException =>
+            logWarning(s"Ignoring $CONF_SPARK_YARN_APPLICATION_TAGS because this version of " +
+              "YARN does not support it")
+        }
+      }
     sparkConf.getOption("spark.yarn.maxAppAttempts").map(_.toInt) match {
       case Some(v) => appContext.setMaxAppAttempts(v)
       case None => logDebug("spark.yarn.maxAppAttempts is not set. " +
@@ -987,6 +1004,10 @@ object Client extends Logging {
   // of the executors
   val CONF_SPARK_YARN_SECONDARY_JARS = "spark.yarn.secondary.jars"
 
+  // Comma-separated list of strings to pass through as YARN application tags appearing
+  // in YARN ApplicationReports, which can be used for filtering when querying YARN.
+  val CONF_SPARK_YARN_APPLICATION_TAGS = "spark.yarn.tags"
+
   // Staging directory is private! -> rwx
   val STAGING_DIR_PERMISSION: FsPermission =
     FsPermission.createImmutable(Integer.parseInt("700", 8).toShort)

http://git-wip-us.apache.org/repos/asf/spark/blob/9b731fad/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
--
diff --git a/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala 
b/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
index 837f8d3..0a5402c 100644
--- a/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
+++ b/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala
@@ -29,8 +29,11 @@ import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.mapreduce.MRJobConfig
 import 

spark git commit: [SPARK-10088] [SQL] Add support for stored as avro in HiveQL parser.

2015-08-18 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 74a6b1a13 - 8b0df5a5e


[SPARK-10088] [SQL] Add support for stored as avro in HiveQL parser.

Author: Marcelo Vanzin van...@cloudera.com

Closes #8282 from vanzin/SPARK-10088.
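
(Illustration, not part of the patch: a rough sketch of the DDL this enables, issued through
a HiveContext. The table name and schema are invented, and it assumes a local Spark 1.5-era
build with the Avro SerDe jars on the classpath.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object StoredAsAvroSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("stored-as-avro-sketch").setMaster("local[2]"))
    val hiveContext = new HiveContext(sc)

    // Before the patch this DDL needed explicit ROW FORMAT SERDE / INPUTFORMAT / OUTPUTFORMAT
    // clauses; the parser now resolves STORED AS avro to the Avro SerDe and container formats.
    hiveContext.sql(
      """CREATE TABLE IF NOT EXISTS episodes_demo (title STRING, air_date STRING, doctor INT)
        |STORED AS avro
      """.stripMargin)

    sc.stop()
  }
}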

(cherry picked from commit 492ac1facbc79ee251d45cff315598ec9935a0e2)
Signed-off-by: Michael Armbrust mich...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8b0df5a5
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8b0df5a5
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8b0df5a5

Branch: refs/heads/branch-1.5
Commit: 8b0df5a5e6dbd90d0741e0f2279222574b8acb20
Parents: 74a6b1a
Author: Marcelo Vanzin van...@cloudera.com
Authored: Tue Aug 18 14:45:19 2015 -0700
Committer: Michael Armbrust mich...@databricks.com
Committed: Tue Aug 18 14:45:35 2015 -0700

--
 .../main/scala/org/apache/spark/sql/hive/HiveQl.scala   | 11 +++
 .../scala/org/apache/spark/sql/hive/test/TestHive.scala | 12 ++--
 2 files changed, 13 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/8b0df5a5/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
--
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
index c3f2935..ad33dee 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
@@ -729,6 +729,17 @@ 
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
             inputFormat = Option("org.apache.hadoop.mapred.SequenceFileInputFormat"),
             outputFormat = Option("org.apache.hadoop.mapred.SequenceFileOutputFormat"))
 
+        case "avro" =>
+          tableDesc = tableDesc.copy(
+            inputFormat =
+              Option("org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat"),
+            outputFormat =
+              Option("org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat"))
+          if (tableDesc.serde.isEmpty) {
+            tableDesc = tableDesc.copy(
+              serde = Option("org.apache.hadoop.hive.serde2.avro.AvroSerDe"))
+          }
+
         case _ =>
           throw new SemanticException(
             s"Unrecognized file format in STORED AS clause: ${child.getText}")

http://git-wip-us.apache.org/repos/asf/spark/blob/8b0df5a5/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala
--
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala
index 4eae699..4da8663 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala
@@ -25,10 +25,8 @@ import scala.language.implicitConversions
 
 import org.apache.hadoop.hive.conf.HiveConf.ConfVars
 import org.apache.hadoop.hive.ql.exec.FunctionRegistry
-import org.apache.hadoop.hive.ql.io.avro.{AvroContainerInputFormat, 
AvroContainerOutputFormat}
 import org.apache.hadoop.hive.ql.processors._
 import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe
-import org.apache.hadoop.hive.serde2.avro.AvroSerDe
 
 import org.apache.spark.sql.SQLConf
 import org.apache.spark.sql.catalyst.analysis._
@@ -276,10 +274,7 @@ class TestHiveContext(sc: SparkContext) extends HiveContext(sc) {
       "INSERT OVERWRITE TABLE serdeins SELECT * FROM src".cmd),
     TestTable("episodes",
       s"""CREATE TABLE episodes (title STRING, air_date STRING, doctor INT)
-         |ROW FORMAT SERDE '${classOf[AvroSerDe].getCanonicalName}'
-         |STORED AS
-         |INPUTFORMAT '${classOf[AvroContainerInputFormat].getCanonicalName}'
-         |OUTPUTFORMAT '${classOf[AvroContainerOutputFormat].getCanonicalName}'
+         |STORED AS avro
          |TBLPROPERTIES (
          |  'avro.schema.literal'='{
          |    "type": "record",
@@ -312,10 +307,7 @@ class TestHiveContext(sc: SparkContext) extends HiveContext(sc) {
     TestTable("episodes_part",
       s"""CREATE TABLE episodes_part (title STRING, air_date STRING, doctor INT)
          |PARTITIONED BY (doctor_pt INT)
-         |ROW FORMAT SERDE '${classOf[AvroSerDe].getCanonicalName}'
-         |STORED AS
-         |INPUTFORMAT '${classOf[AvroContainerInputFormat].getCanonicalName}'
-         |OUTPUTFORMAT '${classOf[AvroContainerOutputFormat].getCanonicalName}'
+         |STORED AS avro
          |TBLPROPERTIES (
          |  'avro.schema.literal'='{
          |    "type": "record",



spark git commit: [SPARK-10088] [SQL] Add support for stored as avro in HiveQL parser.

2015-08-18 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/master fa41e0242 - 492ac1fac


[SPARK-10088] [SQL] Add support for stored as avro in HiveQL parser.

Author: Marcelo Vanzin van...@cloudera.com

Closes #8282 from vanzin/SPARK-10088.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/492ac1fa
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/492ac1fa
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/492ac1fa

Branch: refs/heads/master
Commit: 492ac1facbc79ee251d45cff315598ec9935a0e2
Parents: fa41e02
Author: Marcelo Vanzin van...@cloudera.com
Authored: Tue Aug 18 14:45:19 2015 -0700
Committer: Michael Armbrust mich...@databricks.com
Committed: Tue Aug 18 14:45:19 2015 -0700

--
 .../main/scala/org/apache/spark/sql/hive/HiveQl.scala   | 11 +++
 .../scala/org/apache/spark/sql/hive/test/TestHive.scala | 12 ++--
 2 files changed, 13 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/492ac1fa/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
--
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
index c3f2935..ad33dee 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
@@ -729,6 +729,17 @@ 
https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
             inputFormat = Option("org.apache.hadoop.mapred.SequenceFileInputFormat"),
             outputFormat = Option("org.apache.hadoop.mapred.SequenceFileOutputFormat"))
 
+        case "avro" =>
+          tableDesc = tableDesc.copy(
+            inputFormat =
+              Option("org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat"),
+            outputFormat =
+              Option("org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat"))
+          if (tableDesc.serde.isEmpty) {
+            tableDesc = tableDesc.copy(
+              serde = Option("org.apache.hadoop.hive.serde2.avro.AvroSerDe"))
+          }
+
         case _ =>
           throw new SemanticException(
             s"Unrecognized file format in STORED AS clause: ${child.getText}")

http://git-wip-us.apache.org/repos/asf/spark/blob/492ac1fa/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala
--
diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala
index 4eae699..4da8663 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala
@@ -25,10 +25,8 @@ import scala.language.implicitConversions
 
 import org.apache.hadoop.hive.conf.HiveConf.ConfVars
 import org.apache.hadoop.hive.ql.exec.FunctionRegistry
-import org.apache.hadoop.hive.ql.io.avro.{AvroContainerInputFormat, 
AvroContainerOutputFormat}
 import org.apache.hadoop.hive.ql.processors._
 import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe
-import org.apache.hadoop.hive.serde2.avro.AvroSerDe
 
 import org.apache.spark.sql.SQLConf
 import org.apache.spark.sql.catalyst.analysis._
@@ -276,10 +274,7 @@ class TestHiveContext(sc: SparkContext) extends HiveContext(sc) {
       "INSERT OVERWRITE TABLE serdeins SELECT * FROM src".cmd),
     TestTable("episodes",
       s"""CREATE TABLE episodes (title STRING, air_date STRING, doctor INT)
-         |ROW FORMAT SERDE '${classOf[AvroSerDe].getCanonicalName}'
-         |STORED AS
-         |INPUTFORMAT '${classOf[AvroContainerInputFormat].getCanonicalName}'
-         |OUTPUTFORMAT '${classOf[AvroContainerOutputFormat].getCanonicalName}'
+         |STORED AS avro
          |TBLPROPERTIES (
          |  'avro.schema.literal'='{
          |    "type": "record",
@@ -312,10 +307,7 @@ class TestHiveContext(sc: SparkContext) extends HiveContext(sc) {
     TestTable("episodes_part",
       s"""CREATE TABLE episodes_part (title STRING, air_date STRING, doctor INT)
          |PARTITIONED BY (doctor_pt INT)
-         |ROW FORMAT SERDE '${classOf[AvroSerDe].getCanonicalName}'
-         |STORED AS
-         |INPUTFORMAT '${classOf[AvroContainerInputFormat].getCanonicalName}'
-         |OUTPUTFORMAT '${classOf[AvroContainerOutputFormat].getCanonicalName}'
+         |STORED AS avro
          |TBLPROPERTIES (
          |  'avro.schema.literal'='{
          |    "type": "record",


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, 

spark git commit: [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree

2015-08-18 Thread meng
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 8b0df5a5e - 56f4da263


[SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree

Added since tags to mllib.tree

Author: Bryan Cutler bjcut...@us.ibm.com

Closes #7380 from BryanCutler/sinceTag-mllibTree-8924.
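
(Illustration, not part of the patch: a toy Scala class, unrelated to MLlib, showing what the
added `@since` Scaladoc tags look like on a class and on a method.)

/**
 * Toy estimator (not MLlib code) used only to show the Scaladoc convention.
 *
 * @since 1.0.0
 */
class ToyDecisionTree {

  /**
   * Fits the toy "model".
   *
   * @param data training values
   * @return their count; a real implementation would return a model
   * @since 1.2.0
   */
  def run(data: Seq[Double]): Int = data.length
}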

(cherry picked from commit 1dbffba37a84c62202befd3911d25888f958191d)
Signed-off-by: Xiangrui Meng m...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/56f4da26
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/56f4da26
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/56f4da26

Branch: refs/heads/branch-1.5
Commit: 56f4da2633aab6d1f25c03b1cf567c2c68374fb5
Parents: 8b0df5a
Author: Bryan Cutler bjcut...@us.ibm.com
Authored: Tue Aug 18 14:58:30 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Tue Aug 18 14:58:37 2015 -0700

--
 .../apache/spark/mllib/tree/DecisionTree.scala  | 13 +++
 .../spark/mllib/tree/GradientBoostedTrees.scala | 10 ++
 .../apache/spark/mllib/tree/RandomForest.scala  | 10 ++
 .../spark/mllib/tree/configuration/Algo.scala   |  1 +
 .../tree/configuration/BoostingStrategy.scala   |  6 
 .../mllib/tree/configuration/FeatureType.scala  |  1 +
 .../tree/configuration/QuantileStrategy.scala   |  1 +
 .../mllib/tree/configuration/Strategy.scala | 20 ++-
 .../spark/mllib/tree/impurity/Entropy.scala |  4 +++
 .../apache/spark/mllib/tree/impurity/Gini.scala |  4 +++
 .../spark/mllib/tree/impurity/Impurity.scala|  3 ++
 .../spark/mllib/tree/impurity/Variance.scala|  4 +++
 .../spark/mllib/tree/loss/AbsoluteError.scala   |  2 ++
 .../apache/spark/mllib/tree/loss/LogLoss.scala  |  2 ++
 .../org/apache/spark/mllib/tree/loss/Loss.scala |  3 ++
 .../apache/spark/mllib/tree/loss/Losses.scala   |  6 
 .../spark/mllib/tree/loss/SquaredError.scala|  2 ++
 .../mllib/tree/model/DecisionTreeModel.scala| 22 
 .../mllib/tree/model/InformationGainStats.scala |  1 +
 .../apache/spark/mllib/tree/model/Node.scala|  3 ++
 .../apache/spark/mllib/tree/model/Predict.scala |  1 +
 .../apache/spark/mllib/tree/model/Split.scala   |  1 +
 .../mllib/tree/model/treeEnsembleModels.scala   | 37 
 .../org/apache/spark/mllib/tree/package.scala   |  1 +
 24 files changed, 157 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/56f4da26/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
index cecd1fe..e5200b8 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
@@ -43,6 +43,7 @@ import org.apache.spark.util.random.XORShiftRandom
  * @param strategy The configuration parameters for the tree algorithm which 
specify the type
  * of algorithm (classification, regression, etc.), feature 
type (continuous,
  * categorical), depth of the tree, quantile calculation 
strategy, etc.
+ * @since 1.0.0
  */
 @Experimental
 class DecisionTree (private val strategy: Strategy) extends Serializable with 
Logging {
@@ -53,6 +54,7 @@ class DecisionTree (private val strategy: Strategy) extends 
Serializable with Lo
* Method to train a decision tree model over an RDD
* @param input Training data: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]]
* @return DecisionTreeModel that can be used for prediction
+   * @since 1.2.0
*/
   def run(input: RDD[LabeledPoint]): DecisionTreeModel = {
 // Note: random seed will not be used since numTrees = 1.
@@ -62,6 +64,9 @@ class DecisionTree (private val strategy: Strategy) extends 
Serializable with Lo
   }
 }
 
+/**
+ * @since 1.0.0
+ */
 object DecisionTree extends Serializable with Logging {
 
   /**
@@ -79,6 +84,7 @@ object DecisionTree extends Serializable with Logging {
* of algorithm (classification, regression, etc.), feature 
type (continuous,
* categorical), depth of the tree, quantile calculation 
strategy, etc.
* @return DecisionTreeModel that can be used for prediction
+   * @since 1.0.0
   */
   def train(input: RDD[LabeledPoint], strategy: Strategy): DecisionTreeModel = 
{
 new DecisionTree(strategy).run(input)
@@ -100,6 +106,7 @@ object DecisionTree extends Serializable with Logging {
* @param maxDepth Maximum depth of the tree.
* E.g., depth 0 means 1 leaf node; depth 1 means 1 internal 
node + 2 leaf nodes.
* @return DecisionTreeModel that can be used for prediction
+   * @since 

spark git commit: [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree

2015-08-18 Thread meng
Repository: spark
Updated Branches:
  refs/heads/master 492ac1fac - 1dbffba37


[SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree

Added since tags to mllib.tree

Author: Bryan Cutler bjcut...@us.ibm.com

Closes #7380 from BryanCutler/sinceTag-mllibTree-8924.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1dbffba3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1dbffba3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1dbffba3

Branch: refs/heads/master
Commit: 1dbffba37a84c62202befd3911d25888f958191d
Parents: 492ac1f
Author: Bryan Cutler bjcut...@us.ibm.com
Authored: Tue Aug 18 14:58:30 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Tue Aug 18 14:58:30 2015 -0700

--
 .../apache/spark/mllib/tree/DecisionTree.scala  | 13 +++
 .../spark/mllib/tree/GradientBoostedTrees.scala | 10 ++
 .../apache/spark/mllib/tree/RandomForest.scala  | 10 ++
 .../spark/mllib/tree/configuration/Algo.scala   |  1 +
 .../tree/configuration/BoostingStrategy.scala   |  6 
 .../mllib/tree/configuration/FeatureType.scala  |  1 +
 .../tree/configuration/QuantileStrategy.scala   |  1 +
 .../mllib/tree/configuration/Strategy.scala | 20 ++-
 .../spark/mllib/tree/impurity/Entropy.scala |  4 +++
 .../apache/spark/mllib/tree/impurity/Gini.scala |  4 +++
 .../spark/mllib/tree/impurity/Impurity.scala|  3 ++
 .../spark/mllib/tree/impurity/Variance.scala|  4 +++
 .../spark/mllib/tree/loss/AbsoluteError.scala   |  2 ++
 .../apache/spark/mllib/tree/loss/LogLoss.scala  |  2 ++
 .../org/apache/spark/mllib/tree/loss/Loss.scala |  3 ++
 .../apache/spark/mllib/tree/loss/Losses.scala   |  6 
 .../spark/mllib/tree/loss/SquaredError.scala|  2 ++
 .../mllib/tree/model/DecisionTreeModel.scala| 22 
 .../mllib/tree/model/InformationGainStats.scala |  1 +
 .../apache/spark/mllib/tree/model/Node.scala|  3 ++
 .../apache/spark/mllib/tree/model/Predict.scala |  1 +
 .../apache/spark/mllib/tree/model/Split.scala   |  1 +
 .../mllib/tree/model/treeEnsembleModels.scala   | 37 
 .../org/apache/spark/mllib/tree/package.scala   |  1 +
 24 files changed, 157 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1dbffba3/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
--
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
index cecd1fe..e5200b8 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
@@ -43,6 +43,7 @@ import org.apache.spark.util.random.XORShiftRandom
  * @param strategy The configuration parameters for the tree algorithm which 
specify the type
  * of algorithm (classification, regression, etc.), feature 
type (continuous,
  * categorical), depth of the tree, quantile calculation 
strategy, etc.
+ * @since 1.0.0
  */
 @Experimental
 class DecisionTree (private val strategy: Strategy) extends Serializable with 
Logging {
@@ -53,6 +54,7 @@ class DecisionTree (private val strategy: Strategy) extends 
Serializable with Lo
* Method to train a decision tree model over an RDD
* @param input Training data: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]]
* @return DecisionTreeModel that can be used for prediction
+   * @since 1.2.0
*/
   def run(input: RDD[LabeledPoint]): DecisionTreeModel = {
 // Note: random seed will not be used since numTrees = 1.
@@ -62,6 +64,9 @@ class DecisionTree (private val strategy: Strategy) extends 
Serializable with Lo
   }
 }
 
+/**
+ * @since 1.0.0
+ */
 object DecisionTree extends Serializable with Logging {
 
   /**
@@ -79,6 +84,7 @@ object DecisionTree extends Serializable with Logging {
* of algorithm (classification, regression, etc.), feature 
type (continuous,
* categorical), depth of the tree, quantile calculation 
strategy, etc.
* @return DecisionTreeModel that can be used for prediction
+   * @since 1.0.0
   */
   def train(input: RDD[LabeledPoint], strategy: Strategy): DecisionTreeModel = 
{
 new DecisionTree(strategy).run(input)
@@ -100,6 +106,7 @@ object DecisionTree extends Serializable with Logging {
* @param maxDepth Maximum depth of the tree.
* E.g., depth 0 means 1 leaf node; depth 1 means 1 internal 
node + 2 leaf nodes.
* @return DecisionTreeModel that can be used for prediction
+   * @since 1.0.0
*/
   def train(
   input: RDD[LabeledPoint],
@@ -127,6 +134,7 @@ object DecisionTree extends Serializable with 

spark git commit: Bump SparkR version string to 1.5.0

2015-08-18 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master badf7fa65 - 04e0fea79


Bump SparkR version string to 1.5.0

This patch is against master, but we need to apply it to 1.5 branch as well.

cc shivaram  and rxin

Author: Hossein hoss...@databricks.com

Closes #8291 from falaki/SparkRVersion1.5.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/04e0fea7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/04e0fea7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/04e0fea7

Branch: refs/heads/master
Commit: 04e0fea79b9acfa3a3cb81dbacb08f9d287b42c3
Parents: badf7fa
Author: Hossein hoss...@databricks.com
Authored: Tue Aug 18 18:02:22 2015 -0700
Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu
Committed: Tue Aug 18 18:02:22 2015 -0700

--
 R/pkg/DESCRIPTION | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/04e0fea7/R/pkg/DESCRIPTION
--
diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 83e6489..d0d7201 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: SparkR
 Type: Package
 Title: R frontend for Spark
-Version: 1.4.0
+Version: 1.5.0
 Date: 2013-09-09
 Author: The Apache Software Foundation
 Maintainer: Shivaram Venkataraman shiva...@cs.berkeley.edu


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: Bump SparkR version string to 1.5.0

2015-08-18 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 4ee225af8 - 9b42e2404


Bump SparkR version string to 1.5.0

This patch is against master, but we need to apply it to 1.5 branch as well.

cc shivaram  and rxin

Author: Hossein hoss...@databricks.com

Closes #8291 from falaki/SparkRVersion1.5.

(cherry picked from commit 04e0fea79b9acfa3a3cb81dbacb08f9d287b42c3)
Signed-off-by: Shivaram Venkataraman shiva...@cs.berkeley.edu


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b42e240
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b42e240
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b42e240

Branch: refs/heads/branch-1.5
Commit: 9b42e24049e072b315ec80e5bbe2ec5079a94704
Parents: 4ee225a
Author: Hossein hoss...@databricks.com
Authored: Tue Aug 18 18:02:22 2015 -0700
Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu
Committed: Tue Aug 18 18:02:31 2015 -0700

--
 R/pkg/DESCRIPTION | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9b42e240/R/pkg/DESCRIPTION
--
diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION
index 83e6489..d0d7201 100644
--- a/R/pkg/DESCRIPTION
+++ b/R/pkg/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: SparkR
 Type: Package
 Title: R frontend for Spark
-Version: 1.4.0
+Version: 1.5.0
 Date: 2013-09-09
 Author: The Apache Software Foundation
 Maintainer: Shivaram Venkataraman shiva...@cs.berkeley.edu


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10098] [STREAMING] [TEST] Cleanup active context after test in FailureSuite

2015-08-18 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/master c635a16f6 - 9108eff74


[SPARK-10098] [STREAMING] [TEST] Cleanup active context after test in 
FailureSuite

Failures in streaming.FailureSuite can leak StreamingContext and SparkContext 
which fails all subsequent tests
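
(Illustration, not part of the patch: a minimal ScalaTest sketch of the cleanup pattern the
fix adopts, stopping any active StreamingContext after each test so a failed test cannot leak
contexts into later ones. The suite and test names are invented.)

import org.scalatest.{BeforeAndAfter, FunSuite}

import org.apache.spark.streaming.StreamingContext

class CleanupAfterEachTestSuite extends FunSuite with BeforeAndAfter {

  after {
    // Stop whatever StreamingContext (and its underlying SparkContext) a failed
    // test may have left active, so later tests start from a clean slate.
    StreamingContext.getActive().foreach(_.stop(stopSparkContext = true))
  }

  test("example that does not touch Spark") {
    assert(1 + 1 == 2)
  }
}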

Author: Tathagata Das tathagata.das1...@gmail.com

Closes #8289 from tdas/SPARK-10098.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9108eff7
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9108eff7
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9108eff7

Branch: refs/heads/master
Commit: 9108eff74a2815986fd067b273c2a344b6315405
Parents: c635a16
Author: Tathagata Das tathagata.das1...@gmail.com
Authored: Tue Aug 18 17:00:13 2015 -0700
Committer: Tathagata Das tathagata.das1...@gmail.com
Committed: Tue Aug 18 17:00:13 2015 -0700

--
 .../apache/spark/streaming/FailureSuite.scala   | 27 
 1 file changed, 17 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/9108eff7/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala
--
diff --git 
a/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala 
b/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala
index 0c4c065..e82c2fa 100644
--- a/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala
+++ b/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala
@@ -17,25 +17,32 @@
 
 package org.apache.spark.streaming
 
-import org.apache.spark.Logging
+import java.io.File
+
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.{SparkFunSuite, Logging}
 import org.apache.spark.util.Utils
 
 /**
  * This testsuite tests master failures at random times while the stream is 
running using
  * the real clock.
  */
-class FailureSuite extends TestSuiteBase with Logging {
+class FailureSuite extends SparkFunSuite with BeforeAndAfter with Logging {
 
-  val directory = Utils.createTempDir()
-  val numBatches = 30
+  private val batchDuration: Duration = Milliseconds(1000)
+  private val numBatches = 30
+  private var directory: File = null
 
-  override def batchDuration: Duration = Milliseconds(1000)
-
-  override def useManualClock: Boolean = false
+  before {
+directory = Utils.createTempDir()
+  }
 
-  override def afterFunction() {
-Utils.deleteRecursively(directory)
-super.afterFunction()
+  after {
+if (directory != null) {
+  Utils.deleteRecursively(directory)
+}
+StreamingContext.getActive().foreach { _.stop() }
   }
 
   test("multiple failures with map") {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10098] [STREAMING] [TEST] Cleanup active context after test in FailureSuite

2015-08-18 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 fb207b245 - e1b50c7d2


[SPARK-10098] [STREAMING] [TEST] Cleanup active context after test in 
FailureSuite

Failures in streaming.FailureSuite can leak StreamingContext and SparkContext 
which fails all subsequent tests

Author: Tathagata Das tathagata.das1...@gmail.com

Closes #8289 from tdas/SPARK-10098.

(cherry picked from commit 9108eff74a2815986fd067b273c2a344b6315405)
Signed-off-by: Tathagata Das tathagata.das1...@gmail.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e1b50c7d
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e1b50c7d
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e1b50c7d

Branch: refs/heads/branch-1.5
Commit: e1b50c7d2a604f785e5fe9af5d60c426a6ff01c2
Parents: fb207b2
Author: Tathagata Das tathagata.das1...@gmail.com
Authored: Tue Aug 18 17:00:13 2015 -0700
Committer: Tathagata Das tathagata.das1...@gmail.com
Committed: Tue Aug 18 17:00:21 2015 -0700

--
 .../apache/spark/streaming/FailureSuite.scala   | 27 
 1 file changed, 17 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/e1b50c7d/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala
--
diff --git 
a/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala 
b/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala
index 0c4c065..e82c2fa 100644
--- a/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala
+++ b/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala
@@ -17,25 +17,32 @@
 
 package org.apache.spark.streaming
 
-import org.apache.spark.Logging
+import java.io.File
+
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.{SparkFunSuite, Logging}
 import org.apache.spark.util.Utils
 
 /**
  * This testsuite tests master failures at random times while the stream is 
running using
  * the real clock.
  */
-class FailureSuite extends TestSuiteBase with Logging {
+class FailureSuite extends SparkFunSuite with BeforeAndAfter with Logging {
 
-  val directory = Utils.createTempDir()
-  val numBatches = 30
+  private val batchDuration: Duration = Milliseconds(1000)
+  private val numBatches = 30
+  private var directory: File = null
 
-  override def batchDuration: Duration = Milliseconds(1000)
-
-  override def useManualClock: Boolean = false
+  before {
+directory = Utils.createTempDir()
+  }
 
-  override def afterFunction() {
-Utils.deleteRecursively(directory)
-super.afterFunction()
+  after {
+if (directory != null) {
+  Utils.deleteRecursively(directory)
+}
+StreamingContext.getActive().foreach { _.stop() }
   }
 
   test("multiple failures with map") {


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-8473] [SPARK-9889] [ML] User guide and example code for DCT

2015-08-18 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 e1b50c7d2 - 4ee225af8


[SPARK-8473] [SPARK-9889] [ML] User guide and example code for DCT

mengxr jkbradley

Author: Feynman Liang fli...@databricks.com

Closes #8184 from feynmanliang/SPARK-9889-DCT-docs.

(cherry picked from commit badf7fa650f9801c70515907fcc26b58d7ec3143)
Signed-off-by: Joseph K. Bradley jos...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4ee225af
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4ee225af
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4ee225af

Branch: refs/heads/branch-1.5
Commit: 4ee225af8ecb38fbcf8e43ac1c498a76f3590b98
Parents: e1b50c7
Author: Feynman Liang fli...@databricks.com
Authored: Tue Aug 18 17:54:49 2015 -0700
Committer: Joseph K. Bradley jos...@databricks.com
Committed: Tue Aug 18 17:54:58 2015 -0700

--
 docs/ml-features.md | 71 
 1 file changed, 71 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/4ee225af/docs/ml-features.md
--
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 6b2e36b..28a6193 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -649,6 +649,77 @@ for expanded in polyDF.select("polyFeatures").take(3):
 </div>
 </div>
 
+## Discrete Cosine Transform (DCT)
+
+The [Discrete Cosine
+Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform)
+transforms a length $N$ real-valued sequence in the time domain into
+another length $N$ real-valued sequence in the frequency domain. A
+[DCT](api/scala/index.html#org.apache.spark.ml.feature.DCT) class
+provides this functionality, implementing the
+[DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II)
+and scaling the result by $1/\sqrt{2}$ such that the representing matrix
+for the transform is unitary. No shift is applied to the transformed
+sequence (e.g. the $0$th element of the transformed sequence is the
+$0$th DCT coefficient and _not_ the $N/2$th).
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+import org.apache.spark.ml.feature.DCT
+import org.apache.spark.mllib.linalg.Vectors
+
+val data = Seq(
+  Vectors.dense(0.0, 1.0, -2.0, 3.0),
+  Vectors.dense(-1.0, 2.0, 4.0, -7.0),
+  Vectors.dense(14.0, -2.0, -5.0, 1.0))
+val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
+val dct = new DCT()
+  .setInputCol("features")
+  .setOutputCol("featuresDCT")
+  .setInverse(false)
+val dctDf = dct.transform(df)
+dctDf.select("featuresDCT").show(3)
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% highlight java %}
+import java.util.Arrays;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.ml.feature.DCT;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.VectorUDT;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+JavaRDD<Row> data = jsc.parallelize(Arrays.asList(
+  RowFactory.create(Vectors.dense(0.0, 1.0, -2.0, 3.0)),
+  RowFactory.create(Vectors.dense(-1.0, 2.0, 4.0, -7.0)),
+  RowFactory.create(Vectors.dense(14.0, -2.0, -5.0, 1.0))
+));
+StructType schema = new StructType(new StructField[] {
+  new StructField(features, new VectorUDT(), false, Metadata.empty()),
+});
+DataFrame df = jsql.createDataFrame(data, schema);
+DCT dct = new DCT()
+  .setInputCol(features)
+  .setOutputCol(featuresDCT)
+  .setInverse(false);
+DataFrame dctDf = dct.transform(df);
+dctDf.select(featuresDCT).show(3);
+{% endhighlight %}
+/div
+/div
+
 ## StringIndexer
 
 `StringIndexer` encodes a string column of labels to a column of label indices.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
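The new guide section above says the DCT-II output is scaled by $1/\sqrt{2}$ so that the matrix representing the transform is unitary. A small, Spark-free Scala sketch of that orthonormal DCT-II matrix with a numerical orthogonality check (illustrative only, not the Spark implementation itself):

```scala
object DctUnitaryCheck {
  // Orthonormal DCT-II: M(k, n) = sqrt(2/N) * c_k * cos(pi * (2n + 1) * k / (2N)),
  // with c_0 = 1/sqrt(2) and c_k = 1 otherwise -- the extra 1/sqrt(2) the guide mentions.
  def dctMatrix(n: Int): Array[Array[Double]] =
    Array.tabulate(n, n) { (k, j) =>
      val ck = if (k == 0) 1.0 / math.sqrt(2.0) else 1.0
      math.sqrt(2.0 / n) * ck * math.cos(math.Pi * (2 * j + 1) * k / (2.0 * n))
    }

  def main(args: Array[String]): Unit = {
    val n = 4
    val m = dctMatrix(n)
    // M * M^T should be numerically close to the identity, i.e. M is orthogonal (unitary).
    val prod = Array.tabulate(n, n) { (i, j) =>
      (0 until n).map(k => m(i)(k) * m(j)(k)).sum
    }
    prod.foreach(row => println(row.map(v => f"$v%7.4f").mkString(" ")))
  }
}
```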



spark git commit: [SPARK-8473] [SPARK-9889] [ML] User guide and example code for DCT

2015-08-18 Thread jkbradley
Repository: spark
Updated Branches:
  refs/heads/master 9108eff74 -> badf7fa65


[SPARK-8473] [SPARK-9889] [ML] User guide and example code for DCT

mengxr jkbradley

Author: Feynman Liang fli...@databricks.com

Closes #8184 from feynmanliang/SPARK-9889-DCT-docs.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/badf7fa6
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/badf7fa6
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/badf7fa6

Branch: refs/heads/master
Commit: badf7fa650f9801c70515907fcc26b58d7ec3143
Parents: 9108eff
Author: Feynman Liang fli...@databricks.com
Authored: Tue Aug 18 17:54:49 2015 -0700
Committer: Joseph K. Bradley jos...@databricks.com
Committed: Tue Aug 18 17:54:49 2015 -0700

--
 docs/ml-features.md | 71 
 1 file changed, 71 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/badf7fa6/docs/ml-features.md
--
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 6b2e36b..28a6193 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -649,6 +649,77 @@ for expanded in polyDF.select("polyFeatures").take(3):
 </div>
 </div>
 
+## Discrete Cosine Transform (DCT)
+
+The [Discrete Cosine
+Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform)
+transforms a length $N$ real-valued sequence in the time domain into
+another length $N$ real-valued sequence in the frequency domain. A
+[DCT](api/scala/index.html#org.apache.spark.ml.feature.DCT) class
+provides this functionality, implementing the
+[DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II)
+and scaling the result by $1/\sqrt{2}$ such that the representing matrix
+for the transform is unitary. No shift is applied to the transformed
+sequence (e.g. the $0$th element of the transformed sequence is the
+$0$th DCT coefficient and _not_ the $N/2$th).
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+import org.apache.spark.ml.feature.DCT
+import org.apache.spark.mllib.linalg.Vectors
+
+val data = Seq(
+  Vectors.dense(0.0, 1.0, -2.0, 3.0),
+  Vectors.dense(-1.0, 2.0, 4.0, -7.0),
+  Vectors.dense(14.0, -2.0, -5.0, 1.0))
+val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
+val dct = new DCT()
+  .setInputCol("features")
+  .setOutputCol("featuresDCT")
+  .setInverse(false)
+val dctDf = dct.transform(df)
+dctDf.select("featuresDCT").show(3)
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% highlight java %}
+import java.util.Arrays;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.ml.feature.DCT;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.VectorUDT;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+JavaRDD<Row> data = jsc.parallelize(Arrays.asList(
+  RowFactory.create(Vectors.dense(0.0, 1.0, -2.0, 3.0)),
+  RowFactory.create(Vectors.dense(-1.0, 2.0, 4.0, -7.0)),
+  RowFactory.create(Vectors.dense(14.0, -2.0, -5.0, 1.0))
+));
+StructType schema = new StructType(new StructField[] {
+  new StructField("features", new VectorUDT(), false, Metadata.empty()),
+});
+DataFrame df = jsql.createDataFrame(data, schema);
+DCT dct = new DCT()
+  .setInputCol("features")
+  .setOutputCol("featuresDCT")
+  .setInverse(false);
+DataFrame dctDf = dct.transform(df);
+dctDf.select("featuresDCT").show(3);
+{% endhighlight %}
+</div>
+</div>
+
 ## StringIndexer
 
 `StringIndexer` encodes a string column of labels to a column of label indices.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-9705] [DOC] fix docs about Python version

2015-08-18 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 3c33931aa -> 03a8a889a


[SPARK-9705] [DOC] fix docs about Python version

cc JoshRosen

Author: Davies Liu dav...@databricks.com

Closes #8245 from davies/python_doc.

(cherry picked from commit de3223872a217c5224ba7136604f6b7753b29108)
Signed-off-by: Reynold Xin r...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/03a8a889
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/03a8a889
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/03a8a889

Branch: refs/heads/branch-1.5
Commit: 03a8a889a98ab30e4d33dc1a415aa84253111ffa
Parents: 3c33931
Author: Davies Liu dav...@databricks.com
Authored: Tue Aug 18 22:11:27 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Tue Aug 18 22:11:32 2015 -0700

--
 docs/configuration.md |  6 +-
 docs/programming-guide.md | 12 ++--
 2 files changed, 15 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/03a8a889/docs/configuration.md
--
diff --git a/docs/configuration.md b/docs/configuration.md
index 3214709..4a6e4dd 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1561,7 +1561,11 @@ The following variables can be set in `spark-env.sh`:
   </tr>
   <tr>
     <td><code>PYSPARK_PYTHON</code></td>
-    <td>Python binary executable to use for PySpark.</td>
+    <td>Python binary executable to use for PySpark in both driver and workers (default is `python`).</td>
+  </tr>
+  <tr>
+    <td><code>PYSPARK_DRIVER_PYTHON</code></td>
+    <td>Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON).</td>
   </tr>
   <tr>
     <td><code>SPARK_LOCAL_IP</code></td>

http://git-wip-us.apache.org/repos/asf/spark/blob/03a8a889/docs/programming-guide.md
--
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index ae712d6..982c5ea 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -85,8 +85,8 @@ import org.apache.spark.SparkConf
 
 <div data-lang="python"  markdown="1">
 
-Spark {{site.SPARK_VERSION}} works with Python 2.6 or higher (but not Python 3). It uses the standard CPython interpreter,
-so C libraries like NumPy can be used.
+Spark {{site.SPARK_VERSION}} works with Python 2.6+ or Python 3.4+. It can use the standard CPython interpreter,
+so C libraries like NumPy can be used. It also works with PyPy 2.3+.
 
 To run Spark applications in Python, use the `bin/spark-submit` script located in the Spark directory.
 This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster.
@@ -104,6 +104,14 @@ Finally, you need to import some Spark classes into your program. Add the follow
 from pyspark import SparkContext, SparkConf
 {% endhighlight %}
 
+PySpark requires the same minor version of Python in both driver and workers. It uses the default python version in PATH,
+you can specify which version of Python you want to use by `PYSPARK_PYTHON`, for example:
+
+{% highlight bash %}
+$ PYSPARK_PYTHON=python3.4 bin/pyspark
+$ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/python/pi.py
+{% endhighlight %}
+
 </div>
 
 </div>


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-9705] [DOC] fix docs about Python version

2015-08-18 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 1ff0580ed -> de3223872


[SPARK-9705] [DOC] fix docs about Python version

cc JoshRosen

Author: Davies Liu dav...@databricks.com

Closes #8245 from davies/python_doc.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/de322387
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/de322387
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/de322387

Branch: refs/heads/master
Commit: de3223872a217c5224ba7136604f6b7753b29108
Parents: 1ff0580
Author: Davies Liu dav...@databricks.com
Authored: Tue Aug 18 22:11:27 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Tue Aug 18 22:11:27 2015 -0700

--
 docs/configuration.md |  6 +-
 docs/programming-guide.md | 12 ++--
 2 files changed, 15 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/de322387/docs/configuration.md
--
diff --git a/docs/configuration.md b/docs/configuration.md
index 3214709..4a6e4dd 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1561,7 +1561,11 @@ The following variables can be set in `spark-env.sh`:
   </tr>
   <tr>
     <td><code>PYSPARK_PYTHON</code></td>
-    <td>Python binary executable to use for PySpark.</td>
+    <td>Python binary executable to use for PySpark in both driver and workers (default is `python`).</td>
+  </tr>
+  <tr>
+    <td><code>PYSPARK_DRIVER_PYTHON</code></td>
+    <td>Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON).</td>
   </tr>
   <tr>
     <td><code>SPARK_LOCAL_IP</code></td>

http://git-wip-us.apache.org/repos/asf/spark/blob/de322387/docs/programming-guide.md
--
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index ae712d6..982c5ea 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -85,8 +85,8 @@ import org.apache.spark.SparkConf
 
 <div data-lang="python"  markdown="1">
 
-Spark {{site.SPARK_VERSION}} works with Python 2.6 or higher (but not Python 3). It uses the standard CPython interpreter,
-so C libraries like NumPy can be used.
+Spark {{site.SPARK_VERSION}} works with Python 2.6+ or Python 3.4+. It can use the standard CPython interpreter,
+so C libraries like NumPy can be used. It also works with PyPy 2.3+.
 
 To run Spark applications in Python, use the `bin/spark-submit` script located in the Spark directory.
 This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster.
@@ -104,6 +104,14 @@ Finally, you need to import some Spark classes into your program. Add the follow
 from pyspark import SparkContext, SparkConf
 {% endhighlight %}
 
+PySpark requires the same minor version of Python in both driver and workers. It uses the default python version in PATH,
+you can specify which version of Python you want to use by `PYSPARK_PYTHON`, for example:
+
+{% highlight bash %}
+$ PYSPARK_PYTHON=python3.4 bin/pyspark
+$ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/python/pi.py
+{% endhighlight %}
+
 </div>
 
 </div>


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs

2015-08-18 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 1c843e284 -> 010b03ed5


[SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses 
cacheLocs

In Scala, `Seq.fill` always seems to return a List. Accessing a list by index 
is an O(N) operation. Thus, the following code will be really slow (~10 seconds 
on my machine):

```scala
val numItems = 10
val s = Seq.fill(numItems)(1)
for (i <- 0 until numItems) s(i)
```

It turns out that we had a loop like this in DAGScheduler code, although it's a 
little tricky to spot. In `getPreferredLocsInternal`, there's a call to 
`getCacheLocs(rdd)(partition)`.  The `getCacheLocs` call returns a Seq. If this 
Seq is a List and the RDD contains many partitions, then indexing into this 
list will cost O(partitions). Thus, when we loop over our tasks to compute 
their individual preferred locations we implicitly perform an N^2 loop, 
reducing scheduling throughput.

This patch fixes this by replacing `Seq` with `Array`.
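The contrast is easy to reproduce outside Spark. A minimal, self-contained sketch of the same access pattern (the collection size and the timing helper are illustrative assumptions, not part of the patch):

```scala
object IndexedAccessDemo {
  // Crude wall-clock timer, for illustration only.
  def time[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    val numItems = 100000  // assumed size, large enough to make the gap visible

    // Seq.fill hands back a List, and List indexing walks i cells, so this loop is O(N^2).
    val asSeq: Seq[Int] = Seq.fill(numItems)(1)
    time("Seq (List) indexing") { for (i <- 0 until numItems) asSeq(i) }

    // Array indexing is O(1), so the same loop is O(N).
    val asArray: Array[Int] = Array.fill(numItems)(1)
    time("Array indexing") { for (i <- 0 until numItems) asArray(i) }
  }
}
```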

Author: Josh Rosen joshro...@databricks.com

Closes #8178 from JoshRosen/dagscheduler-perf.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/010b03ed
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/010b03ed
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/010b03ed

Branch: refs/heads/master
Commit: 010b03ed52f35fd4d426d522f8a9927ddc579209
Parents: 1c843e2
Author: Josh Rosen joshro...@databricks.com
Authored: Tue Aug 18 22:30:13 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Tue Aug 18 22:30:13 2015 -0700

--
 .../apache/spark/scheduler/DAGScheduler.scala   | 22 ++--
 .../spark/storage/BlockManagerMaster.scala  |  5 +++--
 .../storage/BlockManagerMasterEndpoint.scala|  3 ++-
 .../spark/scheduler/DAGSchedulerSuite.scala |  4 ++--
 4 files changed, 18 insertions(+), 16 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/010b03ed/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
--
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index dadf83a..684db66 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -111,7 +111,7 @@ class DAGScheduler(
*
* All accesses to this map should be guarded by synchronizing on it (see 
SPARK-4454).
*/
-  private val cacheLocs = new HashMap[Int, Seq[Seq[TaskLocation]]]
+  private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
   // For tracking failed nodes, we use the MapOutputTracker's epoch number, 
which is sent with
   // every task. When we detect a node failing, we note the current epoch 
number and failed
@@ -205,12 +205,12 @@ class DAGScheduler(
   }
 
   private[scheduler]
-  def getCacheLocs(rdd: RDD[_]): Seq[Seq[TaskLocation]] = cacheLocs.synchronized {
+  def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized {
     // Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times
     if (!cacheLocs.contains(rdd.id)) {
       // Note: if the storage level is NONE, we don't need to get locations from block manager.
-      val locs: Seq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) {
-        Seq.fill(rdd.partitions.size)(Nil)
+      val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) {
+        IndexedSeq.fill(rdd.partitions.length)(Nil)
       } else {
         val blockIds =
           rdd.partitions.indices.map(index => RDDBlockId(rdd.id, index)).toArray[BlockId]
@@ -302,12 +302,12 @@ class DAGScheduler(
   shuffleDep: ShuffleDependency[_, _, _],
   firstJobId: Int): ShuffleMapStage = {
 val rdd = shuffleDep.rdd
-val numTasks = rdd.partitions.size
+val numTasks = rdd.partitions.length
 val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, 
rdd.creationSite)
 if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
   val serLocs = 
mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
   val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-      for (i <- 0 until locs.size) {
+      for (i <- 0 until locs.length) {
         stage.outputLocs(i) = Option(locs(i)).toList // locs(i) will be null if missing
   }
   stage.numAvailableOutputs = locs.count(_ != null)
@@ -315,7 +315,7 @@ class DAGScheduler(
   // Kind of ugly: need to register RDDs with the cache and map output 
tracker here
   // since we can't do it in the RDD constructor because # of partitions 
is unknown
   

spark git commit: [SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors fix UDFs on complex types

2015-08-18 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master 270ee6777 -> 1ff0580ed


[SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors  fix UDFs 
on complex types

This is kind of a weird case, but given a sufficiently complex query plan (in
this case a TungstenProject with an Exchange underneath), we could have NPEs on
the executors due to when we were calling transformAllExpressions.

In general we should ensure that all transformations occur on the driver and
not on the executors. Some reasons to avoid executor-side transformations include:

* (this case) Some operator constructors require state, such as access to the
Spark/SQL conf, so doing a makeCopy on the executor can fail.
* (an unrelated reason to avoid executor-side transformations) ExprIds are calculated
using an atomic integer, so you can violate their uniqueness constraint by
constructing them anywhere other than the driver.

This subsumes #8285.
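Independent of the Catalyst details, the fix follows a general closure-design pattern: derive anything that needs driver-only state eagerly on the driver, and let the task closure capture only the finished value. A minimal, Spark-free sketch of that pattern (all names here are illustrative, not the actual operator code):

```scala
object DriverSideRewrite {
  // Pretend the rewrite needs driver-only state, e.g. a configuration map.
  def rewrite(expr: String, conf: Map[String, String]): String =
    if (conf.getOrElse("tungsten.enabled", "true") == "true") s"unsafe($expr)" else expr

  def main(args: Array[String]): Unit = {
    val conf = Map("tungsten.enabled" -> "true")   // available only on the "driver"
    val projectList = Seq("a + 1", "struct(b, c)")
    val partitions = Seq(Seq(1, 2), Seq(3))        // stand-in for RDD partitions

    // Rewrite once, up front; the per-partition function below captures only the
    // already-rewritten strings, never `conf` or the rewrite logic itself.
    val rewritten = projectList.map(rewrite(_, conf))

    val result = partitions.map { rows =>          // stand-in for mapPartitions
      rows.map(row => rewritten.map(e => s"$e @ row $row"))
    }
    result.foreach(println)
  }
}
```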

Author: Reynold Xin r...@databricks.com
Author: Michael Armbrust mich...@databricks.com

Closes #8295 from rxin/SPARK-10096.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1ff0580e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1ff0580e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1ff0580e

Branch: refs/heads/master
Commit: 1ff0580eda90f9247a5233809667f5cebaea290e
Parents: 270ee67
Author: Reynold Xin r...@databricks.com
Authored: Tue Aug 18 22:08:15 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Tue Aug 18 22:08:15 2015 -0700

--
 .../expressions/complexTypeCreator.scala|  8 +++-
 .../spark/sql/execution/basicOperators.scala| 12 ++---
 .../spark/sql/DataFrameComplexTypeSuite.scala   | 46 
 .../org/apache/spark/sql/DataFrameSuite.scala   |  9 
 4 files changed, 68 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1ff0580e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
index 298aee3..1c54671 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
@@ -206,7 +206,9 @@ case class CreateStructUnsafe(children: Seq[Expression]) 
extends Expression {
 
   override def nullable: Boolean = false
 
-  override def eval(input: InternalRow): Any = throw new 
UnsupportedOperationException
+  override def eval(input: InternalRow): Any = {
+InternalRow(children.map(_.eval(input)): _*)
+  }
 
   override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): 
String = {
 val eval = GenerateUnsafeProjection.createCode(ctx, children)
@@ -244,7 +246,9 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def nullable: Boolean = false
 
-  override def eval(input: InternalRow): Any = throw new 
UnsupportedOperationException
+  override def eval(input: InternalRow): Any = {
+InternalRow(valExprs.map(_.eval(input)): _*)
+  }
 
   override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): 
String = {
 val eval = GenerateUnsafeProjection.createCode(ctx, valExprs)

http://git-wip-us.apache.org/repos/asf/spark/blob/1ff0580e/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
index 77b9806..3f68b05 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
@@ -75,14 +75,16 @@ case class TungstenProject(projectList: 
Seq[NamedExpression], child: SparkPlan)
 
   override def output: Seq[Attribute] = projectList.map(_.toAttribute)
 
+  /** Rewrite the project list to use unsafe expressions as needed. */
+  protected val unsafeProjectList = projectList.map(_ transform {
+    case CreateStruct(children) => CreateStructUnsafe(children)
+    case CreateNamedStruct(children) => CreateNamedStructUnsafe(children)
+  })
+
   protected override def doExecute(): RDD[InternalRow] = {
     val numRows = longMetric("numRows")
     child.execute().mapPartitions { iter =>
-      this.transformAllExpressions {
-        case CreateStruct(children) => CreateStructUnsafe(children)
-        case 

spark git commit: [SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors fix UDFs on complex types

2015-08-18 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 11c933505 -> 3c33931aa


[SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors  fix UDFs 
on complex types

This is kind of a weird case, but given a sufficiently complex query plan (in
this case a TungstenProject with an Exchange underneath), we could have NPEs on
the executors due to when we were calling transformAllExpressions.

In general we should ensure that all transformations occur on the driver and
not on the executors. Some reasons to avoid executor-side transformations include:

* (this case) Some operator constructors require state, such as access to the
Spark/SQL conf, so doing a makeCopy on the executor can fail.
* (an unrelated reason to avoid executor-side transformations) ExprIds are calculated
using an atomic integer, so you can violate their uniqueness constraint by
constructing them anywhere other than the driver.

This subsumes #8285.

Author: Reynold Xin r...@databricks.com
Author: Michael Armbrust mich...@databricks.com

Closes #8295 from rxin/SPARK-10096.

(cherry picked from commit 1ff0580eda90f9247a5233809667f5cebaea290e)
Signed-off-by: Reynold Xin r...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3c33931a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3c33931a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3c33931a

Branch: refs/heads/branch-1.5
Commit: 3c33931aa58db8ccc138e7f2e3c8ee94d25c7242
Parents: 11c9335
Author: Reynold Xin r...@databricks.com
Authored: Tue Aug 18 22:08:15 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Tue Aug 18 22:08:22 2015 -0700

--
 .../expressions/complexTypeCreator.scala|  8 +++-
 .../spark/sql/execution/basicOperators.scala| 12 ++---
 .../spark/sql/DataFrameComplexTypeSuite.scala   | 46 
 .../org/apache/spark/sql/DataFrameSuite.scala   |  9 
 4 files changed, 68 insertions(+), 7 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3c33931a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
--
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
index 298aee3..1c54671 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
@@ -206,7 +206,9 @@ case class CreateStructUnsafe(children: Seq[Expression]) 
extends Expression {
 
   override def nullable: Boolean = false
 
-  override def eval(input: InternalRow): Any = throw new 
UnsupportedOperationException
+  override def eval(input: InternalRow): Any = {
+InternalRow(children.map(_.eval(input)): _*)
+  }
 
   override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): 
String = {
 val eval = GenerateUnsafeProjection.createCode(ctx, children)
@@ -244,7 +246,9 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def nullable: Boolean = false
 
-  override def eval(input: InternalRow): Any = throw new 
UnsupportedOperationException
+  override def eval(input: InternalRow): Any = {
+InternalRow(valExprs.map(_.eval(input)): _*)
+  }
 
   override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): 
String = {
 val eval = GenerateUnsafeProjection.createCode(ctx, valExprs)

http://git-wip-us.apache.org/repos/asf/spark/blob/3c33931a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
--
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
index 77b9806..3f68b05 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala
@@ -75,14 +75,16 @@ case class TungstenProject(projectList: 
Seq[NamedExpression], child: SparkPlan)
 
   override def output: Seq[Attribute] = projectList.map(_.toAttribute)
 
+  /** Rewrite the project list to use unsafe expressions as needed. */
+  protected val unsafeProjectList = projectList.map(_ transform {
+    case CreateStruct(children) => CreateStructUnsafe(children)
+    case CreateNamedStruct(children) => CreateNamedStructUnsafe(children)
+  })
+
   protected override def doExecute(): RDD[InternalRow] = {
     val numRows = longMetric("numRows")
     child.execute().mapPartitions { iter 

spark git commit: [SPARK-9508] GraphX Pregel docs update with new Pregel code

2015-08-18 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 03a8a889a -> 416392697


[SPARK-9508] GraphX Pregel docs update with new Pregel code

SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be 
modified accordingly since it lists the old Pregel code
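For context, here is a compact sketch of how the Pregel operator documented in that guide is typically invoked, adapted from the standard single-source shortest-paths example (the log-normal test graph and an existing SparkContext `sc` are assumptions made for illustration):

```scala
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.util.GraphGenerators

// Single-source shortest paths driven through graph.pregel, assuming `sc` exists.
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42

// The source starts at distance 0, every other vertex at infinity.
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),          // vertex program
  triplet => {                                             // send messages along cheaper edges
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b)                                 // merge messages
)
println(sssp.vertices.take(10).mkString("\n"))
```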

Author: Alexander Ulanov na...@yandex.ru

Closes #7831 from avulanov/SPARK-9508-pregel-doc2.

(cherry picked from commit 1c843e284818004f16c3f1101c33b510f80722e3)
Signed-off-by: Reynold Xin r...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/41639269
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/41639269
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/41639269

Branch: refs/heads/branch-1.5
Commit: 416392697f20de27b37db0cf0bad15a0e5ac9ebe
Parents: 03a8a88
Author: Alexander Ulanov na...@yandex.ru
Authored: Tue Aug 18 22:13:52 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Tue Aug 18 22:13:57 2015 -0700

--
 docs/graphx-programming-guide.md | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/41639269/docs/graphx-programming-guide.md
--
diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 99f8c82..c861a763 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -768,16 +768,14 @@ class GraphOps[VD, ED] {
 // Loop until no messages remain or maxIterations is achieved
 var i = 0
    while (activeMessages > 0 && i < maxIterations) {
-  // Receive the messages: 
---
-  // Run the vertex program on all vertices that receive messages
-  val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
-  // Merge the new vertex values back into the graph
-      g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()
-  // Send Messages: 
--
-  // Vertices that didn't receive a message above don't appear in newVerts 
and therefore don't
-  // get to send messages.  More precisely the map phase of 
mapReduceTriplets is only invoked
-  // on edges in the activeDir of vertices in newVerts
-  messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, 
activeDir))).cache()
+  // Receive the messages and update the vertices.
+  g = g.joinVertices(messages)(vprog).cache()
+  val oldMessages = messages
+  // Send new messages, skipping edges where neither side received a 
message. We must cache
+  // messages so it can be materialized on the next line, allowing us to 
uncache the previous
+  // iteration.
+  messages = g.mapReduceTriplets(
+sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
   activeMessages = messages.count()
   i += 1
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-9508] GraphX Pregel docs update with new Pregel code

2015-08-18 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master de3223872 -> 1c843e284


[SPARK-9508] GraphX Pregel docs update with new Pregel code

SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be 
modified accordingly since it lists the old Pregel code

Author: Alexander Ulanov na...@yandex.ru

Closes #7831 from avulanov/SPARK-9508-pregel-doc2.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1c843e28
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1c843e28
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1c843e28

Branch: refs/heads/master
Commit: 1c843e284818004f16c3f1101c33b510f80722e3
Parents: de32238
Author: Alexander Ulanov na...@yandex.ru
Authored: Tue Aug 18 22:13:52 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Tue Aug 18 22:13:52 2015 -0700

--
 docs/graphx-programming-guide.md | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1c843e28/docs/graphx-programming-guide.md
--
diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 99f8c82..c861a763 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -768,16 +768,14 @@ class GraphOps[VD, ED] {
 // Loop until no messages remain or maxIterations is achieved
 var i = 0
    while (activeMessages > 0 && i < maxIterations) {
-  // Receive the messages: 
---
-  // Run the vertex program on all vertices that receive messages
-  val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
-  // Merge the new vertex values back into the graph
-      g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()
-  // Send Messages: 
--
-  // Vertices that didn't receive a message above don't appear in newVerts 
and therefore don't
-  // get to send messages.  More precisely the map phase of 
mapReduceTriplets is only invoked
-  // on edges in the activeDir of vertices in newVerts
-  messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, 
activeDir))).cache()
+  // Receive the messages and update the vertices.
+  g = g.joinVertices(messages)(vprog).cache()
+  val oldMessages = messages
+  // Send new messages, skipping edges where neither side received a 
message. We must cache
+  // messages so it can be materialized on the next line, allowing us to 
uncache the previous
+  // iteration.
+  messages = g.mapReduceTriplets(
+sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
   activeMessages = messages.count()
   i += 1
 }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs

2015-08-18 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 416392697 -> 3ceee5572


[SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses 
cacheLocs

In Scala, `Seq.fill` always seems to return a List. Accessing a list by index 
is an O(N) operation. Thus, the following code will be really slow (~10 seconds 
on my machine):

```scala
val numItems = 10
val s = Seq.fill(numItems)(1)
for (i <- 0 until numItems) s(i)
```

It turns out that we had a loop like this in DAGScheduler code, although it's a 
little tricky to spot. In `getPreferredLocsInternal`, there's a call to 
`getCacheLocs(rdd)(partition)`.  The `getCacheLocs` call returns a Seq. If this 
Seq is a List and the RDD contains many partitions, then indexing into this 
list will cost O(partitions). Thus, when we loop over our tasks to compute 
their individual preferred locations we implicitly perform an N^2 loop, 
reducing scheduling throughput.

This patch fixes this by replacing `Seq` with `Array`.

Author: Josh Rosen joshro...@databricks.com

Closes #8178 from JoshRosen/dagscheduler-perf.

(cherry picked from commit 010b03ed52f35fd4d426d522f8a9927ddc579209)
Signed-off-by: Reynold Xin r...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3ceee557
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3ceee557
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3ceee557

Branch: refs/heads/branch-1.5
Commit: 3ceee5572fc7be690d3009f4d43af9e4611c0fa1
Parents: 4163926
Author: Josh Rosen joshro...@databricks.com
Authored: Tue Aug 18 22:30:13 2015 -0700
Committer: Reynold Xin r...@databricks.com
Committed: Tue Aug 18 22:30:20 2015 -0700

--
 .../apache/spark/scheduler/DAGScheduler.scala   | 22 ++--
 .../spark/storage/BlockManagerMaster.scala  |  5 +++--
 .../storage/BlockManagerMasterEndpoint.scala|  3 ++-
 .../spark/scheduler/DAGSchedulerSuite.scala |  4 ++--
 4 files changed, 18 insertions(+), 16 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/3ceee557/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
--
diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
index dadf83a..684db66 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
@@ -111,7 +111,7 @@ class DAGScheduler(
*
* All accesses to this map should be guarded by synchronizing on it (see 
SPARK-4454).
*/
-  private val cacheLocs = new HashMap[Int, Seq[Seq[TaskLocation]]]
+  private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
 
   // For tracking failed nodes, we use the MapOutputTracker's epoch number, 
which is sent with
   // every task. When we detect a node failing, we note the current epoch 
number and failed
@@ -205,12 +205,12 @@ class DAGScheduler(
   }
 
   private[scheduler]
-  def getCacheLocs(rdd: RDD[_]): Seq[Seq[TaskLocation]] = cacheLocs.synchronized {
+  def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized {
     // Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times
     if (!cacheLocs.contains(rdd.id)) {
       // Note: if the storage level is NONE, we don't need to get locations from block manager.
-      val locs: Seq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) {
-        Seq.fill(rdd.partitions.size)(Nil)
+      val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) {
+        IndexedSeq.fill(rdd.partitions.length)(Nil)
       } else {
         val blockIds =
           rdd.partitions.indices.map(index => RDDBlockId(rdd.id, index)).toArray[BlockId]
@@ -302,12 +302,12 @@ class DAGScheduler(
   shuffleDep: ShuffleDependency[_, _, _],
   firstJobId: Int): ShuffleMapStage = {
 val rdd = shuffleDep.rdd
-val numTasks = rdd.partitions.size
+val numTasks = rdd.partitions.length
 val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, 
rdd.creationSite)
 if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
   val serLocs = 
mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
   val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
-      for (i <- 0 until locs.size) {
+      for (i <- 0 until locs.length) {
         stage.outputLocs(i) = Option(locs(i)).toList // locs(i) will be null if missing
   }
   stage.numAvailableOutputs = locs.count(_ != null)
@@ -315,7 +315,7 @@ class DAGScheduler(
   // Kind of ugly: need to register RDDs with the cache and map output 

spark git commit: [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide

2015-08-18 Thread meng
Repository: spark
Updated Branches:
  refs/heads/master f4fa61eff -> 747c2ba80


[SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide

Add Python example for mllib LDAModel user guide

Author: Yanbo Liang yblia...@gmail.com

Closes #8227 from yanboliang/spark-10032.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/747c2ba8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/747c2ba8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/747c2ba8

Branch: refs/heads/master
Commit: 747c2ba8006d5b86f3be8dfa9ace639042a35628
Parents: f4fa61e
Author: Yanbo Liang yblia...@gmail.com
Authored: Tue Aug 18 12:56:36 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Tue Aug 18 12:56:36 2015 -0700

--
 docs/mllib-clustering.md | 28 
 1 file changed, 28 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/747c2ba8/docs/mllib-clustering.md
--
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index bb875ae..fd9ab25 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -564,6 +564,34 @@ public class JavaLDAExample {
 {% endhighlight %}
 /div
 
+div data-lang=python markdown=1
+{% highlight python %}
+from pyspark.mllib.clustering import LDA, LDAModel
+from pyspark.mllib.linalg import Vectors
+
+# Load and parse the data
+data = sc.textFile("data/mllib/sample_lda_data.txt")
+parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
+# Index documents with unique IDs
+corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
+
+# Cluster the documents into three topics using LDA
+ldaModel = LDA.train(corpus, k=3)
+
+# Output topics. Each is a distribution over words (matching word count vectors)
+print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):")
+topics = ldaModel.topicsMatrix()
+for topic in range(3):
+    print("Topic " + str(topic) + ":")
+    for word in range(0, ldaModel.vocabSize()):
+        print(" " + str(topics[word][topic]))
+
+# Save and load model
+ldaModel.save(sc, "myModelPath")
+sameModel = LDAModel.load(sc, "myModelPath")
+{% endhighlight %}
+/div
+
 /div
 
 ## Streaming k-means


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide

2015-08-18 Thread meng
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 80debff12 -> ec7079f9c


[SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide

Add Python example for mllib LDAModel user guide

Author: Yanbo Liang yblia...@gmail.com

Closes #8227 from yanboliang/spark-10032.

(cherry picked from commit 747c2ba8006d5b86f3be8dfa9ace639042a35628)
Signed-off-by: Xiangrui Meng m...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ec7079f9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ec7079f9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ec7079f9

Branch: refs/heads/branch-1.5
Commit: ec7079f9c94cb98efdac6f92b7c85efb0e67492e
Parents: 80debff
Author: Yanbo Liang yblia...@gmail.com
Authored: Tue Aug 18 12:56:36 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Tue Aug 18 12:56:43 2015 -0700

--
 docs/mllib-clustering.md | 28 
 1 file changed, 28 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ec7079f9/docs/mllib-clustering.md
--
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index bb875ae..fd9ab25 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -564,6 +564,34 @@ public class JavaLDAExample {
 {% endhighlight %}
 /div
 
+div data-lang=python markdown=1
+{% highlight python %}
+from pyspark.mllib.clustering import LDA, LDAModel
+from pyspark.mllib.linalg import Vectors
+
+# Load and parse the data
+data = sc.textFile("data/mllib/sample_lda_data.txt")
+parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
+# Index documents with unique IDs
+corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
+
+# Cluster the documents into three topics using LDA
+ldaModel = LDA.train(corpus, k=3)
+
+# Output topics. Each is a distribution over words (matching word count vectors)
+print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):")
+topics = ldaModel.topicsMatrix()
+for topic in range(3):
+    print("Topic " + str(topic) + ":")
+    for word in range(0, ldaModel.vocabSize()):
+        print(" " + str(topics[word][topic]))
+
+# Save and load model
+ldaModel.save(sc, "myModelPath")
+sameModel = LDAModel.load(sc, "myModelPath")
+{% endhighlight %}
+/div
+
 /div
 
 ## Streaming k-means


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide

2015-08-18 Thread meng
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 7ff0e5d2f -> 80debff12


[SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression 
user guide

Add Python examples for mllib IsotonicRegression user guide

Author: Yanbo Liang yblia...@gmail.com

Closes #8225 from yanboliang/spark-10029.

(cherry picked from commit f4fa61effe34dae2f0eab0bef57b2dee220cf92f)
Signed-off-by: Xiangrui Meng m...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80debff1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80debff1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80debff1

Branch: refs/heads/branch-1.5
Commit: 80debff123e0b5dcc4e6f5899753a736de2c8e75
Parents: 7ff0e5d
Author: Yanbo Liang yblia...@gmail.com
Authored: Tue Aug 18 12:55:36 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Tue Aug 18 12:55:42 2015 -0700

--
 docs/mllib-isotonic-regression.md | 35 ++
 1 file changed, 35 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/80debff1/docs/mllib-isotonic-regression.md
--
diff --git a/docs/mllib-isotonic-regression.md 
b/docs/mllib-isotonic-regression.md
index 5732bc4..6aa881f 100644
--- a/docs/mllib-isotonic-regression.md
+++ b/docs/mllib-isotonic-regression.md
@@ -160,4 +160,39 @@ model.save(sc.sc(), myModelPath);
 IsotonicRegressionModel sameModel = IsotonicRegressionModel.load(sc.sc(), 
myModelPath);
 {% endhighlight %}
 /div
+
+div data-lang=python markdown=1
+Data are read from a file where each line has a format label,feature
+i.e. 4710.28,500.00. The data are split to training and testing set.
+Model is created using the training set and a mean squared error is calculated 
from the predicted
+labels and real labels in the test set.
+
+{% highlight python %}
+import math
+from pyspark.mllib.regression import IsotonicRegression, IsotonicRegressionModel
+
+data = sc.textFile("data/mllib/sample_isotonic_regression_data.txt")
+
+# Create label, feature, weight tuples from input data with weight set to default value 1.0.
+parsedData = data.map(lambda line: tuple([float(x) for x in line.split(',')]) + (1.0,))
+
+# Split data into training (60%) and test (40%) sets.
+training, test = parsedData.randomSplit([0.6, 0.4], 11)
+
+# Create isotonic regression model from training data.
+# Isotonic parameter defaults to true so it is only shown for demonstration
+model = IsotonicRegression.train(training)
+
+# Create tuples of predicted and real labels.
+predictionAndLabel = test.map(lambda p: (model.predict(p[1]), p[0]))
+
+# Calculate mean squared error between predicted and real labels.
+meanSquaredError = predictionAndLabel.map(lambda pl: math.pow((pl[0] - pl[1]), 2)).mean()
+print("Mean Squared Error = " + str(meanSquaredError))
+
+# Save and load model
+model.save(sc, "myModelPath")
+sameModel = IsotonicRegressionModel.load(sc, "myModelPath")
+{% endhighlight %}
+/div
 /div


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-9574] [STREAMING] Remove unnecessary contents of spark-streaming-XXX-assembly jars

2015-08-18 Thread tdas
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 9bd2e6f7c -> 2bccd918f


[SPARK-9574] [STREAMING] Remove unnecessary contents of 
spark-streaming-XXX-assembly jars

Removed contents already included in Spark assembly jar from 
spark-streaming-XXX-assembly jars.

Author: zsxwing zsxw...@gmail.com

Closes #8069 from zsxwing/SPARK-9574.

(cherry picked from commit bf1d6614dcb8f5974e62e406d9c0f8aac52556d3)
Signed-off-by: Tathagata Das tathagata.das1...@gmail.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2bccd918
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2bccd918
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2bccd918

Branch: refs/heads/branch-1.5
Commit: 2bccd918fcd278f0b544e61b9675ecdf2d6974b3
Parents: 9bd2e6f
Author: zsxwing zsxw...@gmail.com
Authored: Tue Aug 18 13:35:45 2015 -0700
Committer: Tathagata Das tathagata.das1...@gmail.com
Committed: Tue Aug 18 13:36:25 2015 -0700

--
 external/flume-assembly/pom.xml | 11 +
 external/kafka-assembly/pom.xml | 84 
 external/mqtt-assembly/pom.xml  | 74 
 extras/kinesis-asl-assembly/pom.xml | 79 ++
 pom.xml |  2 +-
 5 files changed, 249 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/2bccd918/external/flume-assembly/pom.xml
--
diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml
index 1318959..e05e431 100644
--- a/external/flume-assembly/pom.xml
+++ b/external/flume-assembly/pom.xml
@@ -69,6 +69,11 @@
       <scope>provided</scope>
     </dependency>
     <dependency>
+      <groupId>commons-lang</groupId>
+      <artifactId>commons-lang</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
       <groupId>commons-net</groupId>
       <artifactId>commons-net</artifactId>
       <scope>provided</scope>
@@ -89,6 +94,12 @@
       <scope>provided</scope>
     </dependency>
     <dependency>
+      <groupId>org.apache.avro</groupId>
+      <artifactId>avro-mapred</artifactId>
+      <classifier>${avro.mapred.classifier}</classifier>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
       <groupId>org.scala-lang</groupId>
       <artifactId>scala-library</artifactId>
       <scope>provided</scope>

http://git-wip-us.apache.org/repos/asf/spark/blob/2bccd918/external/kafka-assembly/pom.xml
--
diff --git a/external/kafka-assembly/pom.xml b/external/kafka-assembly/pom.xml
index 977514f..36342f3 100644
--- a/external/kafka-assembly/pom.xml
+++ b/external/kafka-assembly/pom.xml
@@ -47,6 +47,90 @@
       <version>${project.version}</version>
       <scope>provided</scope>
     </dependency>
+    <!--
+      Demote already included in the Spark assembly.
+    -->
+    <dependency>
+      <groupId>commons-codec</groupId>
+      <artifactId>commons-codec</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>commons-lang</groupId>
+      <artifactId>commons-lang</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.google.protobuf</groupId>
+      <artifactId>protobuf-java</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.sun.jersey</groupId>
+      <artifactId>jersey-server</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.sun.jersey</groupId>
+      <artifactId>jersey-core</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>net.jpountz.lz4</groupId>
+      <artifactId>lz4</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-client</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.avro</groupId>
+      <artifactId>avro-mapred</artifactId>
+      <classifier>${avro.mapred.classifier}</classifier>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.curator</groupId>
+      <artifactId>curator-recipes</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.zookeeper</groupId>
+      <artifactId>zookeeper</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>log4j</groupId>
+      <artifactId>log4j</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>net.java.dev.jets3t</groupId>
+      <artifactId>jets3t</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.scala-lang</groupId>
+      <artifactId>scala-library</artifactId>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.slf4j</groupId>
+      

spark git commit: [SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple parameters functions

2015-08-18 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/master 5723d26d7 -> 1968276af


[SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple parameters 
functions

### JIRA
[[SPARK-10007] Update `NAMESPACE` file in SparkR for simple parameters 
functions - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10007)

Author: Yuu ISHIKAWA yuu.ishik...@gmail.com

Closes #8277 from yu-iskw/SPARK-10007.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1968276a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1968276a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1968276a

Branch: refs/heads/master
Commit: 1968276af0f681fe51328b7dd795bd21724a5441
Parents: 5723d26
Author: Yuu ISHIKAWA yuu.ishik...@gmail.com
Authored: Tue Aug 18 09:10:59 2015 -0700
Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu
Committed: Tue Aug 18 09:10:59 2015 -0700

--
 R/pkg/NAMESPACE | 50 +++---
 1 file changed, 47 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/1968276a/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index fd9dfdf..607aef2 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -87,48 +87,86 @@ exportMethods(abs,
   alias,
   approxCountDistinct,
   asc,
+  ascii,
   asin,
   atan,
   atan2,
   avg,
+  base64,
   between,
+  bin,
+  bitwiseNOT,
   cast,
   cbrt,
+  ceil,
   ceiling,
+  concat,
   contains,
   cos,
   cosh,
-  concat,
+  count,
   countDistinct,
+  crc32,
+  datediff,
+  dayofmonth,
+  dayofyear,
   desc,
   endsWith,
   exp,
+  explode,
   expm1,
+  factorial,
+  first,
   floor,
   getField,
   getItem,
   greatest,
+  hex,
+  hour,
   hypot,
+  initcap,
+  isNaN,
   isNotNull,
   isNull,
-  lit,
   last,
+  last_day,
   least,
+  length,
+  levenshtein,
   like,
+  lit,
   log,
   log10,
   log1p,
+  log2,
   lower,
+  ltrim,
   max,
+  md5,
   mean,
   min,
+  minute,
+  month,
+  months_between,
   n,
   n_distinct,
+  nanvl,
+  negate,
+  pmod,
+  quarter,
+  reverse,
   rint,
   rlike,
+  round,
+  rtrim,
+  second,
+  sha1,
   sign,
+  signum,
   sin,
   sinh,
+  size,
+  soundex,
   sqrt,
   startsWith,
   substr,
@@ -138,7 +176,13 @@ exportMethods(abs,
   tanh,
   toDegrees,
   toRadians,
-  upper)
+  to_date,
+  trim,
+  unbase64,
+  unhex,
+  upper,
+  weekofyear,
+  year)
 
 exportClasses(GroupedData)
 exportMethods(agg)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



spark git commit: [SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple parameters functions

2015-08-18 Thread shivaram
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 a512250cd -> 20a760a00


[SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple parameters 
functions

### JIRA
[[SPARK-10007] Update `NAMESPACE` file in SparkR for simple parameters 
functions - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10007)

Author: Yuu ISHIKAWA yuu.ishik...@gmail.com

Closes #8277 from yu-iskw/SPARK-10007.

(cherry picked from commit 1968276af0f681fe51328b7dd795bd21724a5441)
Signed-off-by: Shivaram Venkataraman shiva...@cs.berkeley.edu


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/20a760a0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/20a760a0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/20a760a0

Branch: refs/heads/branch-1.5
Commit: 20a760a00ae188a68b877f052842834e8b7570e6
Parents: a512250
Author: Yuu ISHIKAWA yuu.ishik...@gmail.com
Authored: Tue Aug 18 09:10:59 2015 -0700
Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu
Committed: Tue Aug 18 09:11:22 2015 -0700

--
 R/pkg/NAMESPACE | 50 +++---
 1 file changed, 47 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/20a760a0/R/pkg/NAMESPACE
--
diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index fd9dfdf..607aef2 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -87,48 +87,86 @@ exportMethods(abs,
   alias,
   approxCountDistinct,
   asc,
+  ascii,
   asin,
   atan,
   atan2,
   avg,
+  base64,
   between,
+  bin,
+  bitwiseNOT,
   cast,
   cbrt,
+  ceil,
   ceiling,
+  concat,
   contains,
   cos,
   cosh,
-  concat,
+  count,
   countDistinct,
+  crc32,
+  datediff,
+  dayofmonth,
+  dayofyear,
   desc,
   endsWith,
   exp,
+  explode,
   expm1,
+  factorial,
+  first,
   floor,
   getField,
   getItem,
   greatest,
+  hex,
+  hour,
   hypot,
+  initcap,
+  isNaN,
   isNotNull,
   isNull,
-  lit,
   last,
+  last_day,
   least,
+  length,
+  levenshtein,
   like,
+  lit,
   log,
   log10,
   log1p,
+  log2,
   lower,
+  ltrim,
   max,
+  md5,
   mean,
   min,
+  minute,
+  month,
+  months_between,
   n,
   n_distinct,
+  nanvl,
+  negate,
+  pmod,
+  quarter,
+  reverse,
   rint,
   rlike,
+  round,
+  rtrim,
+  second,
+  sha1,
   sign,
+  signum,
   sin,
   sinh,
+  size,
+  soundex,
   sqrt,
   startsWith,
   substr,
@@ -138,7 +176,13 @@ exportMethods(abs,
   tanh,
   toDegrees,
   toRadians,
-  upper)
+  to_date,
+  trim,
+  unbase64,
+  unhex,
+  upper,
+  weekofyear,
+  year)
 
 exportClasses(GroupedData)
 exportMethods(agg)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org