spark git commit: [MINOR] fix the comments in IndexShuffleBlockResolver
Repository: spark Updated Branches: refs/heads/master dd0614fd6 - c34e9ff0e [MINOR] fix the comments in IndexShuffleBlockResolver it might be a typo introduced at the first moment or some leftover after some renaming.. the name of the method accessing the index file is called `getBlockData` now (not `getBlockLocation` as indicated in the comments) Author: CodingCat zhunans...@gmail.com Closes #8238 from CodingCat/minor_1. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c34e9ff0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c34e9ff0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c34e9ff0 Branch: refs/heads/master Commit: c34e9ff0eac2032283b959fe63b47cc30f28d21c Parents: dd0614f Author: CodingCat zhunans...@gmail.com Authored: Tue Aug 18 10:31:11 2015 +0100 Committer: Sean Owen so...@cloudera.com Committed: Tue Aug 18 10:31:11 2015 +0100 -- .../scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c34e9ff0/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala -- diff --git a/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala b/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala index fae6955..d0163d3 100644 --- a/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala +++ b/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala @@ -71,7 +71,7 @@ private[spark] class IndexShuffleBlockResolver(conf: SparkConf) extends ShuffleB /** * Write an index file with the offsets of each block, plus a final offset at the end for the - * end of the output file. This will be used by getBlockLocation to figure out where each block + * end of the output file. This will be used by getBlockData to figure out where each block * begins and ends. * */ def writeIndexFile(shuffleId: Int, mapId: Int, lengths: Array[Long]): Unit = { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
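For context on what the corrected comment refers to: `writeIndexFile` stores one offset per block plus a trailing end-of-file offset, and `getBlockData` derives a block's byte range from two adjacent offsets. Below is a minimal illustrative sketch of that lookup, not the actual `IndexShuffleBlockResolver` code; the helper name and the simplified I/O are assumptions.

```scala
import java.io.{DataInputStream, File, FileInputStream}

// Illustrative only: the index file is a sequence of 8-byte Long offsets, with one
// extra entry marking the end of the data file. Reading offsets reduceId and
// reduceId + 1 yields the byte range of that block in the shuffle data file.
def blockByteRange(indexFile: File, reduceId: Int): (Long, Long) = {
  val in = new DataInputStream(new FileInputStream(indexFile))
  try {
    in.skip(reduceId * 8L)       // jump to the offset entry for this block
    val start = in.readLong()    // where the block begins
    val end = in.readLong()      // where the next block begins, i.e. where this one ends
    (start, end)
  } finally {
    in.close()
  }
}
```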
spark git commit: [MINOR] fix the comments in IndexShuffleBlockResolver
Repository: spark Updated Branches: refs/heads/branch-1.5 40b89c38a - 42a0b4890 [MINOR] fix the comments in IndexShuffleBlockResolver it might be a typo introduced at the first moment or some leftover after some renaming.. the name of the method accessing the index file is called `getBlockData` now (not `getBlockLocation` as indicated in the comments) Author: CodingCat zhunans...@gmail.com Closes #8238 from CodingCat/minor_1. (cherry picked from commit c34e9ff0eac2032283b959fe63b47cc30f28d21c) Signed-off-by: Sean Owen so...@cloudera.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/42a0b489 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/42a0b489 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/42a0b489 Branch: refs/heads/branch-1.5 Commit: 42a0b4890a2b91a9104541b10bab7652ceeb3bd2 Parents: 40b89c3 Author: CodingCat zhunans...@gmail.com Authored: Tue Aug 18 10:31:11 2015 +0100 Committer: Sean Owen so...@cloudera.com Committed: Tue Aug 18 10:31:25 2015 +0100 -- .../scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/42a0b489/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala -- diff --git a/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala b/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala index fae6955..d0163d3 100644 --- a/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala +++ b/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala @@ -71,7 +71,7 @@ private[spark] class IndexShuffleBlockResolver(conf: SparkConf) extends ShuffleB /** * Write an index file with the offsets of each block, plus a final offset at the end for the - * end of the output file. This will be used by getBlockLocation to figure out where each block + * end of the output file. This will be used by getBlockData to figure out where each block * begins and ends. * */ def writeIndexFile(shuffleId: Int, mapId: Int, lengths: Array[Long]): Unit = { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary in ArrayData
Repository: spark Updated Branches: refs/heads/branch-1.5 2803e8b2e - e5fbe4f24 [SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary in ArrayData The type for array of array in Java is slightly different than array of others. cc cloud-fan Author: Davies Liu dav...@databricks.com Closes #8250 from davies/array_binary. (cherry picked from commit 5af3838d2e59ed83766f85634e26918baa53819f) Signed-off-by: Reynold Xin r...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e5fbe4f2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e5fbe4f2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e5fbe4f2 Branch: refs/heads/branch-1.5 Commit: e5fbe4f24aa805d78546e5a11122aa60dde83709 Parents: 2803e8b Author: Davies Liu dav...@databricks.com Authored: Mon Aug 17 23:27:55 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 17 23:28:02 2015 -0700 -- .../codegen/GenerateUnsafeProjection.scala | 12 --- .../codegen/GeneratedProjectionSuite.scala | 21 +++- 2 files changed, 29 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e5fbe4f2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala index b2fb913..b570fe8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala @@ -224,7 +224,7 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro // go through the input array to calculate how many bytes we need. val calculateNumBytes = elementType match { - case _ if (ctx.isPrimitiveType(elementType)) = + case _ if ctx.isPrimitiveType(elementType) = // Should we do word align? val elementSize = elementType.defaultSize s @@ -237,6 +237,7 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro case _ = val writer = getWriter(elementType) val elementSize = s$writer.getSize($elements[$index]) +// TODO(davies): avoid the copy val unsafeType = elementType match { case _: StructType = UnsafeRow case _: ArrayType = UnsafeArrayData @@ -249,8 +250,13 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro case _ = } +val newElements = if (elementType == BinaryType) { + snew byte[$numElements][] +} else { + snew $unsafeType[$numElements] +} s - final $unsafeType[] $elements = new $unsafeType[$numElements]; + final $unsafeType[] $elements = $newElements; for (int $index = 0; $index $numElements; $index++) { ${convertedElement.code} if (!${convertedElement.isNull}) { @@ -262,7 +268,7 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro } val writeElement = elementType match { - case _ if (ctx.isPrimitiveType(elementType)) = + case _ if ctx.isPrimitiveType(elementType) = // Should we do word align? 
val elementSize = elementType.defaultSize s http://git-wip-us.apache.org/repos/asf/spark/blob/e5fbe4f2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala index 8c7ee87..098944a 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala @@ -20,7 +20,7 @@ package org.apache.spark.sql.catalyst.expressions.codegen import org.apache.spark.SparkFunSuite import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType} +import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String /** @@
spark git commit: [MINOR] Format the comment of `translate` at `functions.scala`
Repository: spark Updated Branches: refs/heads/branch-1.5 35542504c - 2803e8b2e [MINOR] Format the comment of `translate` at `functions.scala` Author: Yu ISHIKAWA yuu.ishik...@gmail.com Closes #8265 from yu-iskw/minor-translate-comment. (cherry picked from commit a0910315dae88b033e38a1de07f39ca21f6552ad) Signed-off-by: Reynold Xin r...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2803e8b2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2803e8b2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2803e8b2 Branch: refs/heads/branch-1.5 Commit: 2803e8b2ee21d74ff9e29f2256d401d63f2b5fef Parents: 3554250 Author: Yu ISHIKAWA yuu.ishik...@gmail.com Authored: Mon Aug 17 23:27:11 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 17 23:27:19 2015 -0700 -- .../scala/org/apache/spark/sql/functions.scala | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2803e8b2/sql/core/src/main/scala/org/apache/spark/sql/functions.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala index 79c5f59..435e631 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala @@ -1863,14 +1863,15 @@ object functions { def substring_index(str: Column, delim: String, count: Int): Column = SubstringIndex(str.expr, lit(delim).expr, lit(count).expr) - /* Translate any character in the src by a character in replaceString. - * The characters in replaceString is corresponding to the characters in matchingString. - * The translate will happen when any character in the string matching with the character - * in the matchingString. - * - * @group string_funcs - * @since 1.5.0 - */ + /** + * Translate any character in the src by a character in replaceString. + * The characters in replaceString is corresponding to the characters in matchingString. + * The translate will happen when any character in the string matching with the character + * in the matchingString. + * + * @group string_funcs + * @since 1.5.0 + */ def translate(src: Column, matchingString: String, replaceString: String): Column = StringTranslate(src.expr, lit(matchingString).expr, lit(replaceString).expr) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [MINOR] Format the comment of `translate` at `functions.scala`
Repository: spark Updated Branches: refs/heads/master e290029a3 - a0910315d [MINOR] Format the comment of `translate` at `functions.scala` Author: Yu ISHIKAWA yuu.ishik...@gmail.com Closes #8265 from yu-iskw/minor-translate-comment. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a0910315 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a0910315 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a0910315 Branch: refs/heads/master Commit: a0910315dae88b033e38a1de07f39ca21f6552ad Parents: e290029 Author: Yu ISHIKAWA yuu.ishik...@gmail.com Authored: Mon Aug 17 23:27:11 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 17 23:27:11 2015 -0700 -- .../scala/org/apache/spark/sql/functions.scala | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a0910315/sql/core/src/main/scala/org/apache/spark/sql/functions.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala index 79c5f59..435e631 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/functions.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/functions.scala @@ -1863,14 +1863,15 @@ object functions { def substring_index(str: Column, delim: String, count: Int): Column = SubstringIndex(str.expr, lit(delim).expr, lit(count).expr) - /* Translate any character in the src by a character in replaceString. - * The characters in replaceString is corresponding to the characters in matchingString. - * The translate will happen when any character in the string matching with the character - * in the matchingString. - * - * @group string_funcs - * @since 1.5.0 - */ + /** + * Translate any character in the src by a character in replaceString. + * The characters in replaceString is corresponding to the characters in matchingString. + * The translate will happen when any character in the string matching with the character + * in the matchingString. + * + * @group string_funcs + * @since 1.5.0 + */ def translate(src: Column, matchingString: String, replaceString: String): Column = StringTranslate(src.expr, lit(matchingString).expr, lit(replaceString).expr) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
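As a quick illustration of the function whose Scaladoc was reformatted above, here is a minimal usage sketch; the DataFrame `df` and its `path` column are hypothetical.

```scala
import org.apache.spark.sql.functions.{col, translate}

// Hypothetical DataFrame `df` with a string column "path": each '/' becomes '-' and
// each '.' becomes '_', following the matchingString/replaceString correspondence
// described in the comment.
val cleaned = df.select(translate(col("path"), "/.", "-_").as("cleaned_path"))
```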
spark git commit: [SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary in ArrayData
Repository: spark Updated Branches: refs/heads/master a0910315d - 5af3838d2 [SPARK-10038] [SQL] fix bug in generated unsafe projection when there is binary in ArrayData The type for array of array in Java is slightly different than array of others. cc cloud-fan Author: Davies Liu dav...@databricks.com Closes #8250 from davies/array_binary. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5af3838d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5af3838d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5af3838d Branch: refs/heads/master Commit: 5af3838d2e59ed83766f85634e26918baa53819f Parents: a091031 Author: Davies Liu dav...@databricks.com Authored: Mon Aug 17 23:27:55 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 17 23:27:55 2015 -0700 -- .../codegen/GenerateUnsafeProjection.scala | 12 --- .../codegen/GeneratedProjectionSuite.scala | 21 +++- 2 files changed, 29 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5af3838d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala index b2fb913..b570fe8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala @@ -224,7 +224,7 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro // go through the input array to calculate how many bytes we need. val calculateNumBytes = elementType match { - case _ if (ctx.isPrimitiveType(elementType)) = + case _ if ctx.isPrimitiveType(elementType) = // Should we do word align? val elementSize = elementType.defaultSize s @@ -237,6 +237,7 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro case _ = val writer = getWriter(elementType) val elementSize = s$writer.getSize($elements[$index]) +// TODO(davies): avoid the copy val unsafeType = elementType match { case _: StructType = UnsafeRow case _: ArrayType = UnsafeArrayData @@ -249,8 +250,13 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro case _ = } +val newElements = if (elementType == BinaryType) { + snew byte[$numElements][] +} else { + snew $unsafeType[$numElements] +} s - final $unsafeType[] $elements = new $unsafeType[$numElements]; + final $unsafeType[] $elements = $newElements; for (int $index = 0; $index $numElements; $index++) { ${convertedElement.code} if (!${convertedElement.isNull}) { @@ -262,7 +268,7 @@ object GenerateUnsafeProjection extends CodeGenerator[Seq[Expression], UnsafePro } val writeElement = elementType match { - case _ if (ctx.isPrimitiveType(elementType)) = + case _ if ctx.isPrimitiveType(elementType) = // Should we do word align? 
val elementSize = elementType.defaultSize s http://git-wip-us.apache.org/repos/asf/spark/blob/5af3838d/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala index 8c7ee87..098944a 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GeneratedProjectionSuite.scala @@ -20,7 +20,7 @@ package org.apache.spark.sql.catalyst.expressions.codegen import org.apache.spark.SparkFunSuite import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType} +import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String /** @@ -79,4 +79,23 @@ class GeneratedProjectionSuite extends SparkFunSuite { val row2 = mutableProj(result)
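The core of the fix is visible in the hunk above: when the array's element type is `BinaryType`, each element is itself a `byte[]`, so the generated Java source must allocate `new byte[n][]` rather than an array of an unsafe wrapper type. A rough sketch of that decision, with illustrative names rather than the real `GenerateUnsafeProjection` internals:

```scala
import org.apache.spark.sql.types._

// Illustrative mapping from element type to the Java allocation emitted by the
// generated projection; the BinaryType branch is the point of this patch.
def newElementsJava(elementType: DataType, numElements: String): String = elementType match {
  case BinaryType    => s"new byte[$numElements][]"          // array of byte arrays
  case _: StructType => s"new UnsafeRow[$numElements]"
  case _: ArrayType  => s"new UnsafeArrayData[$numElements]"
  case _             => s"new Object[$numElements]"          // simplified catch-all
}

// e.g. newElementsJava(BinaryType, "numElements") == "new byte[numElements][]"
```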
spark git commit: [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public
Repository: spark Updated Branches: refs/heads/master 5af3838d2 - dd0614fd6 [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public Fix the issue that ```layers``` and ```weights``` should be public variables of ```MultilayerPerceptronClassificationModel```. Users can not get ```layers``` and ```weights``` from a ```MultilayerPerceptronClassificationModel``` currently. Author: Yanbo Liang yblia...@gmail.com Closes #8263 from yanboliang/mlp-public. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dd0614fd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dd0614fd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dd0614fd Branch: refs/heads/master Commit: dd0614fd618ad28cb77aecfbd49bb319b98fdba0 Parents: 5af3838 Author: Yanbo Liang yblia...@gmail.com Authored: Mon Aug 17 23:57:02 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 23:57:02 2015 -0700 -- .../spark/ml/classification/MultilayerPerceptronClassifier.scala | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dd0614fd/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala index c154561..ccca4ec 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala @@ -172,8 +172,8 @@ class MultilayerPerceptronClassifier(override val uid: String) @Experimental class MultilayerPerceptronClassificationModel private[ml] ( override val uid: String, -layers: Array[Int], -weights: Vector) +val layers: Array[Int], +val weights: Vector) extends PredictionModel[Vector, MultilayerPerceptronClassificationModel] with Serializable { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
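With `layers` and `weights` now public `val`s, a fitted model's topology and parameters can be inspected directly from user code. A minimal sketch, assuming a training DataFrame `train` with `label` and `features` columns:

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 4, 3))   // input size, two hidden layers, output size
  .setBlockSize(128)
  .setMaxIter(100)

val model = trainer.fit(train)

// Both members are now reachable on the fitted model:
println(model.layers.mkString(", "))   // 4, 5, 4, 3
println(model.weights.size)            // total number of fitted weights
```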
spark git commit: [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public
Repository: spark Updated Branches: refs/heads/branch-1.5 e5fbe4f24 - 40b89c38a [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public Fix the issue that ```layers``` and ```weights``` should be public variables of ```MultilayerPerceptronClassificationModel```. Users can not get ```layers``` and ```weights``` from a ```MultilayerPerceptronClassificationModel``` currently. Author: Yanbo Liang yblia...@gmail.com Closes #8263 from yanboliang/mlp-public. (cherry picked from commit dd0614fd618ad28cb77aecfbd49bb319b98fdba0) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/40b89c38 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/40b89c38 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/40b89c38 Branch: refs/heads/branch-1.5 Commit: 40b89c38ada5edfdd1478dc8f3c983ebcbcc56d5 Parents: e5fbe4f Author: Yanbo Liang yblia...@gmail.com Authored: Mon Aug 17 23:57:02 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 23:57:14 2015 -0700 -- .../spark/ml/classification/MultilayerPerceptronClassifier.scala | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/40b89c38/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala index c154561..ccca4ec 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala @@ -172,8 +172,8 @@ class MultilayerPerceptronClassifier(override val uid: String) @Experimental class MultilayerPerceptronClassificationModel private[ml] ( override val uid: String, -layers: Array[Int], -weights: Vector) +val layers: Array[Int], +val weights: Vector) extends PredictionModel[Vector, MultilayerPerceptronClassificationModel] with Serializable { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4J
Repository: spark Updated Branches: refs/heads/master c34e9ff0e - 5723d26d7 [SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4J Parquet hard coded a JUL logger which always writes to stdout. This PR redirects it via SLF4j JUL bridge handler, so that we can control Parquet logs via `log4j.properties`. This solution is inspired by https://github.com/Parquet/parquet-mr/issues/390#issuecomment-46064909. Author: Cheng Lian l...@databricks.com Closes #8196 from liancheng/spark-8118/redirect-parquet-jul. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5723d26d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5723d26d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5723d26d Branch: refs/heads/master Commit: 5723d26d7e677b89383de3fcf2c9a821b68a65b7 Parents: c34e9ff Author: Cheng Lian l...@databricks.com Authored: Tue Aug 18 20:15:33 2015 +0800 Committer: Cheng Lian l...@databricks.com Committed: Tue Aug 18 20:15:33 2015 +0800 -- conf/log4j.properties.template | 2 + .../datasources/parquet/ParquetRelation.scala | 77 ++-- .../parquet/ParquetTableSupport.scala | 1 - .../parquet/ParquetTypesConverter.scala | 3 - sql/hive/src/test/resources/log4j.properties| 7 +- 5 files changed, 47 insertions(+), 43 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5723d26d/conf/log4j.properties.template -- diff --git a/conf/log4j.properties.template b/conf/log4j.properties.template index 27006e4..74c5cea 100644 --- a/conf/log4j.properties.template +++ b/conf/log4j.properties.template @@ -10,6 +10,8 @@ log4j.logger.org.spark-project.jetty=WARN log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO +log4j.logger.org.apache.parquet=ERROR +log4j.logger.parquet=ERROR # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL http://git-wip-us.apache.org/repos/asf/spark/blob/5723d26d/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala index 52fac18..68169d4 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala @@ -18,7 +18,7 @@ package org.apache.spark.sql.execution.datasources.parquet import java.net.URI -import java.util.logging.{Level, Logger = JLogger} +import java.util.logging.{Logger = JLogger} import java.util.{List = JList} import scala.collection.JavaConversions._ @@ -31,22 +31,22 @@ import org.apache.hadoop.io.Writable import org.apache.hadoop.mapreduce._ import org.apache.hadoop.mapreduce.lib.input.FileInputFormat import org.apache.parquet.filter2.predicate.FilterApi +import org.apache.parquet.hadoop._ import org.apache.parquet.hadoop.metadata.CompressionCodecName import org.apache.parquet.hadoop.util.ContextUtil -import org.apache.parquet.hadoop.{ParquetOutputCommitter, ParquetRecordReader, _} import org.apache.parquet.schema.MessageType -import org.apache.parquet.{Log = ParquetLog} +import org.apache.parquet.{Log = 
ApacheParquetLog} +import org.slf4j.bridge.SLF4JBridgeHandler -import org.apache.spark.{Logging, Partition = SparkPartition, SparkException} import org.apache.spark.broadcast.Broadcast -import org.apache.spark.rdd.{SqlNewHadoopPartition, SqlNewHadoopRDD, RDD} -import org.apache.spark.rdd.RDD._ +import org.apache.spark.rdd.{RDD, SqlNewHadoopPartition, SqlNewHadoopRDD} import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.execution.datasources.PartitionSpec import org.apache.spark.sql.sources._ import org.apache.spark.sql.types.{DataType, StructType} import org.apache.spark.util.{SerializableConfiguration, Utils} +import org.apache.spark.{Logging, Partition = SparkPartition, SparkException} private[sql] class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister { @@ -759,38 +759,39 @@ private[sql] object ParquetRelation extends Logging { }.toOption } - def enableLogForwarding() { -// Note: the org.apache.parquet.Log class has a static initializer that -
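For reference, the redirection relies on SLF4J's JUL bridge: remove the default `java.util.logging` handlers and install `SLF4JBridgeHandler`, after which the `log4j.logger.org.apache.parquet=ERROR` entries added to `log4j.properties.template` above govern Parquet's output. A generic sketch of that pattern (the actual Spark change performs this once, inside the `ParquetRelation` object):

```scala
import java.util.logging.{Logger => JLogger}
import org.slf4j.bridge.SLF4JBridgeHandler

def redirectJulToSlf4j(): Unit = {
  // Drop the default JUL console handlers so Parquet logs stop going straight to stdout...
  val rootLogger = JLogger.getLogger("")
  rootLogger.getHandlers.foreach(rootLogger.removeHandler)
  // ...and route java.util.logging through SLF4J (and hence log4j) instead.
  SLF4JBridgeHandler.install()
}
```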
spark git commit: [SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4J
Repository: spark Updated Branches: refs/heads/branch-1.5 42a0b4890 - a512250cd [SPARK-8118] [SQL] Redirects Parquet JUL logger via SLF4J Parquet hard coded a JUL logger which always writes to stdout. This PR redirects it via SLF4j JUL bridge handler, so that we can control Parquet logs via `log4j.properties`. This solution is inspired by https://github.com/Parquet/parquet-mr/issues/390#issuecomment-46064909. Author: Cheng Lian l...@databricks.com Closes #8196 from liancheng/spark-8118/redirect-parquet-jul. (cherry picked from commit 5723d26d7e677b89383de3fcf2c9a821b68a65b7) Signed-off-by: Cheng Lian l...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a512250c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a512250c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a512250c Branch: refs/heads/branch-1.5 Commit: a512250cd19288a7ad8fb600d06544f8728b2dd1 Parents: 42a0b48 Author: Cheng Lian l...@databricks.com Authored: Tue Aug 18 20:15:33 2015 +0800 Committer: Cheng Lian l...@databricks.com Committed: Tue Aug 18 20:16:13 2015 +0800 -- conf/log4j.properties.template | 2 + .../datasources/parquet/ParquetRelation.scala | 77 ++-- .../parquet/ParquetTableSupport.scala | 1 - .../parquet/ParquetTypesConverter.scala | 3 - sql/hive/src/test/resources/log4j.properties| 7 +- 5 files changed, 47 insertions(+), 43 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a512250c/conf/log4j.properties.template -- diff --git a/conf/log4j.properties.template b/conf/log4j.properties.template index 27006e4..74c5cea 100644 --- a/conf/log4j.properties.template +++ b/conf/log4j.properties.template @@ -10,6 +10,8 @@ log4j.logger.org.spark-project.jetty=WARN log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO +log4j.logger.org.apache.parquet=ERROR +log4j.logger.parquet=ERROR # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL http://git-wip-us.apache.org/repos/asf/spark/blob/a512250c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala index 52fac18..68169d4 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala @@ -18,7 +18,7 @@ package org.apache.spark.sql.execution.datasources.parquet import java.net.URI -import java.util.logging.{Level, Logger = JLogger} +import java.util.logging.{Logger = JLogger} import java.util.{List = JList} import scala.collection.JavaConversions._ @@ -31,22 +31,22 @@ import org.apache.hadoop.io.Writable import org.apache.hadoop.mapreduce._ import org.apache.hadoop.mapreduce.lib.input.FileInputFormat import org.apache.parquet.filter2.predicate.FilterApi +import org.apache.parquet.hadoop._ import org.apache.parquet.hadoop.metadata.CompressionCodecName import org.apache.parquet.hadoop.util.ContextUtil -import org.apache.parquet.hadoop.{ParquetOutputCommitter, ParquetRecordReader, _} import 
org.apache.parquet.schema.MessageType -import org.apache.parquet.{Log = ParquetLog} +import org.apache.parquet.{Log = ApacheParquetLog} +import org.slf4j.bridge.SLF4JBridgeHandler -import org.apache.spark.{Logging, Partition = SparkPartition, SparkException} import org.apache.spark.broadcast.Broadcast -import org.apache.spark.rdd.{SqlNewHadoopPartition, SqlNewHadoopRDD, RDD} -import org.apache.spark.rdd.RDD._ +import org.apache.spark.rdd.{RDD, SqlNewHadoopPartition, SqlNewHadoopRDD} import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.execution.datasources.PartitionSpec import org.apache.spark.sql.sources._ import org.apache.spark.sql.types.{DataType, StructType} import org.apache.spark.util.{SerializableConfiguration, Utils} +import org.apache.spark.{Logging, Partition = SparkPartition, SparkException} private[sql] class DefaultSource extends HadoopFsRelationProvider with DataSourceRegister { @@ -759,38 +759,39 @@ private[sql] object ParquetRelation extends Logging {
spark git commit: [SPARK-7736] [CORE] Fix a race introduced in PythonRunner.
Repository: spark
Updated Branches:
  refs/heads/master 354f4582b -> c1840a862

[SPARK-7736] [CORE] Fix a race introduced in PythonRunner.

The fix for SPARK-7736 introduced a race where a port value of -1 could be passed down to the pyspark process, causing it to fail to connect back to the JVM. This change adds code to fix that race.

Author: Marcelo Vanzin van...@cloudera.com

Closes #8258 from vanzin/SPARK-7736.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c1840a86
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c1840a86
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c1840a86

Branch: refs/heads/master
Commit: c1840a862eb548bc4306e53ee7e9f26986b31832
Parents: 354f458
Author: Marcelo Vanzin van...@cloudera.com
Authored: Tue Aug 18 11:36:36 2015 -0700
Committer: Marcelo Vanzin van...@cloudera.com
Committed: Tue Aug 18 11:36:36 2015 -0700

--
 .../main/scala/org/apache/spark/deploy/PythonRunner.scala | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/c1840a86/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala
--
diff --git a/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala b/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala
index 4277ac2..23d01e9 100644
--- a/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala
@@ -52,10 +52,16 @@ object PythonRunner {
         gatewayServer.start()
       }
     })
-    thread.setName("py4j-gateway")
+    thread.setName("py4j-gateway-init")
     thread.setDaemon(true)
     thread.start()

+    // Wait until the gateway server has started, so that we know which port is it bound to.
+    // `gatewayServer.start()` will start a new thread and run the server code there, after
+    // initializing the socket, so the thread started above will end as soon as the server is
+    // ready to serve connections.
+    thread.join()
+
     // Build up a PYTHONPATH that includes the Spark assembly JAR (where this class is), the
     // python directories in SPARK_HOME (if set), and any files in the pyFiles argument
     val pathElements = new ArrayBuffer[String]

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
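The race existed because the gateway port was read before the initializer thread had bound the server socket; joining on that thread closes the window. A self-contained sketch of the start-then-join pattern (illustrative names, not the Py4J API):

```scala
import java.net.ServerSocket
import java.util.concurrent.atomic.AtomicInteger

val boundPort = new AtomicInteger(-1)
val initThread = new Thread("py4j-gateway-init") {
  override def run(): Unit = {
    val server = new ServerSocket(0)     // bind to an ephemeral port
    boundPort.set(server.getLocalPort)   // publish the port, then return
    // (in the real code, gatewayServer.start() hands serving off to its own thread
    //  once the socket is bound, so the init thread ends as soon as the port is known)
  }
}
initThread.setDaemon(true)
initThread.start()
initThread.join()   // without this join, boundPort could still read as -1 below
require(boundPort.get() != -1, "gateway port must be known before launching the Python process")
```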
spark git commit: [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel
Repository: spark Updated Branches: refs/heads/branch-1.5 20a760a00 - b86378cf2 [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel jira: https://issues.apache.org/jira/browse/SPARK-9028 Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency. I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn. Author: Yuhao Yang hhb...@gmail.com Author: Joseph K. Bradley jos...@databricks.com Closes #7388 from hhbyyh/cvEstimator. (cherry picked from commit 354f4582b637fa25d3892ec2b12869db50ed83c9) Signed-off-by: Joseph K. Bradley jos...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b86378cf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b86378cf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b86378cf Branch: refs/heads/branch-1.5 Commit: b86378cf29f8fdb70e41b2f04d831b8a15c1c859 Parents: 20a760a Author: Yuhao Yang hhb...@gmail.com Authored: Tue Aug 18 11:00:09 2015 -0700 Committer: Joseph K. Bradley jos...@databricks.com Committed: Tue Aug 18 11:00:22 2015 -0700 -- .../spark/ml/feature/CountVectorizer.scala | 235 +++ .../spark/ml/feature/CountVectorizerModel.scala | 82 --- .../spark/ml/feature/CountVectorizerSuite.scala | 167 + .../spark/ml/feature/CountVectorizorSuite.scala | 73 -- 4 files changed, 402 insertions(+), 155 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b86378cf/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala new file mode 100644 index 000..49028e4 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.annotation.Experimental +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util.{Identifiable, SchemaUtils} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.mllib.linalg.{VectorUDT, Vectors} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.DataFrame +import org.apache.spark.util.collection.OpenHashMap + +/** + * Params for [[CountVectorizer]] and [[CountVectorizerModel]]. 
+ */ +private[feature] trait CountVectorizerParams extends Params with HasInputCol with HasOutputCol { + + /** + * Max size of the vocabulary. + * CountVectorizer will build a vocabulary that only considers the top + * vocabSize terms ordered by term frequency across the corpus. + * + * Default: 2^18^ + * @group param + */ + val vocabSize: IntParam = +new IntParam(this, vocabSize, max size of the vocabulary, ParamValidators.gt(0)) + + /** @group getParam */ + def getVocabSize: Int = $(vocabSize) + + /** + * Specifies the minimum number of different documents a term must appear in to be included + * in the vocabulary. + * If this is an integer = 1, this specifies the number of documents the term must appear in; + * if this is a double in [0,1), then this specifies the fraction of documents. + * + * Default: 1 + * @group param + */ + val minDF: DoubleParam = new DoubleParam(this, minDF, Specifies the minimum number of + + different documents a term must appear in to be included in the vocabulary. + + If this is an integer = 1, this specifies the number of documents the term must + + appear in; if this is a double in [0,1), then this specifies the fraction of
spark git commit: [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel
Repository: spark Updated Branches: refs/heads/master 1968276af - 354f4582b [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel jira: https://issues.apache.org/jira/browse/SPARK-9028 Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency. I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn. Author: Yuhao Yang hhb...@gmail.com Author: Joseph K. Bradley jos...@databricks.com Closes #7388 from hhbyyh/cvEstimator. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/354f4582 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/354f4582 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/354f4582 Branch: refs/heads/master Commit: 354f4582b637fa25d3892ec2b12869db50ed83c9 Parents: 1968276 Author: Yuhao Yang hhb...@gmail.com Authored: Tue Aug 18 11:00:09 2015 -0700 Committer: Joseph K. Bradley jos...@databricks.com Committed: Tue Aug 18 11:00:09 2015 -0700 -- .../spark/ml/feature/CountVectorizer.scala | 235 +++ .../spark/ml/feature/CountVectorizerModel.scala | 82 --- .../spark/ml/feature/CountVectorizerSuite.scala | 167 + .../spark/ml/feature/CountVectorizorSuite.scala | 73 -- 4 files changed, 402 insertions(+), 155 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/354f4582/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala new file mode 100644 index 000..49028e4 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala @@ -0,0 +1,235 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.annotation.Experimental +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util.{Identifiable, SchemaUtils} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.mllib.linalg.{VectorUDT, Vectors} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.DataFrame +import org.apache.spark.util.collection.OpenHashMap + +/** + * Params for [[CountVectorizer]] and [[CountVectorizerModel]]. + */ +private[feature] trait CountVectorizerParams extends Params with HasInputCol with HasOutputCol { + + /** + * Max size of the vocabulary. 
+ * CountVectorizer will build a vocabulary that only considers the top + * vocabSize terms ordered by term frequency across the corpus. + * + * Default: 2^18^ + * @group param + */ + val vocabSize: IntParam = +new IntParam(this, vocabSize, max size of the vocabulary, ParamValidators.gt(0)) + + /** @group getParam */ + def getVocabSize: Int = $(vocabSize) + + /** + * Specifies the minimum number of different documents a term must appear in to be included + * in the vocabulary. + * If this is an integer = 1, this specifies the number of documents the term must appear in; + * if this is a double in [0,1), then this specifies the fraction of documents. + * + * Default: 1 + * @group param + */ + val minDF: DoubleParam = new DoubleParam(this, minDF, Specifies the minimum number of + + different documents a term must appear in to be included in the vocabulary. + + If this is an integer = 1, this specifies the number of documents the term must + + appear in; if this is a double in [0,1), then this specifies the fraction of documents., +ParamValidators.gtEq(0.0)) + + /** @group getParam */ + def getMinDF: Double = $(minDF) + + /** Validates and transforms
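A minimal usage sketch of the new estimator, assuming a DataFrame `docs` with an `Array[String]` column named "words":

```scala
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val cv = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)     // keep only the 3 most frequent terms across the corpus
  .setMinDF(2.0)       // a term must appear in at least 2 documents to be kept

val model: CountVectorizerModel = cv.fit(docs)
println(model.vocabulary.mkString(", "))    // the extracted vocabulary
val vectorized = model.transform(docs)      // appends the "features" term-count vectors
```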
spark git commit: [SPARK-9900] [MLLIB] User guide for Association Rules
Repository: spark Updated Branches: refs/heads/branch-1.5 b86378cf2 - 7ff0e5d2f [SPARK-9900] [MLLIB] User guide for Association Rules Updates FPM user guide to include Association Rules. Author: Feynman Liang fli...@databricks.com Closes #8207 from feynmanliang/SPARK-9900-arules. (cherry picked from commit f5ea3912900ccdf23e2eb419a342bfe3c0c0b61b) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ff0e5d2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ff0e5d2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ff0e5d2 Branch: refs/heads/branch-1.5 Commit: 7ff0e5d2fe07d4a9518ade26b09bcc32f418ca1b Parents: b86378c Author: Feynman Liang fli...@databricks.com Authored: Tue Aug 18 12:53:57 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:54:05 2015 -0700 -- docs/mllib-frequent-pattern-mining.md | 130 +-- docs/mllib-guide.md | 1 + .../mllib/fpm/JavaAssociationRulesSuite.java| 2 +- 3 files changed, 118 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7ff0e5d2/docs/mllib-frequent-pattern-mining.md -- diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md index 8ea4389..6c06550 100644 --- a/docs/mllib-frequent-pattern-mining.md +++ b/docs/mllib-frequent-pattern-mining.md @@ -39,18 +39,30 @@ MLlib's FP-growth implementation takes the following (hyper-)parameters: div class=codetabs div data-lang=scala markdown=1 -[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) implements the -FP-growth algorithm. -It take a `JavaRDD` of transactions, where each transaction is an `Iterable` of items of a generic type. +[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) +implements the FP-growth algorithm. It take an `RDD` of transactions, +where each transaction is an `Iterable` of items of a generic type. Calling `FPGrowth.run` with transactions returns an [`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel) -that stores the frequent itemsets with their frequencies. +that stores the frequent itemsets with their frequencies. The following +example illustrates how to mine frequent itemsets and association rules +(see [Association +Rules](mllib-frequent-pattern-mining.html#association-rules) for +details) from `transactions`. + {% highlight scala %} import org.apache.spark.rdd.RDD import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel} -val transactions: RDD[Array[String]] = ... +val transactions: RDD[Array[String]] = sc.parallelize(Seq( + r z h k p, + z y x w v u t s, + s x o n r, + x z y m t s q e, + z, + x z y r q t p) + .map(_.split( ))) val fpg = new FPGrowth() .setMinSupport(0.2) @@ -60,29 +72,48 @@ val model = fpg.run(transactions) model.freqItemsets.collect().foreach { itemset = println(itemset.items.mkString([, ,, ]) + , + itemset.freq) } + +val minConfidence = 0.8 +model.generateAssociationRules(minConfidence).collect().foreach { rule = + println( +rule.antecedent.mkString([, ,, ]) + + = + rule.consequent .mkString([, ,, ]) + + , + rule.confidence) +} {% endhighlight %} /div div data-lang=java markdown=1 -[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the -FP-growth algorithm. -It take an `RDD` of transactions, where each transaction is an `Array` of items of a generic type. 
-Calling `FPGrowth.run` with transactions returns an +[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) +implements the FP-growth algorithm. It take a `JavaRDD` of +transactions, where each transaction is an `Array` of items of a generic +type. Calling `FPGrowth.run` with transactions returns an [`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html) -that stores the frequent itemsets with their frequencies. +that stores the frequent itemsets with their frequencies. The following +example illustrates how to mine frequent itemsets and association rules +(see [Association +Rules](mllib-frequent-pattern-mining.html#association-rules) for +details) from `transactions`. {% highlight java %} +import java.util.Arrays; import java.util.List; -import com.google.common.base.Joiner; - import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.mllib.fpm.AssociationRules; import org.apache.spark.mllib.fpm.FPGrowth; import org.apache.spark.mllib.fpm.FPGrowthModel; -JavaRDDListString transactions = ... +JavaRDDListString transactions = sc.parallelize(Arrays.asList( + Arrays.asList(r z h
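The guide excerpt above mines rules from a fitted `FPGrowthModel` via `generateAssociationRules`; the same `mllib.fpm` package also exposes `AssociationRules` for running directly on pre-computed frequent itemsets (the Java test touched by this commit imports it). A short sketch, assuming an active SparkContext `sc`:

```scala
import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

// Toy frequent itemsets with their corpus frequencies.
val freqItemsets = sc.parallelize(Seq(
  new FreqItemset[String](Array("a"), 15L),
  new FreqItemset[String](Array("b"), 35L),
  new FreqItemset[String](Array("a", "b"), 12L)
))

val ar = new AssociationRules().setMinConfidence(0.8)

ar.run(freqItemsets).collect().foreach { rule =>
  println(s"${rule.antecedent.mkString("[", ",", "]")} => " +
    s"${rule.consequent.mkString("[", ",", "]")}, ${rule.confidence}")
}
```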
spark git commit: [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import
Repository: spark Updated Branches: refs/heads/branch-1.5 ec7079f9c - 9bd2e6f7c [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import See https://issues.apache.org/jira/browse/SPARK-10085 Author: Piotr Migdal pmig...@gmail.com Closes #8284 from stared/spark-10085. (cherry picked from commit 8bae9015b7e7b4528ca2bc5180771cb95d2aac13) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9bd2e6f7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9bd2e6f7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9bd2e6f7 Branch: refs/heads/branch-1.5 Commit: 9bd2e6f7cbff1835f9abefe26dbe445eaa5b004b Parents: ec7079f Author: Piotr Migdal pmig...@gmail.com Authored: Tue Aug 18 12:59:28 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:59:36 2015 -0700 -- docs/mllib-linear-methods.md | 2 -- 1 file changed, 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9bd2e6f7/docs/mllib-linear-methods.md -- diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md index 07655ba..e9b2d27 100644 --- a/docs/mllib-linear-methods.md +++ b/docs/mllib-linear-methods.md @@ -504,7 +504,6 @@ will in the future. {% highlight python %} from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel from pyspark.mllib.regression import LabeledPoint -from numpy import array # Load and parse the data def parsePoint(line): @@ -676,7 +675,6 @@ Note that the Python API does not yet support model save/load but will in the fu {% highlight python %} from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel -from numpy import array # Load and parse the data def parsePoint(line): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import
Repository: spark Updated Branches: refs/heads/master 747c2ba80 - 8bae9015b [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import See https://issues.apache.org/jira/browse/SPARK-10085 Author: Piotr Migdal pmig...@gmail.com Closes #8284 from stared/spark-10085. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8bae9015 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8bae9015 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8bae9015 Branch: refs/heads/master Commit: 8bae9015b7e7b4528ca2bc5180771cb95d2aac13 Parents: 747c2ba Author: Piotr Migdal pmig...@gmail.com Authored: Tue Aug 18 12:59:28 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:59:28 2015 -0700 -- docs/mllib-linear-methods.md | 2 -- 1 file changed, 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8bae9015/docs/mllib-linear-methods.md -- diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md index 07655ba..e9b2d27 100644 --- a/docs/mllib-linear-methods.md +++ b/docs/mllib-linear-methods.md @@ -504,7 +504,6 @@ will in the future. {% highlight python %} from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel from pyspark.mllib.regression import LabeledPoint -from numpy import array # Load and parse the data def parsePoint(line): @@ -676,7 +675,6 @@ Note that the Python API does not yet support model save/load but will in the fu {% highlight python %} from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel -from numpy import array # Load and parse the data def parsePoint(line): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide
Repository: spark Updated Branches: refs/heads/master f5ea39129 - f4fa61eff [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide Add Python examples for mllib IsotonicRegression user guide Author: Yanbo Liang yblia...@gmail.com Closes #8225 from yanboliang/spark-10029. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f4fa61ef Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f4fa61ef Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f4fa61ef Branch: refs/heads/master Commit: f4fa61effe34dae2f0eab0bef57b2dee220cf92f Parents: f5ea391 Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 18 12:55:36 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:55:36 2015 -0700 -- docs/mllib-isotonic-regression.md | 35 ++ 1 file changed, 35 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f4fa61ef/docs/mllib-isotonic-regression.md -- diff --git a/docs/mllib-isotonic-regression.md b/docs/mllib-isotonic-regression.md index 5732bc4..6aa881f 100644 --- a/docs/mllib-isotonic-regression.md +++ b/docs/mllib-isotonic-regression.md @@ -160,4 +160,39 @@ model.save(sc.sc(), myModelPath); IsotonicRegressionModel sameModel = IsotonicRegressionModel.load(sc.sc(), myModelPath); {% endhighlight %} /div + +div data-lang=python markdown=1 +Data are read from a file where each line has a format label,feature +i.e. 4710.28,500.00. The data are split to training and testing set. +Model is created using the training set and a mean squared error is calculated from the predicted +labels and real labels in the test set. + +{% highlight python %} +import math +from pyspark.mllib.regression import IsotonicRegression, IsotonicRegressionModel + +data = sc.textFile(data/mllib/sample_isotonic_regression_data.txt) + +# Create label, feature, weight tuples from input data with weight set to default value 1.0. +parsedData = data.map(lambda line: tuple([float(x) for x in line.split(',')]) + (1.0,)) + +# Split data into training (60%) and test (40%) sets. +training, test = parsedData.randomSplit([0.6, 0.4], 11) + +# Create isotonic regression model from training data. +# Isotonic parameter defaults to true so it is only shown for demonstration +model = IsotonicRegression.train(training) + +# Create tuples of predicted and real labels. +predictionAndLabel = test.map(lambda p: (model.predict(p[1]), p[0])) + +# Calculate mean squared error between predicted and real labels. +meanSquaredError = predictionAndLabel.map(lambda pl: math.pow((pl[0] - pl[1]), 2)).mean() +print(Mean Squared Error = + str(meanSquaredError)) + +# Save and load model +model.save(sc, myModelPath) +sameModel = IsotonicRegressionModel.load(sc, myModelPath) +{% endhighlight %} +/div /div - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation
Repository: spark Updated Branches: refs/heads/master bf1d6614d - 80cb25b22 [SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation Turns out that inner classes of inner objects are referenced directly, and thus moving it will break binary compatibility. Author: Michael Armbrust mich...@databricks.com Closes #8281 from marmbrus/binaryCompat. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80cb25b2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80cb25b2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80cb25b2 Branch: refs/heads/master Commit: 80cb25b228e821a80256546a2f03f73a45cf7645 Parents: bf1d661 Author: Michael Armbrust mich...@databricks.com Authored: Tue Aug 18 13:50:51 2015 -0700 Committer: Michael Armbrust mich...@databricks.com Committed: Tue Aug 18 13:50:51 2015 -0700 -- .../main/scala/org/apache/spark/sql/SQLContext.scala| 11 +++ .../main/scala/org/apache/spark/sql/SQLImplicits.scala | 10 -- .../org/apache/spark/sql/test/SharedSQLContext.scala| 12 +++- 3 files changed, 22 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/80cb25b2/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala index 53de10d..58fe75b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala @@ -334,6 +334,17 @@ class SQLContext(@transient val sparkContext: SparkContext) @Experimental object implicits extends SQLImplicits with Serializable { protected override def _sqlContext: SQLContext = self + +/** + * Converts $col name into an [[Column]]. + * @since 1.3.0 + */ +// This must live here to preserve binary compatibility with Spark 1.5. +implicit class StringToColumn(val sc: StringContext) { + def $(args: Any*): ColumnName = { +new ColumnName(sc.s(args: _*)) + } +} } // scalastyle:on http://git-wip-us.apache.org/repos/asf/spark/blob/80cb25b2/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala index 5f82372..47b6f80 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala @@ -34,16 +34,6 @@ private[sql] abstract class SQLImplicits { protected def _sqlContext: SQLContext /** - * Converts $col name into an [[Column]]. - * @since 1.3.0 - */ - implicit class StringToColumn(val sc: StringContext) { -def $(args: Any*): ColumnName = { - new ColumnName(sc.s(args: _*)) -} - } - - /** * An implicit conversion that turns a Scala `Symbol` into a [[Column]]. 
* @since 1.3.0 */ http://git-wip-us.apache.org/repos/asf/spark/blob/80cb25b2/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala b/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala index 3cfd822..8a061b6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.test -import org.apache.spark.sql.SQLContext +import org.apache.spark.sql.{ColumnName, SQLContext} /** @@ -65,4 +65,14 @@ private[sql] trait SharedSQLContext extends SQLTestUtils { } } + /** + * Converts $col name into an [[Column]]. + * @since 1.3.0 + */ + // This must be duplicated here to preserve binary compatibility with Spark 1.5. + implicit class StringToColumn(val sc: StringContext) { +def $(args: Any*): ColumnName = { + new ColumnName(sc.s(args: _*)) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
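The compatibility problem is the one the commit message names: user code compiled against earlier releases references the synthetic inner class generated for StringToColumn inside SQLContext.implicits directly, so the class has to stay nested where it originally lived. What it gives users is the $-prefixed column syntax. A minimal usage sketch, with an illustrative DataFrame, looks like this:

```
import org.apache.spark.sql.SQLContext

// Sketch of the user-facing syntax preserved by keeping StringToColumn in place.
// The DataFrame and column names are illustrative.
def dollarSyntaxExample(sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._  // brings the StringToColumn implicit into scope

  val df = sqlContext.range(0, 10).toDF("id")

  // $"id" expands to new ColumnName("id") via the implicit StringContext wrapper.
  df.select($"id", $"id" + 1).show()
}
```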
spark git commit: [SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation
Repository: spark Updated Branches: refs/heads/branch-1.5 2bccd918f - 80a6fb521 [SPARK-10080] [SQL] Fix binary incompatibility for $ column interpolation Turns out that inner classes of inner objects are referenced directly, and thus moving it will break binary compatibility. Author: Michael Armbrust mich...@databricks.com Closes #8281 from marmbrus/binaryCompat. (cherry picked from commit 80cb25b228e821a80256546a2f03f73a45cf7645) Signed-off-by: Michael Armbrust mich...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80a6fb52 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80a6fb52 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80a6fb52 Branch: refs/heads/branch-1.5 Commit: 80a6fb521a2eb9eadc3529f81a74585f7aaf5e26 Parents: 2bccd91 Author: Michael Armbrust mich...@databricks.com Authored: Tue Aug 18 13:50:51 2015 -0700 Committer: Michael Armbrust mich...@databricks.com Committed: Tue Aug 18 13:51:03 2015 -0700 -- .../main/scala/org/apache/spark/sql/SQLContext.scala| 11 +++ .../main/scala/org/apache/spark/sql/SQLImplicits.scala | 10 -- .../org/apache/spark/sql/test/SharedSQLContext.scala| 12 +++- 3 files changed, 22 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/80a6fb52/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala index 53de10d..58fe75b 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala @@ -334,6 +334,17 @@ class SQLContext(@transient val sparkContext: SparkContext) @Experimental object implicits extends SQLImplicits with Serializable { protected override def _sqlContext: SQLContext = self + +/** + * Converts $col name into an [[Column]]. + * @since 1.3.0 + */ +// This must live here to preserve binary compatibility with Spark 1.5. +implicit class StringToColumn(val sc: StringContext) { + def $(args: Any*): ColumnName = { +new ColumnName(sc.s(args: _*)) + } +} } // scalastyle:on http://git-wip-us.apache.org/repos/asf/spark/blob/80a6fb52/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala index 5f82372..47b6f80 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala @@ -34,16 +34,6 @@ private[sql] abstract class SQLImplicits { protected def _sqlContext: SQLContext /** - * Converts $col name into an [[Column]]. - * @since 1.3.0 - */ - implicit class StringToColumn(val sc: StringContext) { -def $(args: Any*): ColumnName = { - new ColumnName(sc.s(args: _*)) -} - } - - /** * An implicit conversion that turns a Scala `Symbol` into a [[Column]]. 
* @since 1.3.0 */ http://git-wip-us.apache.org/repos/asf/spark/blob/80a6fb52/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala b/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala index 3cfd822..8a061b6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala @@ -17,7 +17,7 @@ package org.apache.spark.sql.test -import org.apache.spark.sql.SQLContext +import org.apache.spark.sql.{ColumnName, SQLContext} /** @@ -65,4 +65,14 @@ private[sql] trait SharedSQLContext extends SQLTestUtils { } } + /** + * Converts $col name into an [[Column]]. + * @since 1.3.0 + */ + // This must be duplicated here to preserve binary compatibility with Spark 1.5. + implicit class StringToColumn(val sc: StringContext) { +def $(args: Any*): ColumnName = { + new ColumnName(sc.s(args: _*)) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10012] [ML] Missing test case for Params#arrayLengthGt
Repository: spark Updated Branches: refs/heads/branch-1.5 56f4da263 - fb207b245 [SPARK-10012] [ML] Missing test case for Params#arrayLengthGt Currently there is no test case for `Params#arrayLengthGt`. Author: lewuathe lewua...@me.com Closes #8223 from Lewuathe/SPARK-10012. (cherry picked from commit c635a16f64c939182196b46725ef2d00ed107cca) Signed-off-by: Joseph K. Bradley jos...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fb207b24 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fb207b24 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fb207b24 Branch: refs/heads/branch-1.5 Commit: fb207b245305b30b4fe47e08f98f2571a2d05249 Parents: 56f4da2 Author: lewuathe lewua...@me.com Authored: Tue Aug 18 15:30:23 2015 -0700 Committer: Joseph K. Bradley jos...@databricks.com Committed: Tue Aug 18 15:30:34 2015 -0700 -- mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala | 3 +++ 1 file changed, 3 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fb207b24/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala index be95638..2c878f8 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala @@ -199,6 +199,9 @@ class ParamsSuite extends SparkFunSuite { val inArray = ParamValidators.inArray[Int](Array(1, 2)) assert(inArray(1) inArray(2) !inArray(0)) + +val arrayLengthGt = ParamValidators.arrayLengthGt[Int](2.0) +assert(arrayLengthGt(Array(0, 1, 2)) !arrayLengthGt(Array(0, 1))) } test(Params.copyValues) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
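The new assertion exercises ParamValidators.arrayLengthGt, which returns a predicate over arrays; in ML pipelines such predicates are typically passed as the isValid argument of a Param. A standalone sketch of the behaviour being tested, with illustrative values:

```
import org.apache.spark.ml.param.ParamValidators

// arrayLengthGt(n) yields an Array[T] => Boolean that accepts only arrays whose
// length is strictly greater than n - exactly what the added test checks.
val validator = ParamValidators.arrayLengthGt[Int](2.0)

assert(validator(Array(0, 1, 2)))  // length 3 > 2: accepted
assert(!validator(Array(0, 1)))    // length 2 is not > 2: rejected
```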
spark git commit: [SPARK-9900] [MLLIB] User guide for Association Rules
Repository: spark Updated Branches: refs/heads/master c1840a862 - f5ea39129 [SPARK-9900] [MLLIB] User guide for Association Rules Updates FPM user guide to include Association Rules. Author: Feynman Liang fli...@databricks.com Closes #8207 from feynmanliang/SPARK-9900-arules. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5ea3912 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5ea3912 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5ea3912 Branch: refs/heads/master Commit: f5ea3912900ccdf23e2eb419a342bfe3c0c0b61b Parents: c1840a8 Author: Feynman Liang fli...@databricks.com Authored: Tue Aug 18 12:53:57 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:53:57 2015 -0700 -- docs/mllib-frequent-pattern-mining.md | 130 +-- docs/mllib-guide.md | 1 + .../mllib/fpm/JavaAssociationRulesSuite.java| 2 +- 3 files changed, 118 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea3912/docs/mllib-frequent-pattern-mining.md -- diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md index 8ea4389..6c06550 100644 --- a/docs/mllib-frequent-pattern-mining.md +++ b/docs/mllib-frequent-pattern-mining.md @@ -39,18 +39,30 @@ MLlib's FP-growth implementation takes the following (hyper-)parameters: div class=codetabs div data-lang=scala markdown=1 -[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) implements the -FP-growth algorithm. -It take a `JavaRDD` of transactions, where each transaction is an `Iterable` of items of a generic type. +[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) +implements the FP-growth algorithm. It take an `RDD` of transactions, +where each transaction is an `Iterable` of items of a generic type. Calling `FPGrowth.run` with transactions returns an [`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel) -that stores the frequent itemsets with their frequencies. +that stores the frequent itemsets with their frequencies. The following +example illustrates how to mine frequent itemsets and association rules +(see [Association +Rules](mllib-frequent-pattern-mining.html#association-rules) for +details) from `transactions`. + {% highlight scala %} import org.apache.spark.rdd.RDD import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel} -val transactions: RDD[Array[String]] = ... +val transactions: RDD[Array[String]] = sc.parallelize(Seq( + r z h k p, + z y x w v u t s, + s x o n r, + x z y m t s q e, + z, + x z y r q t p) + .map(_.split( ))) val fpg = new FPGrowth() .setMinSupport(0.2) @@ -60,29 +72,48 @@ val model = fpg.run(transactions) model.freqItemsets.collect().foreach { itemset = println(itemset.items.mkString([, ,, ]) + , + itemset.freq) } + +val minConfidence = 0.8 +model.generateAssociationRules(minConfidence).collect().foreach { rule = + println( +rule.antecedent.mkString([, ,, ]) + + = + rule.consequent .mkString([, ,, ]) + + , + rule.confidence) +} {% endhighlight %} /div div data-lang=java markdown=1 -[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the -FP-growth algorithm. -It take an `RDD` of transactions, where each transaction is an `Array` of items of a generic type. -Calling `FPGrowth.run` with transactions returns an +[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) +implements the FP-growth algorithm. 
It take a `JavaRDD` of +transactions, where each transaction is an `Array` of items of a generic +type. Calling `FPGrowth.run` with transactions returns an [`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html) -that stores the frequent itemsets with their frequencies. +that stores the frequent itemsets with their frequencies. The following +example illustrates how to mine frequent itemsets and association rules +(see [Association +Rules](mllib-frequent-pattern-mining.html#association-rules) for +details) from `transactions`. {% highlight java %} +import java.util.Arrays; import java.util.List; -import com.google.common.base.Joiner; - import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.mllib.fpm.AssociationRules; import org.apache.spark.mllib.fpm.FPGrowth; import org.apache.spark.mllib.fpm.FPGrowthModel; -JavaRDDListString transactions = ... +JavaRDDListString transactions = sc.parallelize(Arrays.asList( + Arrays.asList(r z h k p.split( )), + Arrays.asList(z y x w v u t s.split( )), + Arrays.asList(s x o n r.split( )), + Arrays.asList(x z y m t s
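The Scala listing added to the guide appears above with its quoting stripped by the list formatting. A cleaned-up sketch of the same flow follows: mine frequent itemsets with FP-growth, then derive association rules above a confidence threshold. Treat it as a reconstruction of the listing, not the guide's verbatim text.

```
import org.apache.spark.SparkContext
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// Reconstructed sketch of the guide's Scala example for FP-growth plus the new
// association-rule generation.
def fpGrowthExample(sc: SparkContext): Unit = {
  val transactions: RDD[Array[String]] = sc.parallelize(Seq(
    "r z h k p",
    "z y x w v u t s",
    "s x o n r",
    "x z y m t s q e",
    "z",
    "x z y r q t p").map(_.split(" ")))

  val model = new FPGrowth()
    .setMinSupport(0.2)
    .setNumPartitions(10)
    .run(transactions)

  // Print each frequent itemset with its frequency.
  model.freqItemsets.collect().foreach { itemset =>
    println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
  }

  // Derive rules whose confidence is at least 0.8.
  val minConfidence = 0.8
  model.generateAssociationRules(minConfidence).collect().foreach { rule =>
    println(rule.antecedent.mkString("[", ",", "]") + " => " +
      rule.consequent.mkString("[", ",", "]") + ", " + rule.confidence)
  }
}
```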
spark git commit: [SPARK-10089] [SQL] Add missing golden files.
Repository: spark Updated Branches: refs/heads/branch-1.5 80a6fb521 - 74a6b1a13 [SPARK-10089] [SQL] Add missing golden files. Author: Marcelo Vanzin van...@cloudera.com Closes #8283 from vanzin/SPARK-10089. (cherry picked from commit fa41e0242f075843beff7dc600d1a6bac004bdc7) Signed-off-by: Michael Armbrust mich...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/74a6b1a1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/74a6b1a1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/74a6b1a1 Branch: refs/heads/branch-1.5 Commit: 74a6b1a131a53a69527d56680c03db9caeefbba7 Parents: 80a6fb5 Author: Marcelo Vanzin van...@cloudera.com Authored: Tue Aug 18 14:43:05 2015 -0700 Committer: Michael Armbrust mich...@databricks.com Committed: Tue Aug 18 14:43:19 2015 -0700 -- ...uery test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 | 3 + ...uery test-0-eabbebd5c1d127b1605bfec52d7b7f3f | 500 +++ 2 files changed, 503 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/74a6b1a1/sql/hive/src/test/resources/golden/Column pruning - non-trivial top project with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 -- diff --git a/sql/hive/src/test/resources/golden/Column pruning - non-trivial top project with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 b/sql/hive/src/test/resources/golden/Column pruning - non-trivial top project with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 new file mode 100644 index 000..9a276bc --- /dev/null +++ b/sql/hive/src/test/resources/golden/Column pruning - non-trivial top project with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 @@ -0,0 +1,3 @@ +476 +172 +622 http://git-wip-us.apache.org/repos/asf/spark/blob/74a6b1a1/sql/hive/src/test/resources/golden/Partition pruning - non-partitioned, non-trivial project - query test-0-eabbebd5c1d127b1605bfec52d7b7f3f -- diff --git a/sql/hive/src/test/resources/golden/Partition pruning - non-partitioned, non-trivial project - query test-0-eabbebd5c1d127b1605bfec52d7b7f3f b/sql/hive/src/test/resources/golden/Partition pruning - non-partitioned, non-trivial project - query test-0-eabbebd5c1d127b1605bfec52d7b7f3f new file mode 100644 index 000..444039e --- /dev/null +++ b/sql/hive/src/test/resources/golden/Partition pruning - non-partitioned, non-trivial project - query test-0-eabbebd5c1d127b1605bfec52d7b7f3f @@ -0,0 +1,500 @@ +476 +172 +622 +54 +330 +818 +510 +556 +196 +968 +530 +386 +802 +300 +546 +448 +738 +132 +256 +426 +292 +812 +858 +748 +304 +938 +290 +990 +74 +654 +562 +554 +418 +30 +164 +806 +332 +834 +860 +504 +584 +438 +574 +306 +386 +676 +892 +918 +788 +474 +964 +348 +826 +988 +414 +398 +932 +416 +348 +798 +792 +494 +834 +978 +324 +754 +794 +618 +730 +532 +878 +684 +734 +650 +334 +390 +950 +34 +226 +310 +406 +678 +0 +910 +256 +622 +632 +114 +604 +410 +298 +876 +690 +258 +340 +40 +978 +314 +756 +442 +184 +222 +94 +144 +8 +560 +70 +854 +554 +416 +712 +798 +338 +764 +996 +250 +772 +874 +938 +384 +572 +374 +352 +108 +918 +102 +276 +206 +478 +426 +432 +860 +556 +352 +578 +442 +130 +636 +664 +622 +550 +274 +482 +166 +666 +360 +568 +24 +460 +362 +134 +520 +808 +768 +978 +706 +746 +544 +276 +434 +168 +696 +932 +116 +16 +822 +460 +416 +696 +48 +926 +862 +358 +344 +84 +258 +316 +238 +992 +0 +644 +394 +936 +786 +908 +200 +596 +398 +382 +836 +192 +52 +330 +654 +460 +410 +240 +262 +102 +808 +86 +872 +312 +938 +936 +616 +190 +392 +576 +962 +914 +196 +564 +394 +374 +636 +636 +818 +940 +274 +738 +632 +338 +826 +170 
+154 +0 +980 +174 +728 +358 +236 +268 +790 +564 +276 +476 +838 +30 +236 +144 +180 +614 +38 +870 +20 +554 +546 +612 +448 +618 +778 +654 +484 +738 +784 +544 +662 +802 +484 +904 +354 +452 +10 +994 +804 +792 +634 +790 +116 +70 +672 +190 +22 +336 +68 +458 +466 +286 +944 +644 +996 +320 +390 +84 +642 +860 +238 +978 +916 +156 +152 +82 +446 +984 +298 +898 +436 +456 +276 +906 +60 +418 +128 +936 +152 +148 +684 +138 +460 +66 +736 +206 +592 +226 +432 +734 +688 +334 +548 +438 +478 +970 +232 +446 +512 +526 +140 +974 +960 +802 +576 +382 +10 +488 +876 +256 +934 +864 +404 +632 +458 +938 +926 +560 +4 +70 +566 +662 +470 +160 +88 +386 +642 +670 +208 +932 +732 +350 +806 +966 +106 +210 +514 +812 +818 +380 +812 +802 +228 +516 +180 +406 +524 +696 +848 +24 +792 +402 +434 +328 +862 +908 +956 +596 +250 +862 +328 +848 +374 +764 +10 +140 +794 +960 +582 +48 +702 +510 +208 +140 +326 +876 +238 +828 +400 +982 +474 +878 +720 +496 +958 +610 +834 +398 +888 +240 +858 +338 +886 +646 +650 +554 +460 +956 +356 +936 +620 +634 +666 +986 +920 +414 +498 +530 +960 +166 +272 +706 +344 +428 +924 +466 +812 +266 +350 +378 +908 +750 +802 +842
spark git commit: [SPARK-10089] [SQL] Add missing golden files.
Repository: spark Updated Branches: refs/heads/master 9b731fad2 - fa41e0242 [SPARK-10089] [SQL] Add missing golden files. Author: Marcelo Vanzin van...@cloudera.com Closes #8283 from vanzin/SPARK-10089. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fa41e024 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fa41e024 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fa41e024 Branch: refs/heads/master Commit: fa41e0242f075843beff7dc600d1a6bac004bdc7 Parents: 9b731fa Author: Marcelo Vanzin van...@cloudera.com Authored: Tue Aug 18 14:43:05 2015 -0700 Committer: Michael Armbrust mich...@databricks.com Committed: Tue Aug 18 14:43:05 2015 -0700 -- ...uery test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 | 3 + ...uery test-0-eabbebd5c1d127b1605bfec52d7b7f3f | 500 +++ 2 files changed, 503 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fa41e024/sql/hive/src/test/resources/golden/Column pruning - non-trivial top project with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 -- diff --git a/sql/hive/src/test/resources/golden/Column pruning - non-trivial top project with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 b/sql/hive/src/test/resources/golden/Column pruning - non-trivial top project with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 new file mode 100644 index 000..9a276bc --- /dev/null +++ b/sql/hive/src/test/resources/golden/Column pruning - non-trivial top project with aliases - query test-0-515e406ffb23f6fd0d8cd34c2b25fbe6 @@ -0,0 +1,3 @@ +476 +172 +622 http://git-wip-us.apache.org/repos/asf/spark/blob/fa41e024/sql/hive/src/test/resources/golden/Partition pruning - non-partitioned, non-trivial project - query test-0-eabbebd5c1d127b1605bfec52d7b7f3f -- diff --git a/sql/hive/src/test/resources/golden/Partition pruning - non-partitioned, non-trivial project - query test-0-eabbebd5c1d127b1605bfec52d7b7f3f b/sql/hive/src/test/resources/golden/Partition pruning - non-partitioned, non-trivial project - query test-0-eabbebd5c1d127b1605bfec52d7b7f3f new file mode 100644 index 000..444039e --- /dev/null +++ b/sql/hive/src/test/resources/golden/Partition pruning - non-partitioned, non-trivial project - query test-0-eabbebd5c1d127b1605bfec52d7b7f3f @@ -0,0 +1,500 @@ +476 +172 +622 +54 +330 +818 +510 +556 +196 +968 +530 +386 +802 +300 +546 +448 +738 +132 +256 +426 +292 +812 +858 +748 +304 +938 +290 +990 +74 +654 +562 +554 +418 +30 +164 +806 +332 +834 +860 +504 +584 +438 +574 +306 +386 +676 +892 +918 +788 +474 +964 +348 +826 +988 +414 +398 +932 +416 +348 +798 +792 +494 +834 +978 +324 +754 +794 +618 +730 +532 +878 +684 +734 +650 +334 +390 +950 +34 +226 +310 +406 +678 +0 +910 +256 +622 +632 +114 +604 +410 +298 +876 +690 +258 +340 +40 +978 +314 +756 +442 +184 +222 +94 +144 +8 +560 +70 +854 +554 +416 +712 +798 +338 +764 +996 +250 +772 +874 +938 +384 +572 +374 +352 +108 +918 +102 +276 +206 +478 +426 +432 +860 +556 +352 +578 +442 +130 +636 +664 +622 +550 +274 +482 +166 +666 +360 +568 +24 +460 +362 +134 +520 +808 +768 +978 +706 +746 +544 +276 +434 +168 +696 +932 +116 +16 +822 +460 +416 +696 +48 +926 +862 +358 +344 +84 +258 +316 +238 +992 +0 +644 +394 +936 +786 +908 +200 +596 +398 +382 +836 +192 +52 +330 +654 +460 +410 +240 +262 +102 +808 +86 +872 +312 +938 +936 +616 +190 +392 +576 +962 +914 +196 +564 +394 +374 +636 +636 +818 +940 +274 +738 +632 +338 +826 +170 +154 +0 +980 +174 +728 +358 +236 +268 +790 +564 +276 +476 +838 +30 +236 +144 +180 +614 +38 +870 +20 +554 +546 +612 +448 +618 +778 +654 
+484 +738 +784 +544 +662 +802 +484 +904 +354 +452 +10 +994 +804 +792 +634 +790 +116 +70 +672 +190 +22 +336 +68 +458 +466 +286 +944 +644 +996 +320 +390 +84 +642 +860 +238 +978 +916 +156 +152 +82 +446 +984 +298 +898 +436 +456 +276 +906 +60 +418 +128 +936 +152 +148 +684 +138 +460 +66 +736 +206 +592 +226 +432 +734 +688 +334 +548 +438 +478 +970 +232 +446 +512 +526 +140 +974 +960 +802 +576 +382 +10 +488 +876 +256 +934 +864 +404 +632 +458 +938 +926 +560 +4 +70 +566 +662 +470 +160 +88 +386 +642 +670 +208 +932 +732 +350 +806 +966 +106 +210 +514 +812 +818 +380 +812 +802 +228 +516 +180 +406 +524 +696 +848 +24 +792 +402 +434 +328 +862 +908 +956 +596 +250 +862 +328 +848 +374 +764 +10 +140 +794 +960 +582 +48 +702 +510 +208 +140 +326 +876 +238 +828 +400 +982 +474 +878 +720 +496 +958 +610 +834 +398 +888 +240 +858 +338 +886 +646 +650 +554 +460 +956 +356 +936 +620 +634 +666 +986 +920 +414 +498 +530 +960 +166 +272 +706 +344 +428 +924 +466 +812 +266 +350 +378 +908 +750 +802 +842 +814 +768 +512 +52 +268 +134 +768 +758 +36 +924 +984 +200 +596 +18 +682 +996 +292 +916 +724 +372 +570 +696 +334 +36 +546 +366 +562
spark git commit: [SPARK-10012] [ML] Missing test case for Params#arrayLengthGt
Repository: spark Updated Branches: refs/heads/master 1dbffba37 - c635a16f6 [SPARK-10012] [ML] Missing test case for Params#arrayLengthGt Currently there is no test case for `Params#arrayLengthGt`. Author: lewuathe lewua...@me.com Closes #8223 from Lewuathe/SPARK-10012. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c635a16f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c635a16f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c635a16f Branch: refs/heads/master Commit: c635a16f64c939182196b46725ef2d00ed107cca Parents: 1dbffba Author: lewuathe lewua...@me.com Authored: Tue Aug 18 15:30:23 2015 -0700 Committer: Joseph K. Bradley jos...@databricks.com Committed: Tue Aug 18 15:30:23 2015 -0700 -- mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala | 3 +++ 1 file changed, 3 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c635a16f/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala index be95638..2c878f8 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala @@ -199,6 +199,9 @@ class ParamsSuite extends SparkFunSuite { val inArray = ParamValidators.inArray[Int](Array(1, 2)) assert(inArray(1) inArray(2) !inArray(0)) + +val arrayLengthGt = ParamValidators.arrayLengthGt[Int](2.0) +assert(arrayLengthGt(Array(0, 1, 2)) !arrayLengthGt(Array(0, 1))) } test(Params.copyValues) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARKR] [MINOR] Get rid of a long line warning
Repository: spark Updated Branches: refs/heads/branch-1.5 9b42e2404 - 0a1385e31 [SPARKR] [MINOR] Get rid of a long line warning ``` R/functions.R:74:1: style: lines should not be more than 100 characters. jc - callJStatic(org.apache.spark.sql.functions, lit, ifelse(class(x) == Column, xjc, x)) ^ ``` Author: Yu ISHIKAWA yuu.ishik...@gmail.com Closes #8297 from yu-iskw/minor-lint-r. (cherry picked from commit b4b35f133aecaf84f04e8e444b660a33c6b7894a) Signed-off-by: Shivaram Venkataraman shiva...@cs.berkeley.edu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0a1385e3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0a1385e3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0a1385e3 Branch: refs/heads/branch-1.5 Commit: 0a1385e319a2bca115b6bfefe7820b78ce5fb753 Parents: 9b42e24 Author: Yu ISHIKAWA yuu.ishik...@gmail.com Authored: Tue Aug 18 19:18:05 2015 -0700 Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu Committed: Tue Aug 18 19:18:13 2015 -0700 -- R/pkg/R/functions.R | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0a1385e3/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 6eef4d6..e606b20 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -71,7 +71,9 @@ createFunctions() #' @return Creates a Column class of literal value. setMethod(lit, signature(ANY), function(x) { -jc - callJStatic(org.apache.spark.sql.functions, lit, ifelse(class(x) == Column, x@jc, x)) +jc - callJStatic(org.apache.spark.sql.functions, + lit, + ifelse(class(x) == Column, x@jc, x)) column(jc) }) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10095] [SQL] use public API of BigInteger
Repository: spark Updated Branches: refs/heads/master bf32c1f7f - 270ee6777 [SPARK-10095] [SQL] use public API of BigInteger In UnsafeRow, we use the private field of BigInteger for better performance, but it actually didn't contribute much (3% in one benchmark) to end-to-end runtime, and make it not portable (may fail on other JVM implementations). So we should use the public API instead. cc rxin Author: Davies Liu dav...@databricks.com Closes #8286 from davies/portable_decimal. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/270ee677 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/270ee677 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/270ee677 Branch: refs/heads/master Commit: 270ee677750a1f2adaf24b5816857194e61782ff Parents: bf32c1f Author: Davies Liu dav...@databricks.com Authored: Tue Aug 18 20:39:59 2015 -0700 Committer: Davies Liu davies@gmail.com Committed: Tue Aug 18 20:39:59 2015 -0700 -- .../sql/catalyst/expressions/UnsafeRow.java | 29 ++-- .../catalyst/expressions/UnsafeRowWriters.java | 9 ++ .../java/org/apache/spark/unsafe/Platform.java | 18 3 files changed, 11 insertions(+), 45 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/270ee677/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java -- diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java index 7fd9477..6c02004 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java @@ -273,14 +273,13 @@ public final class UnsafeRow extends MutableRow { } else { final BigInteger integer = value.toJavaBigDecimal().unscaledValue(); -final int[] mag = (int[]) Platform.getObjectVolatile(integer, - Platform.BIG_INTEGER_MAG_OFFSET); -assert(mag.length = 4); +byte[] bytes = integer.toByteArray(); +assert(bytes.length = 16); // Write the bytes to the variable length portion. 
Platform.copyMemory( - mag, Platform.INT_ARRAY_OFFSET, baseObject, baseOffset + cursor, mag.length * 4); -setLong(ordinal, (cursor 32) | ((long) (((integer.signum() + 1) 8) + mag.length))); + bytes, Platform.BYTE_ARRAY_OFFSET, baseObject, baseOffset + cursor, bytes.length); +setLong(ordinal, (cursor 32) | ((long) bytes.length)); } } } @@ -375,8 +374,6 @@ public final class UnsafeRow extends MutableRow { return Platform.getDouble(baseObject, getFieldOffset(ordinal)); } - private static byte[] EMPTY = new byte[0]; - @Override public Decimal getDecimal(int ordinal, int precision, int scale) { if (isNullAt(ordinal)) { @@ -385,20 +382,10 @@ public final class UnsafeRow extends MutableRow { if (precision = Decimal.MAX_LONG_DIGITS()) { return Decimal.apply(getLong(ordinal), precision, scale); } else { - long offsetAndSize = getLong(ordinal); - long offset = offsetAndSize 32; - int signum = ((int) (offsetAndSize 0xfff) 8); - assert signum =0 signum = 2 : invalid signum + signum; - int size = (int) (offsetAndSize 0xff); - int[] mag = new int[size]; - Platform.copyMemory( -baseObject, baseOffset + offset, mag, Platform.INT_ARRAY_OFFSET, size * 4); - - // create a BigInteger using signum and mag - BigInteger v = new BigInteger(0, EMPTY); // create the initial object - Platform.putInt(v, Platform.BIG_INTEGER_SIGNUM_OFFSET, signum - 1); - Platform.putObjectVolatile(v, Platform.BIG_INTEGER_MAG_OFFSET, mag); - return Decimal.apply(new BigDecimal(v, scale), precision, scale); + byte[] bytes = getBinary(ordinal); + BigInteger bigInteger = new BigInteger(bytes); + BigDecimal javaDecimal = new BigDecimal(bigInteger, scale); + return Decimal.apply(javaDecimal, precision, scale); } } http://git-wip-us.apache.org/repos/asf/spark/blob/270ee677/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java -- diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java index 005351f..2f43db6 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java +++
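The change drops Unsafe reads of BigInteger's private signum/mag fields in favour of the public toByteArray/constructor pair. A small, self-contained sketch of the round trip UnsafeRow now performs for a high-precision decimal; the sample value is illustrative:

```
import java.math.{BigDecimal => JBigDecimal, BigInteger}

// Serialize a decimal's unscaled value through public BigInteger APIs only,
// then rebuild it with the stored scale - the path the patched code takes.
val original = new JBigDecimal("123456789012345678901.2345")  // precision > 18 digits
val scale = original.scale()

// Write path: unscaled value -> two's-complement byte array.
val bytes: Array[Byte] = original.unscaledValue().toByteArray
assert(bytes.length <= 16)  // fits in the bytes UnsafeRow reserves

// Read path: byte array -> BigInteger -> BigDecimal with the stored scale.
val restored = new JBigDecimal(new BigInteger(bytes), scale)
assert(restored == original)
```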
spark git commit: [SPARK-10072] [STREAMING] BlockGenerator can deadlock when the queue of generated blocks fills up to capacity

Repository: spark Updated Branches: refs/heads/master b4b35f133 - 1aeae05bb [SPARK-10072] [STREAMING] BlockGenerator can deadlock when the queue of generate blocks fills up to capacity Generated blocks are inserted into an ArrayBlockingQueue, and another thread pulls stuff from the ArrayBlockingQueue and pushes it into BlockManager. Now if that queue fills up to capacity (default is 10 blocks), then the inserting into queue (done in the function updateCurrentBuffer) get blocked inside a synchronized block. However, the thread that is pulling blocks from the queue uses the same lock to check the current (active or stopped) while pulling from the queue. Since the block generating threads is blocked (as the queue is full) on the lock, this thread that is supposed to drain the queue gets blocked. Ergo, deadlock. Solution: Moved blocking call to ArrayBlockingQueue outside the synchronized to prevent deadlock. Author: Tathagata Das tathagata.das1...@gmail.com Closes #8257 from tdas/SPARK-10072. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1aeae05b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1aeae05b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1aeae05b Branch: refs/heads/master Commit: 1aeae05bb20f01ab7ccaa62fe905a63e020074b5 Parents: b4b35f1 Author: Tathagata Das tathagata.das1...@gmail.com Authored: Tue Aug 18 19:26:38 2015 -0700 Committer: Tathagata Das tathagata.das1...@gmail.com Committed: Tue Aug 18 19:26:38 2015 -0700 -- .../streaming/receiver/BlockGenerator.scala | 29 +--- 1 file changed, 19 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1aeae05b/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala -- diff --git a/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala b/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala index 300e820..421d60a 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala @@ -227,16 +227,21 @@ private[streaming] class BlockGenerator( def isStopped(): Boolean = state == StoppedAll /** Change the buffer to which single records are added to. 
*/ - private def updateCurrentBuffer(time: Long): Unit = synchronized { + private def updateCurrentBuffer(time: Long): Unit = { try { - val newBlockBuffer = currentBuffer - currentBuffer = new ArrayBuffer[Any] - if (newBlockBuffer.size 0) { -val blockId = StreamBlockId(receiverId, time - blockIntervalMs) -val newBlock = new Block(blockId, newBlockBuffer) -listener.onGenerateBlock(blockId) + var newBlock: Block = null + synchronized { +if (currentBuffer.nonEmpty) { + val newBlockBuffer = currentBuffer + currentBuffer = new ArrayBuffer[Any] + val blockId = StreamBlockId(receiverId, time - blockIntervalMs) + listener.onGenerateBlock(blockId) + newBlock = new Block(blockId, newBlockBuffer) +} + } + + if (newBlock != null) { blocksForPushing.put(newBlock) // put is blocking when queue is full -logDebug(Last element in + blockId + is + newBlockBuffer.last) } } catch { case ie: InterruptedException = @@ -250,9 +255,13 @@ private[streaming] class BlockGenerator( private def keepPushingBlocks() { logInfo(Started block pushing thread) -def isGeneratingBlocks = synchronized { state == Active || state == StoppedAddingData } +def areBlocksBeingGenerated: Boolean = synchronized { + state != StoppedGeneratingBlocks +} + try { - while (isGeneratingBlocks) { + // While blocks are being generated, keep polling for to-be-pushed blocks and push them. + while (areBlocksBeingGenerated) { Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match { case Some(block) = pushBlock(block) case None = - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
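The shape of the fix is easy to miss inside the diff: the buffer swap stays under the lock, while the potentially blocking ArrayBlockingQueue.put moves outside it, so the pushing thread can still take the same lock to check state. A stripped-down sketch of that pattern, with simplified types and an illustrative class name:

```
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch of the corrected locking pattern; TinyGenerator is not a
// Spark class, just a simplified model of BlockGenerator.
class TinyGenerator(capacity: Int = 10) {
  private val blocksForPushing = new ArrayBlockingQueue[Seq[Any]](capacity)
  private var currentBuffer = new ArrayBuffer[Any]
  @volatile private var active = true

  def addData(d: Any): Unit = synchronized { currentBuffer += d }

  // Timer thread: swap the buffer under the lock, enqueue outside it, so the
  // lock is never held across a blocking put() on a full queue.
  def rollBuffer(): Unit = {
    var newBlock: Seq[Any] = null
    synchronized {
      if (currentBuffer.nonEmpty) {
        newBlock = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
      }
    }
    if (newBlock != null) {
      blocksForPushing.put(newBlock)  // may block when the queue is full
    }
  }

  // Pushing thread: needs the same lock to read the state, which is exactly why
  // blocking inside the synchronized block above could deadlock.
  def pushOnce(): Unit = {
    val keepGoing = synchronized { active }
    if (keepGoing) {
      Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)).foreach { block =>
        println(s"pushing a block of ${block.size} records")
      }
    }
  }

  def stop(): Unit = synchronized { active = false }
}
```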
spark git commit: [SPARK-10072] [STREAMING] BlockGenerator can deadlock when the queue of generated blocks fills up to capacity
Repository: spark Updated Branches: refs/heads/branch-1.5 0a1385e31 - 08c5962a2 [SPARK-10072] [STREAMING] BlockGenerator can deadlock when the queue of generate blocks fills up to capacity Generated blocks are inserted into an ArrayBlockingQueue, and another thread pulls stuff from the ArrayBlockingQueue and pushes it into BlockManager. Now if that queue fills up to capacity (default is 10 blocks), then the inserting into queue (done in the function updateCurrentBuffer) get blocked inside a synchronized block. However, the thread that is pulling blocks from the queue uses the same lock to check the current (active or stopped) while pulling from the queue. Since the block generating threads is blocked (as the queue is full) on the lock, this thread that is supposed to drain the queue gets blocked. Ergo, deadlock. Solution: Moved blocking call to ArrayBlockingQueue outside the synchronized to prevent deadlock. Author: Tathagata Das tathagata.das1...@gmail.com Closes #8257 from tdas/SPARK-10072. (cherry picked from commit 1aeae05bb20f01ab7ccaa62fe905a63e020074b5) Signed-off-by: Tathagata Das tathagata.das1...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/08c5962a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/08c5962a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/08c5962a Branch: refs/heads/branch-1.5 Commit: 08c5962a251555e7d34460135ab6c32cce584704 Parents: 0a1385e Author: Tathagata Das tathagata.das1...@gmail.com Authored: Tue Aug 18 19:26:38 2015 -0700 Committer: Tathagata Das tathagata.das1...@gmail.com Committed: Tue Aug 18 19:26:51 2015 -0700 -- .../streaming/receiver/BlockGenerator.scala | 29 +--- 1 file changed, 19 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/08c5962a/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala -- diff --git a/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala b/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala index 300e820..421d60a 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/receiver/BlockGenerator.scala @@ -227,16 +227,21 @@ private[streaming] class BlockGenerator( def isStopped(): Boolean = state == StoppedAll /** Change the buffer to which single records are added to. 
*/ - private def updateCurrentBuffer(time: Long): Unit = synchronized { + private def updateCurrentBuffer(time: Long): Unit = { try { - val newBlockBuffer = currentBuffer - currentBuffer = new ArrayBuffer[Any] - if (newBlockBuffer.size 0) { -val blockId = StreamBlockId(receiverId, time - blockIntervalMs) -val newBlock = new Block(blockId, newBlockBuffer) -listener.onGenerateBlock(blockId) + var newBlock: Block = null + synchronized { +if (currentBuffer.nonEmpty) { + val newBlockBuffer = currentBuffer + currentBuffer = new ArrayBuffer[Any] + val blockId = StreamBlockId(receiverId, time - blockIntervalMs) + listener.onGenerateBlock(blockId) + newBlock = new Block(blockId, newBlockBuffer) +} + } + + if (newBlock != null) { blocksForPushing.put(newBlock) // put is blocking when queue is full -logDebug(Last element in + blockId + is + newBlockBuffer.last) } } catch { case ie: InterruptedException = @@ -250,9 +255,13 @@ private[streaming] class BlockGenerator( private def keepPushingBlocks() { logInfo(Started block pushing thread) -def isGeneratingBlocks = synchronized { state == Active || state == StoppedAddingData } +def areBlocksBeingGenerated: Boolean = synchronized { + state != StoppedGeneratingBlocks +} + try { - while (isGeneratingBlocks) { + // While blocks are being generated, keep polling for to-be-pushed blocks and push them. + while (areBlocksBeingGenerated) { Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match { case Some(block) = pushBlock(block) case None = - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9939] [SQL] Resorts to Java process API in CliSuite, HiveSparkSubmitSuite and HiveThriftServer2 test suites
Repository: spark Updated Branches: refs/heads/branch-1.5 a6f8979c8 - bb2fb59f9 [SPARK-9939] [SQL] Resorts to Java process API in CliSuite, HiveSparkSubmitSuite and HiveThriftServer2 test suites Scala process API has a known bug ([SI-8768] [1]), which may be the reason why several test suites which fork sub-processes are flaky. This PR replaces Scala process API with Java process API in `CliSuite`, `HiveSparkSubmitSuite`, and `HiveThriftServer2` related test suites to see whether it fix these flaky tests. [1]: https://issues.scala-lang.org/browse/SI-8768 Author: Cheng Lian l...@databricks.com Closes #8168 from liancheng/spark-9939/use-java-process-api. (cherry picked from commit a5b5b936596ceb45f5f5b68bf1d6368534fb9470) Signed-off-by: Cheng Lian l...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb2fb59f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb2fb59f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb2fb59f Branch: refs/heads/branch-1.5 Commit: bb2fb59f9d94f2bf175e06eae87ccefbdbbbf724 Parents: a6f8979 Author: Cheng Lian l...@databricks.com Authored: Wed Aug 19 11:21:46 2015 +0800 Committer: Cheng Lian l...@databricks.com Committed: Wed Aug 19 11:22:31 2015 +0800 -- .../spark/sql/test/ProcessTestUtils.scala | 37 + sql/hive-thriftserver/pom.xml | 7 ++ .../spark/sql/hive/thriftserver/CliSuite.scala | 64 ++- .../thriftserver/HiveThriftServer2Suites.scala | 83 +++- .../spark/sql/hive/HiveSparkSubmitSuite.scala | 49 5 files changed, 149 insertions(+), 91 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bb2fb59f/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala b/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala new file mode 100644 index 000..152c9c8 --- /dev/null +++ b/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala @@ -0,0 +1,37 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.test + +import java.io.{IOException, InputStream} + +import scala.sys.process.BasicIO + +object ProcessTestUtils { + class ProcessOutputCapturer(stream: InputStream, capture: String = Unit) extends Thread { +this.setDaemon(true) + +override def run(): Unit = { + try { +BasicIO.processFully(capture)(stream) + } catch { case _: IOException = +// Ignores the IOException thrown when the process termination, which closes the input +// stream abruptly. 
+ } +} + } +} http://git-wip-us.apache.org/repos/asf/spark/blob/bb2fb59f/sql/hive-thriftserver/pom.xml -- diff --git a/sql/hive-thriftserver/pom.xml b/sql/hive-thriftserver/pom.xml index 2dfbcb2..3566c87 100644 --- a/sql/hive-thriftserver/pom.xml +++ b/sql/hive-thriftserver/pom.xml @@ -86,6 +86,13 @@ artifactIdselenium-java/artifactId scopetest/scope /dependency +dependency + groupIdorg.apache.spark/groupId + artifactIdspark-sql_${scala.binary.version}/artifactId + typetest-jar/type + version${project.version}/version + scopetest/scope +/dependency /dependencies build outputDirectorytarget/scala-${scala.binary.version}/classes/outputDirectory http://git-wip-us.apache.org/repos/asf/spark/blob/bb2fb59f/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala -- diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala index 121b3e0..e59a14e 100644 ---
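The new ProcessTestUtils.ProcessOutputCapturer pairs java.lang.ProcessBuilder with a daemon thread that drains the child's output via BasicIO.processFully. A minimal sketch of that pattern outside the test suites; the command and the capture logic are illustrative:

```
import java.io.{IOException, InputStream}
import scala.collection.mutable.ArrayBuffer
import scala.sys.process.BasicIO

// Launch a child process through java.lang.ProcessBuilder and collect its merged
// output on a daemon thread, the same shape as the new helper class.
object ProcessSketch {
  private class OutputCapturer(stream: InputStream, capture: String => Unit) extends Thread {
    setDaemon(true)
    override def run(): Unit = {
      try {
        BasicIO.processFully(capture)(stream)
      } catch {
        case _: IOException => // the stream closes abruptly when the child exits
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val builder = new ProcessBuilder("echo", "hello from the child")
    builder.redirectErrorStream(true)  // merge stderr into stdout

    val process = builder.start()
    val lines = new ArrayBuffer[String]
    val capturer = new OutputCapturer(process.getInputStream, line => lines.append(line))
    capturer.start()

    val exitCode = process.waitFor()
    capturer.join()
    println(s"exit code $exitCode, output: ${lines.mkString(" / ")}")
  }
}
```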
spark git commit: [SPARK-9939] [SQL] Resorts to Java process API in CliSuite, HiveSparkSubmitSuite and HiveThriftServer2 test suites
Repository: spark Updated Branches: refs/heads/master 90273eff9 - a5b5b9365 [SPARK-9939] [SQL] Resorts to Java process API in CliSuite, HiveSparkSubmitSuite and HiveThriftServer2 test suites Scala process API has a known bug ([SI-8768] [1]), which may be the reason why several test suites which fork sub-processes are flaky. This PR replaces Scala process API with Java process API in `CliSuite`, `HiveSparkSubmitSuite`, and `HiveThriftServer2` related test suites to see whether it fix these flaky tests. [1]: https://issues.scala-lang.org/browse/SI-8768 Author: Cheng Lian l...@databricks.com Closes #8168 from liancheng/spark-9939/use-java-process-api. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a5b5b936 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a5b5b936 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a5b5b936 Branch: refs/heads/master Commit: a5b5b936596ceb45f5f5b68bf1d6368534fb9470 Parents: 90273ef Author: Cheng Lian l...@databricks.com Authored: Wed Aug 19 11:21:46 2015 +0800 Committer: Cheng Lian l...@databricks.com Committed: Wed Aug 19 11:21:46 2015 +0800 -- .../spark/sql/test/ProcessTestUtils.scala | 37 + sql/hive-thriftserver/pom.xml | 7 ++ .../spark/sql/hive/thriftserver/CliSuite.scala | 64 ++- .../thriftserver/HiveThriftServer2Suites.scala | 83 +++- .../spark/sql/hive/HiveSparkSubmitSuite.scala | 49 5 files changed, 149 insertions(+), 91 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a5b5b936/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala b/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala new file mode 100644 index 000..152c9c8 --- /dev/null +++ b/sql/core/src/test/scala/org/apache/spark/sql/test/ProcessTestUtils.scala @@ -0,0 +1,37 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.test + +import java.io.{IOException, InputStream} + +import scala.sys.process.BasicIO + +object ProcessTestUtils { + class ProcessOutputCapturer(stream: InputStream, capture: String = Unit) extends Thread { +this.setDaemon(true) + +override def run(): Unit = { + try { +BasicIO.processFully(capture)(stream) + } catch { case _: IOException = +// Ignores the IOException thrown when the process termination, which closes the input +// stream abruptly. 
+ } +} + } +} http://git-wip-us.apache.org/repos/asf/spark/blob/a5b5b936/sql/hive-thriftserver/pom.xml -- diff --git a/sql/hive-thriftserver/pom.xml b/sql/hive-thriftserver/pom.xml index 2dfbcb2..3566c87 100644 --- a/sql/hive-thriftserver/pom.xml +++ b/sql/hive-thriftserver/pom.xml @@ -86,6 +86,13 @@ artifactIdselenium-java/artifactId scopetest/scope /dependency +dependency + groupIdorg.apache.spark/groupId + artifactIdspark-sql_${scala.binary.version}/artifactId + typetest-jar/type + version${project.version}/version + scopetest/scope +/dependency /dependencies build outputDirectorytarget/scala-${scala.binary.version}/classes/outputDirectory http://git-wip-us.apache.org/repos/asf/spark/blob/a5b5b936/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala -- diff --git a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala index 121b3e0..e59a14e 100644 --- a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala +++
spark git commit: [SPARKR] [MINOR] Get rid of a long line warning
Repository: spark Updated Branches: refs/heads/master 1f8902964 - b4b35f133 [SPARKR] [MINOR] Get rid of a long line warning ``` R/functions.R:74:1: style: lines should not be more than 100 characters. jc - callJStatic(org.apache.spark.sql.functions, lit, ifelse(class(x) == Column, xjc, x)) ^ ``` Author: Yu ISHIKAWA yuu.ishik...@gmail.com Closes #8297 from yu-iskw/minor-lint-r. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b4b35f13 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b4b35f13 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b4b35f13 Branch: refs/heads/master Commit: b4b35f133aecaf84f04e8e444b660a33c6b7894a Parents: 1f89029 Author: Yu ISHIKAWA yuu.ishik...@gmail.com Authored: Tue Aug 18 19:18:05 2015 -0700 Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu Committed: Tue Aug 18 19:18:05 2015 -0700 -- R/pkg/R/functions.R | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b4b35f13/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 6eef4d6..e606b20 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -71,7 +71,9 @@ createFunctions() #' @return Creates a Column class of literal value. setMethod(lit, signature(ANY), function(x) { -jc - callJStatic(org.apache.spark.sql.functions, lit, ifelse(class(x) == Column, x@jc, x)) +jc - callJStatic(org.apache.spark.sql.functions, + lit, + ifelse(class(x) == Column, x@jc, x)) column(jc) }) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10102] [STREAMING] Fix a race condition where startReceiver may run before trackerState is set to Started
Repository: spark Updated Branches: refs/heads/branch-1.5 08c5962a2 - a6f8979c8 [SPARK-10102] [STREAMING] Fix a race condition that startReceiver may happen before setting trackerState to Started Test failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3305/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/stop_gracefully/ There is a race condition that setting `trackerState` to `Started` could happen after calling `startReceiver`. Then `startReceiver` won't start the receivers because it uses `! isTrackerStarted` to check if ReceiverTracker is stopping or stopped. But actually, `trackerState` is `Initialized` and will be changed to `Started` soon. Therefore, we should use `isTrackerStopping || isTrackerStopped`. Author: zsxwing zsxw...@gmail.com Closes #8294 from zsxwing/SPARK-9504. (cherry picked from commit 90273eff9604439a5a5853077e232d34555c67d7) Signed-off-by: Tathagata Das tathagata.das1...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a6f8979c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a6f8979c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a6f8979c Branch: refs/heads/branch-1.5 Commit: a6f8979c81c5355759f74e8b3c9eb3cafb6a9c7f Parents: 08c5962 Author: zsxwing zsxw...@gmail.com Authored: Tue Aug 18 20:15:54 2015 -0700 Committer: Tathagata Das tathagata.das1...@gmail.com Committed: Tue Aug 18 20:16:18 2015 -0700 -- .../spark/streaming/scheduler/ReceiverTracker.scala | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a6f8979c/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala -- diff --git a/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala b/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala index e076fb5..aae3acf 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala @@ -468,8 +468,13 @@ class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false * Start a receiver along with its scheduled executors */ private def startReceiver(receiver: Receiver[_], scheduledExecutors: Seq[String]): Unit = { + def shouldStartReceiver: Boolean = { +// It's okay to start when trackerState is Initialized or Started +!(isTrackerStopping || isTrackerStopped) + } + val receiverId = receiver.streamId - if (!isTrackerStarted) { + if (!shouldStartReceiver) { onReceiverJobFinish(receiverId) return } @@ -494,14 +499,14 @@ class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false // We will keep restarting the receiver job until ReceiverTracker is stopped future.onComplete { case Success(_) = - if (!isTrackerStarted) { + if (!shouldStartReceiver) { onReceiverJobFinish(receiverId) } else { logInfo(sRestarting Receiver $receiverId) self.send(RestartReceiver(receiver)) } case Failure(e) = - if (!isTrackerStarted) { + if (!shouldStartReceiver) { onReceiverJobFinish(receiverId) } else { logError(Receiver has been stopped. Try to restart it., e) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
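The essence of the fix is that the guard should ask "has shutdown begun?" rather than "has startup finished?", because the receiver job can be launched asynchronously while the tracker is still Initialized. A simplified sketch of the two guards, using an illustrative state enumeration rather than Spark's own:

```
object ReceiverStartSketch {
  // Simplified mirror of the tracker's state machine, for illustration only.
  object TrackerState extends Enumeration {
    val Initialized, Started, Stopping, Stopped = Value
  }
  import TrackerState._

  @volatile private var state: TrackerState.Value = Initialized

  // Buggy guard: false while the tracker is still Initialized, so a receiver
  // scheduled just before the state flips to Started would never be launched.
  private def isTrackerStarted: Boolean = state == Started

  // Fixed guard: starting is fine in Initialized or Started; only refuse once
  // shutdown has begun.
  private def shouldStartReceiver: Boolean = !(state == Stopping || state == Stopped)
}
```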
spark git commit: [SPARK-10102] [STREAMING] Fix a race condition where startReceiver may run before trackerState is set to Started
Repository: spark Updated Branches: refs/heads/master 1aeae05bb - 90273eff9 [SPARK-10102] [STREAMING] Fix a race condition that startReceiver may happen before setting trackerState to Started Test failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=spark-test/3305/testReport/junit/org.apache.spark.streaming/StreamingContextSuite/stop_gracefully/ There is a race condition that setting `trackerState` to `Started` could happen after calling `startReceiver`. Then `startReceiver` won't start the receivers because it uses `! isTrackerStarted` to check if ReceiverTracker is stopping or stopped. But actually, `trackerState` is `Initialized` and will be changed to `Started` soon. Therefore, we should use `isTrackerStopping || isTrackerStopped`. Author: zsxwing zsxw...@gmail.com Closes #8294 from zsxwing/SPARK-9504. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/90273eff Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/90273eff Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/90273eff Branch: refs/heads/master Commit: 90273eff9604439a5a5853077e232d34555c67d7 Parents: 1aeae05 Author: zsxwing zsxw...@gmail.com Authored: Tue Aug 18 20:15:54 2015 -0700 Committer: Tathagata Das tathagata.das1...@gmail.com Committed: Tue Aug 18 20:15:54 2015 -0700 -- .../spark/streaming/scheduler/ReceiverTracker.scala | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/90273eff/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala -- diff --git a/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala b/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala index e076fb5..aae3acf 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala @@ -468,8 +468,13 @@ class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false * Start a receiver along with its scheduled executors */ private def startReceiver(receiver: Receiver[_], scheduledExecutors: Seq[String]): Unit = { + def shouldStartReceiver: Boolean = { +// It's okay to start when trackerState is Initialized or Started +!(isTrackerStopping || isTrackerStopped) + } + val receiverId = receiver.streamId - if (!isTrackerStarted) { + if (!shouldStartReceiver) { onReceiverJobFinish(receiverId) return } @@ -494,14 +499,14 @@ class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false // We will keep restarting the receiver job until ReceiverTracker is stopped future.onComplete { case Success(_) = - if (!isTrackerStarted) { + if (!shouldStartReceiver) { onReceiverJobFinish(receiverId) } else { logInfo(sRestarting Receiver $receiverId) self.send(RestartReceiver(receiver)) } case Failure(e) = - if (!isTrackerStarted) { + if (!shouldStartReceiver) { onReceiverJobFinish(receiverId) } else { logError(Receiver has been stopped. Try to restart it., e) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10075] [SPARKR] Add `when` expression function in SparkR
Repository: spark Updated Branches: refs/heads/master a5b5b9365 - bf32c1f7f [SPARK-10075] [SPARKR] Add `when` expressino function in SparkR - Add `when` and `otherwise` as `Column` methods - Add `When` as an expression function - Add `%otherwise%` infix as an alias of `otherwise` Since R doesn't support a feature like method chaining, `otherwise(when(condition, value), value)` style is a little annoying for me. If `%otherwise%` looks strange for shivaram, I can remove it. What do you think? ### JIRA [[SPARK-10075] Add `when` expressino function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075) Author: Yu ISHIKAWA yuu.ishik...@gmail.com Closes #8266 from yu-iskw/SPARK-10075. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bf32c1f7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bf32c1f7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bf32c1f7 Branch: refs/heads/master Commit: bf32c1f7f47dd907d787469f979c5859e02ce5e6 Parents: a5b5b93 Author: Yu ISHIKAWA yuu.ishik...@gmail.com Authored: Tue Aug 18 20:27:36 2015 -0700 Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu Committed: Tue Aug 18 20:27:36 2015 -0700 -- R/pkg/NAMESPACE | 2 ++ R/pkg/R/column.R | 14 ++ R/pkg/R/functions.R | 14 ++ R/pkg/R/generics.R | 8 R/pkg/inst/tests/test_sparkSQL.R | 7 +++ 5 files changed, 45 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bf32c1f7/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 607aef2..8fa12d5 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -152,6 +152,7 @@ exportMethods(abs, n_distinct, nanvl, negate, + otherwise, pmod, quarter, reverse, @@ -182,6 +183,7 @@ exportMethods(abs, unhex, upper, weekofyear, + when, year) exportClasses(GroupedData) http://git-wip-us.apache.org/repos/asf/spark/blob/bf32c1f7/R/pkg/R/column.R -- diff --git a/R/pkg/R/column.R b/R/pkg/R/column.R index 328f595..5a07ebd 100644 --- a/R/pkg/R/column.R +++ b/R/pkg/R/column.R @@ -203,3 +203,17 @@ setMethod(%in%, jc - callJMethod(x@jc, in, table) return(column(jc)) }) + +#' otherwise +#' +#' If values in the specified column are null, returns the value. +#' Can be used in conjunction with `when` to specify a default value for expressions. +#' +#' @rdname column +setMethod(otherwise, + signature(x = Column, value = ANY), + function(x, value) { +value - ifelse(class(value) == Column, value@jc, value) +jc - callJMethod(x@jc, otherwise, value) +column(jc) + }) http://git-wip-us.apache.org/repos/asf/spark/blob/bf32c1f7/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index e606b20..366c230 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -165,3 +165,17 @@ setMethod(n, signature(x = Column), function(x) { count(x) }) + +#' when +#' +#' Evaluates a list of conditions and returns one of multiple possible result expressions. +#' For unmatched expressions null is returned. +#' +#' @rdname column +setMethod(when, signature(condition = Column, value = ANY), + function(condition, value) { + condition - condition@jc + value - ifelse(class(value) == Column, value@jc, value) + jc - callJStatic(org.apache.spark.sql.functions, when, condition, value) + column(jc) + }) http://git-wip-us.apache.org/repos/asf/spark/blob/bf32c1f7/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 5c1cc98..338b32e 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -651,6 +651,14 @@ setGeneric(rlike, function(x, ...) 
{ standardGeneric(rlike) }) #' @export setGeneric(startsWith, function(x, ...) { standardGeneric(startsWith) }) +#' @rdname column +#' @export +setGeneric(when, function(condition, value) { standardGeneric(when) }) + +#' @rdname column +#' @export +setGeneric(otherwise, function(x, value) { standardGeneric(otherwise) }) + ## Expression Function Methods ## http://git-wip-us.apache.org/repos/asf/spark/blob/bf32c1f7/R/pkg/inst/tests/test_sparkSQL.R
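For context, the new R `when` and `otherwise` methods delegate (via `callJStatic`/`callJMethod`) to the existing Scala API on `org.apache.spark.sql.functions` and `Column`. The following is a minimal Scala sketch of that underlying API, assuming a spark-shell style session where `sqlContext` already exists; the column name and label values are illustrative only.

```scala
import org.apache.spark.sql.functions.when

// `sqlContext` is assumed to be the SQLContext provided by spark-shell.
val df = sqlContext.range(0, 5).toDF("age")

// when(condition, value) builds a conditional Column;
// otherwise(value) supplies the default for rows no condition matched.
df.select(when(df("age") > 2, "old").otherwise("young").as("label")).show()
```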
spark git commit: [SPARK-10095] [SQL] use public API of BigInteger
Repository: spark Updated Branches: refs/heads/branch-1.5 ebaeb1892 - 11c933505 [SPARK-10095] [SQL] use public API of BigInteger In UnsafeRow, we use the private field of BigInteger for better performance, but it actually didn't contribute much (3% in one benchmark) to end-to-end runtime, and make it not portable (may fail on other JVM implementations). So we should use the public API instead. cc rxin Author: Davies Liu dav...@databricks.com Closes #8286 from davies/portable_decimal. (cherry picked from commit 270ee677750a1f2adaf24b5816857194e61782ff) Signed-off-by: Davies Liu davies@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/11c93350 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/11c93350 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/11c93350 Branch: refs/heads/branch-1.5 Commit: 11c93350589f682f5713049b959a51549acc9aca Parents: ebaeb18 Author: Davies Liu dav...@databricks.com Authored: Tue Aug 18 20:39:59 2015 -0700 Committer: Davies Liu davies@gmail.com Committed: Tue Aug 18 20:40:12 2015 -0700 -- .../sql/catalyst/expressions/UnsafeRow.java | 29 ++-- .../catalyst/expressions/UnsafeRowWriters.java | 9 ++ .../java/org/apache/spark/unsafe/Platform.java | 18 3 files changed, 11 insertions(+), 45 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/11c93350/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java -- diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java index 7fd9477..6c02004 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java @@ -273,14 +273,13 @@ public final class UnsafeRow extends MutableRow { } else { final BigInteger integer = value.toJavaBigDecimal().unscaledValue(); -final int[] mag = (int[]) Platform.getObjectVolatile(integer, - Platform.BIG_INTEGER_MAG_OFFSET); -assert(mag.length = 4); +byte[] bytes = integer.toByteArray(); +assert(bytes.length = 16); // Write the bytes to the variable length portion. 
Platform.copyMemory( - mag, Platform.INT_ARRAY_OFFSET, baseObject, baseOffset + cursor, mag.length * 4); -setLong(ordinal, (cursor 32) | ((long) (((integer.signum() + 1) 8) + mag.length))); + bytes, Platform.BYTE_ARRAY_OFFSET, baseObject, baseOffset + cursor, bytes.length); +setLong(ordinal, (cursor 32) | ((long) bytes.length)); } } } @@ -375,8 +374,6 @@ public final class UnsafeRow extends MutableRow { return Platform.getDouble(baseObject, getFieldOffset(ordinal)); } - private static byte[] EMPTY = new byte[0]; - @Override public Decimal getDecimal(int ordinal, int precision, int scale) { if (isNullAt(ordinal)) { @@ -385,20 +382,10 @@ public final class UnsafeRow extends MutableRow { if (precision = Decimal.MAX_LONG_DIGITS()) { return Decimal.apply(getLong(ordinal), precision, scale); } else { - long offsetAndSize = getLong(ordinal); - long offset = offsetAndSize 32; - int signum = ((int) (offsetAndSize 0xfff) 8); - assert signum =0 signum = 2 : invalid signum + signum; - int size = (int) (offsetAndSize 0xff); - int[] mag = new int[size]; - Platform.copyMemory( -baseObject, baseOffset + offset, mag, Platform.INT_ARRAY_OFFSET, size * 4); - - // create a BigInteger using signum and mag - BigInteger v = new BigInteger(0, EMPTY); // create the initial object - Platform.putInt(v, Platform.BIG_INTEGER_SIGNUM_OFFSET, signum - 1); - Platform.putObjectVolatile(v, Platform.BIG_INTEGER_MAG_OFFSET, mag); - return Decimal.apply(new BigDecimal(v, scale), precision, scale); + byte[] bytes = getBinary(ordinal); + BigInteger bigInteger = new BigInteger(bytes); + BigDecimal javaDecimal = new BigDecimal(bigInteger, scale); + return Decimal.apply(javaDecimal, precision, scale); } } http://git-wip-us.apache.org/repos/asf/spark/blob/11c93350/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java -- diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRowWriters.java index 005351f..2f43db6 100644 ---
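The patch replaces reads and writes of BigInteger's private `signum`/`mag` fields with the public `toByteArray` / `new BigInteger(bytes)` round trip. A self-contained sketch of that round trip on plain JDK types, independent of UnsafeRow (the decimal value is illustrative):

```scala
import java.math.{BigDecimal => JBigDecimal, BigInteger}

// A decimal too wide for a single long, held as an unscaled value plus a scale.
val original = new JBigDecimal("123456789012345678901234.56789")
val scale = original.scale()

// Serialize via the public API: two's-complement, big-endian bytes.
val bytes: Array[Byte] = original.unscaledValue().toByteArray
assert(bytes.length <= 16)  // mirrors the bound asserted in UnsafeRow

// Deserialize: rebuild the unscaled BigInteger, then reattach the scale.
val restored = new JBigDecimal(new BigInteger(bytes), scale)
assert(restored == original)
```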
spark git commit: [SPARK-10075] [SPARKR] Add `when` expression function in SparkR
Repository: spark Updated Branches: refs/heads/branch-1.5 bb2fb59f9 - ebaeb1892 [SPARK-10075] [SPARKR] Add `when` expressino function in SparkR - Add `when` and `otherwise` as `Column` methods - Add `When` as an expression function - Add `%otherwise%` infix as an alias of `otherwise` Since R doesn't support a feature like method chaining, `otherwise(when(condition, value), value)` style is a little annoying for me. If `%otherwise%` looks strange for shivaram, I can remove it. What do you think? ### JIRA [[SPARK-10075] Add `when` expressino function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075) Author: Yu ISHIKAWA yuu.ishik...@gmail.com Closes #8266 from yu-iskw/SPARK-10075. (cherry picked from commit bf32c1f7f47dd907d787469f979c5859e02ce5e6) Signed-off-by: Shivaram Venkataraman shiva...@cs.berkeley.edu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ebaeb189 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ebaeb189 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ebaeb189 Branch: refs/heads/branch-1.5 Commit: ebaeb189260dd338fc5a91d8ec3ff6d45989991a Parents: bb2fb59 Author: Yu ISHIKAWA yuu.ishik...@gmail.com Authored: Tue Aug 18 20:27:36 2015 -0700 Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu Committed: Tue Aug 18 20:29:34 2015 -0700 -- R/pkg/NAMESPACE | 2 ++ R/pkg/R/column.R | 14 ++ R/pkg/R/functions.R | 14 ++ R/pkg/R/generics.R | 8 R/pkg/inst/tests/test_sparkSQL.R | 7 +++ 5 files changed, 45 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ebaeb189/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 607aef2..8fa12d5 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -152,6 +152,7 @@ exportMethods(abs, n_distinct, nanvl, negate, + otherwise, pmod, quarter, reverse, @@ -182,6 +183,7 @@ exportMethods(abs, unhex, upper, weekofyear, + when, year) exportClasses(GroupedData) http://git-wip-us.apache.org/repos/asf/spark/blob/ebaeb189/R/pkg/R/column.R -- diff --git a/R/pkg/R/column.R b/R/pkg/R/column.R index 328f595..5a07ebd 100644 --- a/R/pkg/R/column.R +++ b/R/pkg/R/column.R @@ -203,3 +203,17 @@ setMethod(%in%, jc - callJMethod(x@jc, in, table) return(column(jc)) }) + +#' otherwise +#' +#' If values in the specified column are null, returns the value. +#' Can be used in conjunction with `when` to specify a default value for expressions. +#' +#' @rdname column +setMethod(otherwise, + signature(x = Column, value = ANY), + function(x, value) { +value - ifelse(class(value) == Column, value@jc, value) +jc - callJMethod(x@jc, otherwise, value) +column(jc) + }) http://git-wip-us.apache.org/repos/asf/spark/blob/ebaeb189/R/pkg/R/functions.R -- diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index e606b20..366c230 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -165,3 +165,17 @@ setMethod(n, signature(x = Column), function(x) { count(x) }) + +#' when +#' +#' Evaluates a list of conditions and returns one of multiple possible result expressions. +#' For unmatched expressions null is returned. 
+#' +#' @rdname column +setMethod(when, signature(condition = Column, value = ANY), + function(condition, value) { + condition - condition@jc + value - ifelse(class(value) == Column, value@jc, value) + jc - callJStatic(org.apache.spark.sql.functions, when, condition, value) + column(jc) + }) http://git-wip-us.apache.org/repos/asf/spark/blob/ebaeb189/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 5c1cc98..338b32e 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -651,6 +651,14 @@ setGeneric(rlike, function(x, ...) { standardGeneric(rlike) }) #' @export setGeneric(startsWith, function(x, ...) { standardGeneric(startsWith) }) +#' @rdname column +#' @export +setGeneric(when, function(condition, value) { standardGeneric(when) }) + +#' @rdname column +#' @export +setGeneric(otherwise, function(x, value) { standardGeneric(otherwise) }) + ## Expression Function Methods
spark git commit: [SPARK-9574] [STREAMING] Remove unnecessary contents of spark-streaming-XXX-assembly jars
Repository: spark Updated Branches: refs/heads/master 8bae9015b - bf1d6614d [SPARK-9574] [STREAMING] Remove unnecessary contents of spark-streaming-XXX-assembly jars Removed contents already included in Spark assembly jar from spark-streaming-XXX-assembly jars. Author: zsxwing zsxw...@gmail.com Closes #8069 from zsxwing/SPARK-9574. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bf1d6614 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bf1d6614 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bf1d6614 Branch: refs/heads/master Commit: bf1d6614dcb8f5974e62e406d9c0f8aac52556d3 Parents: 8bae901 Author: zsxwing zsxw...@gmail.com Authored: Tue Aug 18 13:35:45 2015 -0700 Committer: Tathagata Das tathagata.das1...@gmail.com Committed: Tue Aug 18 13:35:45 2015 -0700 -- external/flume-assembly/pom.xml | 11 + external/kafka-assembly/pom.xml | 84 external/mqtt-assembly/pom.xml | 74 extras/kinesis-asl-assembly/pom.xml | 79 ++ pom.xml | 2 +- 5 files changed, 249 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bf1d6614/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index 1318959..e05e431 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -69,6 +69,11 @@ scopeprovided/scope /dependency dependency + groupIdcommons-lang/groupId + artifactIdcommons-lang/artifactId + scopeprovided/scope +/dependency +dependency groupIdcommons-net/groupId artifactIdcommons-net/artifactId scopeprovided/scope @@ -89,6 +94,12 @@ scopeprovided/scope /dependency dependency + groupIdorg.apache.avro/groupId + artifactIdavro-mapred/artifactId + classifier${avro.mapred.classifier}/classifier + scopeprovided/scope +/dependency +dependency groupIdorg.scala-lang/groupId artifactIdscala-library/artifactId scopeprovided/scope http://git-wip-us.apache.org/repos/asf/spark/blob/bf1d6614/external/kafka-assembly/pom.xml -- diff --git a/external/kafka-assembly/pom.xml b/external/kafka-assembly/pom.xml index 977514f..36342f3 100644 --- a/external/kafka-assembly/pom.xml +++ b/external/kafka-assembly/pom.xml @@ -47,6 +47,90 @@ version${project.version}/version scopeprovided/scope /dependency +!-- + Demote already included in the Spark assembly. 
+-- +dependency + groupIdcommons-codec/groupId + artifactIdcommons-codec/artifactId + scopeprovided/scope +/dependency +dependency + groupIdcommons-lang/groupId + artifactIdcommons-lang/artifactId + scopeprovided/scope +/dependency +dependency + groupIdcom.google.protobuf/groupId + artifactIdprotobuf-java/artifactId + scopeprovided/scope +/dependency +dependency + groupIdcom.sun.jersey/groupId + artifactIdjersey-server/artifactId + scopeprovided/scope +/dependency +dependency + groupIdcom.sun.jersey/groupId + artifactIdjersey-core/artifactId + scopeprovided/scope +/dependency +dependency + groupIdnet.jpountz.lz4/groupId + artifactIdlz4/artifactId + scopeprovided/scope +/dependency +dependency + groupIdorg.apache.hadoop/groupId + artifactIdhadoop-client/artifactId + scopeprovided/scope +/dependency +dependency + groupIdorg.apache.avro/groupId + artifactIdavro-mapred/artifactId + classifier${avro.mapred.classifier}/classifier + scopeprovided/scope +/dependency +dependency + groupIdorg.apache.curator/groupId + artifactIdcurator-recipes/artifactId + scopeprovided/scope +/dependency +dependency + groupIdorg.apache.zookeeper/groupId + artifactIdzookeeper/artifactId + scopeprovided/scope +/dependency +dependency + groupIdlog4j/groupId + artifactIdlog4j/artifactId + scopeprovided/scope +/dependency +dependency + groupIdnet.java.dev.jets3t/groupId + artifactIdjets3t/artifactId + scopeprovided/scope +/dependency +dependency + groupIdorg.scala-lang/groupId + artifactIdscala-library/artifactId + scopeprovided/scope +/dependency +dependency + groupIdorg.slf4j/groupId + artifactIdslf4j-api/artifactId + scopeprovided/scope +/dependency +dependency + groupIdorg.slf4j/groupId + artifactIdslf4j-log4j12/artifactId
spark git commit: [SPARK-9782] [YARN] Support YARN application tags via SparkConf
Repository: spark Updated Branches: refs/heads/master 80cb25b22 - 9b731fad2 [SPARK-9782] [YARN] Support YARN application tags via SparkConf Add a new test case in yarn/ClientSuite which checks how the various SparkConf and ClientArguments propagate into the ApplicationSubmissionContext. Author: Dennis Huo d...@google.com Closes #8072 from dennishuo/dhuo-yarn-application-tags. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b731fad Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b731fad Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b731fad Branch: refs/heads/master Commit: 9b731fad2b43ca18f3c5274062d4c7bc2622ab72 Parents: 80cb25b Author: Dennis Huo d...@google.com Authored: Tue Aug 18 14:34:20 2015 -0700 Committer: Sandy Ryza sa...@cloudera.com Committed: Tue Aug 18 14:34:20 2015 -0700 -- docs/running-on-yarn.md | 8 + .../org/apache/spark/deploy/yarn/Client.scala | 21 .../apache/spark/deploy/yarn/ClientSuite.scala | 36 3 files changed, 65 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9b731fad/docs/running-on-yarn.md -- diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md index ec32c41..8ac26e9 100644 --- a/docs/running-on-yarn.md +++ b/docs/running-on-yarn.md @@ -320,6 +320,14 @@ If you need a reference to the proper location to put log files in the YARN so t /td /tr tr + tdcodespark.yarn.tags/code/td + td(none)/td + td + Comma-separated list of strings to pass through as YARN application tags appearing + in YARN ApplicationReports, which can be used for filtering when querying YARN apps. + /td +/tr +tr tdcodespark.yarn.keytab/code/td td(none)/td td http://git-wip-us.apache.org/repos/asf/spark/blob/9b731fad/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala -- diff --git a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala index 6d63dda..5c6a716 100644 --- a/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala +++ b/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala @@ -163,6 +163,23 @@ private[spark] class Client( appContext.setQueue(args.amQueue) appContext.setAMContainerSpec(containerContext) appContext.setApplicationType(SPARK) +sparkConf.getOption(CONF_SPARK_YARN_APPLICATION_TAGS) + .map(StringUtils.getTrimmedStringCollection(_)) + .filter(!_.isEmpty()) + .foreach { tagCollection = +try { + // The setApplicationTags method was only introduced in Hadoop 2.4+, so we need to use + // reflection to set it, printing a warning if a tag was specified but the YARN version + // doesn't support it. + val method = appContext.getClass().getMethod( +setApplicationTags, classOf[java.util.Set[String]]) + method.invoke(appContext, new java.util.HashSet[String](tagCollection)) +} catch { + case e: NoSuchMethodException = +logWarning(sIgnoring $CONF_SPARK_YARN_APPLICATION_TAGS because this version of + + YARN does not support it) +} + } sparkConf.getOption(spark.yarn.maxAppAttempts).map(_.toInt) match { case Some(v) = appContext.setMaxAppAttempts(v) case None = logDebug(spark.yarn.maxAppAttempts is not set. + @@ -987,6 +1004,10 @@ object Client extends Logging { // of the executors val CONF_SPARK_YARN_SECONDARY_JARS = spark.yarn.secondary.jars + // Comma-separated list of strings to pass through as YARN application tags appearing + // in YARN ApplicationReports, which can be used for filtering when querying YARN. 
+ val CONF_SPARK_YARN_APPLICATION_TAGS = spark.yarn.tags + // Staging directory is private! - rwx val STAGING_DIR_PERMISSION: FsPermission = FsPermission.createImmutable(Integer.parseInt(700, 8).toShort) http://git-wip-us.apache.org/repos/asf/spark/blob/9b731fad/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala -- diff --git a/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala b/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala index 837f8d3..0a5402c 100644 --- a/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala +++ b/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala @@ -29,8 +29,11 @@ import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.Path import org.apache.hadoop.mapreduce.MRJobConfig import
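The notable part of this change is calling `setApplicationTags` reflectively so the same build still runs against Hadoop versions that predate the method. Below is a generic sketch of that pattern, not the actual Client code; the `appContext` parameter and the silent fallback are assumptions for illustration.

```scala
import scala.collection.JavaConverters._

// Invoke a setter that only exists in newer versions of a dependency
// (setApplicationTags appeared in Hadoop 2.4) and degrade gracefully
// when it is absent. `appContext` stands for a YARN
// ApplicationSubmissionContext instance.
def trySetTags(appContext: AnyRef, tags: Seq[String]): Unit = {
  try {
    val m = appContext.getClass.getMethod(
      "setApplicationTags", classOf[java.util.Set[String]])
    m.invoke(appContext, new java.util.HashSet[String](tags.asJava))
  } catch {
    case _: NoSuchMethodException =>
      // Older YARN: the tags are simply ignored, mirroring the logWarning above.
      ()
  }
}
```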
spark git commit: [SPARK-10088] [SQL] Add support for stored as avro in HiveQL parser.
Repository: spark Updated Branches: refs/heads/branch-1.5 74a6b1a13 - 8b0df5a5e [SPARK-10088] [SQL] Add support for stored as avro in HiveQL parser. Author: Marcelo Vanzin van...@cloudera.com Closes #8282 from vanzin/SPARK-10088. (cherry picked from commit 492ac1facbc79ee251d45cff315598ec9935a0e2) Signed-off-by: Michael Armbrust mich...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8b0df5a5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8b0df5a5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8b0df5a5 Branch: refs/heads/branch-1.5 Commit: 8b0df5a5e6dbd90d0741e0f2279222574b8acb20 Parents: 74a6b1a Author: Marcelo Vanzin van...@cloudera.com Authored: Tue Aug 18 14:45:19 2015 -0700 Committer: Michael Armbrust mich...@databricks.com Committed: Tue Aug 18 14:45:35 2015 -0700 -- .../main/scala/org/apache/spark/sql/hive/HiveQl.scala | 11 +++ .../scala/org/apache/spark/sql/hive/test/TestHive.scala | 12 ++-- 2 files changed, 13 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8b0df5a5/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala index c3f2935..ad33dee 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala @@ -729,6 +729,17 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C inputFormat = Option(org.apache.hadoop.mapred.SequenceFileInputFormat), outputFormat = Option(org.apache.hadoop.mapred.SequenceFileOutputFormat)) +case avro = + tableDesc = tableDesc.copy( +inputFormat = + Option(org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat), +outputFormat = + Option(org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat)) + if (tableDesc.serde.isEmpty) { +tableDesc = tableDesc.copy( + serde = Option(org.apache.hadoop.hive.serde2.avro.AvroSerDe)) + } + case _ = throw new SemanticException( sUnrecognized file format in STORED AS clause: ${child.getText}) http://git-wip-us.apache.org/repos/asf/spark/blob/8b0df5a5/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala index 4eae699..4da8663 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala @@ -25,10 +25,8 @@ import scala.language.implicitConversions import org.apache.hadoop.hive.conf.HiveConf.ConfVars import org.apache.hadoop.hive.ql.exec.FunctionRegistry -import org.apache.hadoop.hive.ql.io.avro.{AvroContainerInputFormat, AvroContainerOutputFormat} import org.apache.hadoop.hive.ql.processors._ import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe -import org.apache.hadoop.hive.serde2.avro.AvroSerDe import org.apache.spark.sql.SQLConf import org.apache.spark.sql.catalyst.analysis._ @@ -276,10 +274,7 @@ class TestHiveContext(sc: SparkContext) extends HiveContext(sc) { INSERT OVERWRITE TABLE serdeins SELECT * FROM src.cmd), TestTable(episodes, sCREATE TABLE episodes (title STRING, air_date STRING, doctor INT) - |ROW FORMAT SERDE '${classOf[AvroSerDe].getCanonicalName}' - |STORED AS - |INPUTFORMAT '${classOf[AvroContainerInputFormat].getCanonicalName}' 
- |OUTPUTFORMAT '${classOf[AvroContainerOutputFormat].getCanonicalName}' + |STORED AS avro |TBLPROPERTIES ( | 'avro.schema.literal'='{ |type: record, @@ -312,10 +307,7 @@ class TestHiveContext(sc: SparkContext) extends HiveContext(sc) { TestTable(episodes_part, sCREATE TABLE episodes_part (title STRING, air_date STRING, doctor INT) |PARTITIONED BY (doctor_pt INT) - |ROW FORMAT SERDE '${classOf[AvroSerDe].getCanonicalName}' - |STORED AS - |INPUTFORMAT '${classOf[AvroContainerInputFormat].getCanonicalName}' - |OUTPUTFORMAT '${classOf[AvroContainerOutputFormat].getCanonicalName}' + |STORED AS avro |TBLPROPERTIES ( | 'avro.schema.literal'='{ |type: record,
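With this parser change, `STORED AS avro` expands to the Avro container input/output formats (and the AvroSerDe when no SerDe is given), so DDL no longer has to spell them out. A hedged sketch of issuing such a statement through a HiveContext follows; the context name and table definition are illustrative, and depending on the Hive version an `avro.schema.literal` table property may still be required, as in the TestHive tables above.

```scala
// Assumes an existing HiveContext named `hiveContext`; the table is illustrative.
hiveContext.sql(
  """CREATE TABLE episodes_demo (title STRING, air_date STRING, doctor INT)
    |STORED AS avro
  """.stripMargin)
```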
spark git commit: [SPARK-10088] [SQL] Add support for stored as avro in HiveQL parser.
Repository: spark Updated Branches: refs/heads/master fa41e0242 - 492ac1fac [SPARK-10088] [SQL] Add support for stored as avro in HiveQL parser. Author: Marcelo Vanzin van...@cloudera.com Closes #8282 from vanzin/SPARK-10088. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/492ac1fa Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/492ac1fa Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/492ac1fa Branch: refs/heads/master Commit: 492ac1facbc79ee251d45cff315598ec9935a0e2 Parents: fa41e02 Author: Marcelo Vanzin van...@cloudera.com Authored: Tue Aug 18 14:45:19 2015 -0700 Committer: Michael Armbrust mich...@databricks.com Committed: Tue Aug 18 14:45:19 2015 -0700 -- .../main/scala/org/apache/spark/sql/hive/HiveQl.scala | 11 +++ .../scala/org/apache/spark/sql/hive/test/TestHive.scala | 12 ++-- 2 files changed, 13 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/492ac1fa/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala index c3f2935..ad33dee 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala @@ -729,6 +729,17 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C inputFormat = Option(org.apache.hadoop.mapred.SequenceFileInputFormat), outputFormat = Option(org.apache.hadoop.mapred.SequenceFileOutputFormat)) +case avro = + tableDesc = tableDesc.copy( +inputFormat = + Option(org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat), +outputFormat = + Option(org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat)) + if (tableDesc.serde.isEmpty) { +tableDesc = tableDesc.copy( + serde = Option(org.apache.hadoop.hive.serde2.avro.AvroSerDe)) + } + case _ = throw new SemanticException( sUnrecognized file format in STORED AS clause: ${child.getText}) http://git-wip-us.apache.org/repos/asf/spark/blob/492ac1fa/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala -- diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala index 4eae699..4da8663 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala @@ -25,10 +25,8 @@ import scala.language.implicitConversions import org.apache.hadoop.hive.conf.HiveConf.ConfVars import org.apache.hadoop.hive.ql.exec.FunctionRegistry -import org.apache.hadoop.hive.ql.io.avro.{AvroContainerInputFormat, AvroContainerOutputFormat} import org.apache.hadoop.hive.ql.processors._ import org.apache.hadoop.hive.serde2.`lazy`.LazySimpleSerDe -import org.apache.hadoop.hive.serde2.avro.AvroSerDe import org.apache.spark.sql.SQLConf import org.apache.spark.sql.catalyst.analysis._ @@ -276,10 +274,7 @@ class TestHiveContext(sc: SparkContext) extends HiveContext(sc) { INSERT OVERWRITE TABLE serdeins SELECT * FROM src.cmd), TestTable(episodes, sCREATE TABLE episodes (title STRING, air_date STRING, doctor INT) - |ROW FORMAT SERDE '${classOf[AvroSerDe].getCanonicalName}' - |STORED AS - |INPUTFORMAT '${classOf[AvroContainerInputFormat].getCanonicalName}' - |OUTPUTFORMAT '${classOf[AvroContainerOutputFormat].getCanonicalName}' + |STORED AS avro |TBLPROPERTIES ( | 
'avro.schema.literal'='{ |type: record, @@ -312,10 +307,7 @@ class TestHiveContext(sc: SparkContext) extends HiveContext(sc) { TestTable(episodes_part, sCREATE TABLE episodes_part (title STRING, air_date STRING, doctor INT) |PARTITIONED BY (doctor_pt INT) - |ROW FORMAT SERDE '${classOf[AvroSerDe].getCanonicalName}' - |STORED AS - |INPUTFORMAT '${classOf[AvroContainerInputFormat].getCanonicalName}' - |OUTPUTFORMAT '${classOf[AvroContainerOutputFormat].getCanonicalName}' + |STORED AS avro |TBLPROPERTIES ( | 'avro.schema.literal'='{ |type: record, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands,
spark git commit: [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree
Repository: spark Updated Branches: refs/heads/branch-1.5 8b0df5a5e - 56f4da263 [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree Added since tags to mllib.tree Author: Bryan Cutler bjcut...@us.ibm.com Closes #7380 from BryanCutler/sinceTag-mllibTree-8924. (cherry picked from commit 1dbffba37a84c62202befd3911d25888f958191d) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/56f4da26 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/56f4da26 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/56f4da26 Branch: refs/heads/branch-1.5 Commit: 56f4da2633aab6d1f25c03b1cf567c2c68374fb5 Parents: 8b0df5a Author: Bryan Cutler bjcut...@us.ibm.com Authored: Tue Aug 18 14:58:30 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 14:58:37 2015 -0700 -- .../apache/spark/mllib/tree/DecisionTree.scala | 13 +++ .../spark/mllib/tree/GradientBoostedTrees.scala | 10 ++ .../apache/spark/mllib/tree/RandomForest.scala | 10 ++ .../spark/mllib/tree/configuration/Algo.scala | 1 + .../tree/configuration/BoostingStrategy.scala | 6 .../mllib/tree/configuration/FeatureType.scala | 1 + .../tree/configuration/QuantileStrategy.scala | 1 + .../mllib/tree/configuration/Strategy.scala | 20 ++- .../spark/mllib/tree/impurity/Entropy.scala | 4 +++ .../apache/spark/mllib/tree/impurity/Gini.scala | 4 +++ .../spark/mllib/tree/impurity/Impurity.scala| 3 ++ .../spark/mllib/tree/impurity/Variance.scala| 4 +++ .../spark/mllib/tree/loss/AbsoluteError.scala | 2 ++ .../apache/spark/mllib/tree/loss/LogLoss.scala | 2 ++ .../org/apache/spark/mllib/tree/loss/Loss.scala | 3 ++ .../apache/spark/mllib/tree/loss/Losses.scala | 6 .../spark/mllib/tree/loss/SquaredError.scala| 2 ++ .../mllib/tree/model/DecisionTreeModel.scala| 22 .../mllib/tree/model/InformationGainStats.scala | 1 + .../apache/spark/mllib/tree/model/Node.scala| 3 ++ .../apache/spark/mllib/tree/model/Predict.scala | 1 + .../apache/spark/mllib/tree/model/Split.scala | 1 + .../mllib/tree/model/treeEnsembleModels.scala | 37 .../org/apache/spark/mllib/tree/package.scala | 1 + 24 files changed, 157 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/56f4da26/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala index cecd1fe..e5200b8 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala @@ -43,6 +43,7 @@ import org.apache.spark.util.random.XORShiftRandom * @param strategy The configuration parameters for the tree algorithm which specify the type * of algorithm (classification, regression, etc.), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. 
+ * @since 1.0.0 */ @Experimental class DecisionTree (private val strategy: Strategy) extends Serializable with Logging { @@ -53,6 +54,7 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo * Method to train a decision tree model over an RDD * @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] * @return DecisionTreeModel that can be used for prediction + * @since 1.2.0 */ def run(input: RDD[LabeledPoint]): DecisionTreeModel = { // Note: random seed will not be used since numTrees = 1. @@ -62,6 +64,9 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo } } +/** + * @since 1.0.0 + */ object DecisionTree extends Serializable with Logging { /** @@ -79,6 +84,7 @@ object DecisionTree extends Serializable with Logging { * of algorithm (classification, regression, etc.), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. * @return DecisionTreeModel that can be used for prediction + * @since 1.0.0 */ def train(input: RDD[LabeledPoint], strategy: Strategy): DecisionTreeModel = { new DecisionTree(strategy).run(input) @@ -100,6 +106,7 @@ object DecisionTree extends Serializable with Logging { * @param maxDepth Maximum depth of the tree. * E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. * @return DecisionTreeModel that can be used for prediction + * @since
spark git commit: [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree
Repository: spark Updated Branches: refs/heads/master 492ac1fac - 1dbffba37 [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree Added since tags to mllib.tree Author: Bryan Cutler bjcut...@us.ibm.com Closes #7380 from BryanCutler/sinceTag-mllibTree-8924. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1dbffba3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1dbffba3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1dbffba3 Branch: refs/heads/master Commit: 1dbffba37a84c62202befd3911d25888f958191d Parents: 492ac1f Author: Bryan Cutler bjcut...@us.ibm.com Authored: Tue Aug 18 14:58:30 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 14:58:30 2015 -0700 -- .../apache/spark/mllib/tree/DecisionTree.scala | 13 +++ .../spark/mllib/tree/GradientBoostedTrees.scala | 10 ++ .../apache/spark/mllib/tree/RandomForest.scala | 10 ++ .../spark/mllib/tree/configuration/Algo.scala | 1 + .../tree/configuration/BoostingStrategy.scala | 6 .../mllib/tree/configuration/FeatureType.scala | 1 + .../tree/configuration/QuantileStrategy.scala | 1 + .../mllib/tree/configuration/Strategy.scala | 20 ++- .../spark/mllib/tree/impurity/Entropy.scala | 4 +++ .../apache/spark/mllib/tree/impurity/Gini.scala | 4 +++ .../spark/mllib/tree/impurity/Impurity.scala| 3 ++ .../spark/mllib/tree/impurity/Variance.scala| 4 +++ .../spark/mllib/tree/loss/AbsoluteError.scala | 2 ++ .../apache/spark/mllib/tree/loss/LogLoss.scala | 2 ++ .../org/apache/spark/mllib/tree/loss/Loss.scala | 3 ++ .../apache/spark/mllib/tree/loss/Losses.scala | 6 .../spark/mllib/tree/loss/SquaredError.scala| 2 ++ .../mllib/tree/model/DecisionTreeModel.scala| 22 .../mllib/tree/model/InformationGainStats.scala | 1 + .../apache/spark/mllib/tree/model/Node.scala| 3 ++ .../apache/spark/mllib/tree/model/Predict.scala | 1 + .../apache/spark/mllib/tree/model/Split.scala | 1 + .../mllib/tree/model/treeEnsembleModels.scala | 37 .../org/apache/spark/mllib/tree/package.scala | 1 + 24 files changed, 157 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1dbffba3/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala index cecd1fe..e5200b8 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala @@ -43,6 +43,7 @@ import org.apache.spark.util.random.XORShiftRandom * @param strategy The configuration parameters for the tree algorithm which specify the type * of algorithm (classification, regression, etc.), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. + * @since 1.0.0 */ @Experimental class DecisionTree (private val strategy: Strategy) extends Serializable with Logging { @@ -53,6 +54,7 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo * Method to train a decision tree model over an RDD * @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] * @return DecisionTreeModel that can be used for prediction + * @since 1.2.0 */ def run(input: RDD[LabeledPoint]): DecisionTreeModel = { // Note: random seed will not be used since numTrees = 1. 
@@ -62,6 +64,9 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo } } +/** + * @since 1.0.0 + */ object DecisionTree extends Serializable with Logging { /** @@ -79,6 +84,7 @@ object DecisionTree extends Serializable with Logging { * of algorithm (classification, regression, etc.), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. * @return DecisionTreeModel that can be used for prediction + * @since 1.0.0 */ def train(input: RDD[LabeledPoint], strategy: Strategy): DecisionTreeModel = { new DecisionTree(strategy).run(input) @@ -100,6 +106,7 @@ object DecisionTree extends Serializable with Logging { * @param maxDepth Maximum depth of the tree. * E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. * @return DecisionTreeModel that can be used for prediction + * @since 1.0.0 */ def train( input: RDD[LabeledPoint], @@ -127,6 +134,7 @@ object DecisionTree extends Serializable with
spark git commit: Bump SparkR version string to 1.5.0
Repository: spark Updated Branches: refs/heads/master badf7fa65 - 04e0fea79 Bump SparkR version string to 1.5.0 This patch is against master, but we need to apply it to 1.5 branch as well. cc shivaram and rxin Author: Hossein hoss...@databricks.com Closes #8291 from falaki/SparkRVersion1.5. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/04e0fea7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/04e0fea7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/04e0fea7 Branch: refs/heads/master Commit: 04e0fea79b9acfa3a3cb81dbacb08f9d287b42c3 Parents: badf7fa Author: Hossein hoss...@databricks.com Authored: Tue Aug 18 18:02:22 2015 -0700 Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu Committed: Tue Aug 18 18:02:22 2015 -0700 -- R/pkg/DESCRIPTION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/04e0fea7/R/pkg/DESCRIPTION -- diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION index 83e6489..d0d7201 100644 --- a/R/pkg/DESCRIPTION +++ b/R/pkg/DESCRIPTION @@ -1,7 +1,7 @@ Package: SparkR Type: Package Title: R frontend for Spark -Version: 1.4.0 +Version: 1.5.0 Date: 2013-09-09 Author: The Apache Software Foundation Maintainer: Shivaram Venkataraman shiva...@cs.berkeley.edu - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: Bump SparkR version string to 1.5.0
Repository: spark Updated Branches: refs/heads/branch-1.5 4ee225af8 - 9b42e2404 Bump SparkR version string to 1.5.0 This patch is against master, but we need to apply it to 1.5 branch as well. cc shivaram and rxin Author: Hossein hoss...@databricks.com Closes #8291 from falaki/SparkRVersion1.5. (cherry picked from commit 04e0fea79b9acfa3a3cb81dbacb08f9d287b42c3) Signed-off-by: Shivaram Venkataraman shiva...@cs.berkeley.edu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9b42e240 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9b42e240 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9b42e240 Branch: refs/heads/branch-1.5 Commit: 9b42e24049e072b315ec80e5bbe2ec5079a94704 Parents: 4ee225a Author: Hossein hoss...@databricks.com Authored: Tue Aug 18 18:02:22 2015 -0700 Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu Committed: Tue Aug 18 18:02:31 2015 -0700 -- R/pkg/DESCRIPTION | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9b42e240/R/pkg/DESCRIPTION -- diff --git a/R/pkg/DESCRIPTION b/R/pkg/DESCRIPTION index 83e6489..d0d7201 100644 --- a/R/pkg/DESCRIPTION +++ b/R/pkg/DESCRIPTION @@ -1,7 +1,7 @@ Package: SparkR Type: Package Title: R frontend for Spark -Version: 1.4.0 +Version: 1.5.0 Date: 2013-09-09 Author: The Apache Software Foundation Maintainer: Shivaram Venkataraman shiva...@cs.berkeley.edu - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10098] [STREAMING] [TEST] Cleanup active context after test in FailureSuite
Repository: spark Updated Branches: refs/heads/master c635a16f6 - 9108eff74 [SPARK-10098] [STREAMING] [TEST] Cleanup active context after test in FailureSuite Failures in streaming.FailureSuite can leak StreamingContext and SparkContext which fails all subsequent tests Author: Tathagata Das tathagata.das1...@gmail.com Closes #8289 from tdas/SPARK-10098. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9108eff7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9108eff7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9108eff7 Branch: refs/heads/master Commit: 9108eff74a2815986fd067b273c2a344b6315405 Parents: c635a16 Author: Tathagata Das tathagata.das1...@gmail.com Authored: Tue Aug 18 17:00:13 2015 -0700 Committer: Tathagata Das tathagata.das1...@gmail.com Committed: Tue Aug 18 17:00:13 2015 -0700 -- .../apache/spark/streaming/FailureSuite.scala | 27 1 file changed, 17 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9108eff7/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala -- diff --git a/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala b/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala index 0c4c065..e82c2fa 100644 --- a/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala +++ b/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala @@ -17,25 +17,32 @@ package org.apache.spark.streaming -import org.apache.spark.Logging +import java.io.File + +import org.scalatest.BeforeAndAfter + +import org.apache.spark.{SparkFunSuite, Logging} import org.apache.spark.util.Utils /** * This testsuite tests master failures at random times while the stream is running using * the real clock. */ -class FailureSuite extends TestSuiteBase with Logging { +class FailureSuite extends SparkFunSuite with BeforeAndAfter with Logging { - val directory = Utils.createTempDir() - val numBatches = 30 + private val batchDuration: Duration = Milliseconds(1000) + private val numBatches = 30 + private var directory: File = null - override def batchDuration: Duration = Milliseconds(1000) - - override def useManualClock: Boolean = false + before { +directory = Utils.createTempDir() + } - override def afterFunction() { -Utils.deleteRecursively(directory) -super.afterFunction() + after { +if (directory != null) { + Utils.deleteRecursively(directory) +} +StreamingContext.getActive().foreach { _.stop() } } test(multiple failures with map) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10098] [STREAMING] [TEST] Cleanup active context after test in FailureSuite
Repository: spark Updated Branches: refs/heads/branch-1.5 fb207b245 - e1b50c7d2 [SPARK-10098] [STREAMING] [TEST] Cleanup active context after test in FailureSuite Failures in streaming.FailureSuite can leak StreamingContext and SparkContext which fails all subsequent tests Author: Tathagata Das tathagata.das1...@gmail.com Closes #8289 from tdas/SPARK-10098. (cherry picked from commit 9108eff74a2815986fd067b273c2a344b6315405) Signed-off-by: Tathagata Das tathagata.das1...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e1b50c7d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e1b50c7d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e1b50c7d Branch: refs/heads/branch-1.5 Commit: e1b50c7d2a604f785e5fe9af5d60c426a6ff01c2 Parents: fb207b2 Author: Tathagata Das tathagata.das1...@gmail.com Authored: Tue Aug 18 17:00:13 2015 -0700 Committer: Tathagata Das tathagata.das1...@gmail.com Committed: Tue Aug 18 17:00:21 2015 -0700 -- .../apache/spark/streaming/FailureSuite.scala | 27 1 file changed, 17 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e1b50c7d/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala -- diff --git a/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala b/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala index 0c4c065..e82c2fa 100644 --- a/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala +++ b/streaming/src/test/scala/org/apache/spark/streaming/FailureSuite.scala @@ -17,25 +17,32 @@ package org.apache.spark.streaming -import org.apache.spark.Logging +import java.io.File + +import org.scalatest.BeforeAndAfter + +import org.apache.spark.{SparkFunSuite, Logging} import org.apache.spark.util.Utils /** * This testsuite tests master failures at random times while the stream is running using * the real clock. */ -class FailureSuite extends TestSuiteBase with Logging { +class FailureSuite extends SparkFunSuite with BeforeAndAfter with Logging { - val directory = Utils.createTempDir() - val numBatches = 30 + private val batchDuration: Duration = Milliseconds(1000) + private val numBatches = 30 + private var directory: File = null - override def batchDuration: Duration = Milliseconds(1000) - - override def useManualClock: Boolean = false + before { +directory = Utils.createTempDir() + } - override def afterFunction() { -Utils.deleteRecursively(directory) -super.afterFunction() + after { +if (directory != null) { + Utils.deleteRecursively(directory) +} +StreamingContext.getActive().foreach { _.stop() } } test(multiple failures with map) { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-8473] [SPARK-9889] [ML] User guide and example code for DCT
Repository: spark Updated Branches: refs/heads/branch-1.5 e1b50c7d2 - 4ee225af8 [SPARK-8473] [SPARK-9889] [ML] User guide and example code for DCT mengxr jkbradley Author: Feynman Liang fli...@databricks.com Closes #8184 from feynmanliang/SPARK-9889-DCT-docs. (cherry picked from commit badf7fa650f9801c70515907fcc26b58d7ec3143) Signed-off-by: Joseph K. Bradley jos...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4ee225af Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4ee225af Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4ee225af Branch: refs/heads/branch-1.5 Commit: 4ee225af8ecb38fbcf8e43ac1c498a76f3590b98 Parents: e1b50c7 Author: Feynman Liang fli...@databricks.com Authored: Tue Aug 18 17:54:49 2015 -0700 Committer: Joseph K. Bradley jos...@databricks.com Committed: Tue Aug 18 17:54:58 2015 -0700 -- docs/ml-features.md | 71 1 file changed, 71 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4ee225af/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 6b2e36b..28a6193 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -649,6 +649,77 @@ for expanded in polyDF.select(polyFeatures).take(3): /div /div +## Discrete Cosine Transform (DCT) + +The [Discrete Cosine +Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform) +transforms a length $N$ real-valued sequence in the time domain into +another length $N$ real-valued sequence in the frequency domain. A +[DCT](api/scala/index.html#org.apache.spark.ml.feature.DCT) class +provides this functionality, implementing the +[DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II) +and scaling the result by $1/\sqrt{2}$ such that the representing matrix +for the transform is unitary. No shift is applied to the transformed +sequence (e.g. the $0$th element of the transformed sequence is the +$0$th DCT coefficient and _not_ the $N/2$th). 
+ +div class=codetabs +div data-lang=scala markdown=1 +{% highlight scala %} +import org.apache.spark.ml.feature.DCT +import org.apache.spark.mllib.linalg.Vectors + +val data = Seq( + Vectors.dense(0.0, 1.0, -2.0, 3.0), + Vectors.dense(-1.0, 2.0, 4.0, -7.0), + Vectors.dense(14.0, -2.0, -5.0, 1.0)) +val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF(features) +val dct = new DCT() + .setInputCol(features) + .setOutputCol(featuresDCT) + .setInverse(false) +val dctDf = dct.transform(df) +dctDf.select(featuresDCT).show(3) +{% endhighlight %} +/div + +div data-lang=java markdown=1 +{% highlight java %} +import java.util.Arrays; + +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.ml.feature.DCT; +import org.apache.spark.mllib.linalg.Vector; +import org.apache.spark.mllib.linalg.VectorUDT; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +JavaRDDRow data = jsc.parallelize(Arrays.asList( + RowFactory.create(Vectors.dense(0.0, 1.0, -2.0, 3.0)), + RowFactory.create(Vectors.dense(-1.0, 2.0, 4.0, -7.0)), + RowFactory.create(Vectors.dense(14.0, -2.0, -5.0, 1.0)) +)); +StructType schema = new StructType(new StructField[] { + new StructField(features, new VectorUDT(), false, Metadata.empty()), +}); +DataFrame df = jsql.createDataFrame(data, schema); +DCT dct = new DCT() + .setInputCol(features) + .setOutputCol(featuresDCT) + .setInverse(false); +DataFrame dctDf = dct.transform(df); +dctDf.select(featuresDCT).show(3); +{% endhighlight %} +/div +/div + ## StringIndexer `StringIndexer` encodes a string column of labels to a column of label indices. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-8473] [SPARK-9889] [ML] User guide and example code for DCT
Repository: spark Updated Branches: refs/heads/master 9108eff74 - badf7fa65 [SPARK-8473] [SPARK-9889] [ML] User guide and example code for DCT mengxr jkbradley Author: Feynman Liang fli...@databricks.com Closes #8184 from feynmanliang/SPARK-9889-DCT-docs. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/badf7fa6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/badf7fa6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/badf7fa6 Branch: refs/heads/master Commit: badf7fa650f9801c70515907fcc26b58d7ec3143 Parents: 9108eff Author: Feynman Liang fli...@databricks.com Authored: Tue Aug 18 17:54:49 2015 -0700 Committer: Joseph K. Bradley jos...@databricks.com Committed: Tue Aug 18 17:54:49 2015 -0700 -- docs/ml-features.md | 71 1 file changed, 71 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/badf7fa6/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 6b2e36b..28a6193 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -649,6 +649,77 @@ for expanded in polyDF.select(polyFeatures).take(3): /div /div +## Discrete Cosine Transform (DCT) + +The [Discrete Cosine +Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform) +transforms a length $N$ real-valued sequence in the time domain into +another length $N$ real-valued sequence in the frequency domain. A +[DCT](api/scala/index.html#org.apache.spark.ml.feature.DCT) class +provides this functionality, implementing the +[DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II) +and scaling the result by $1/\sqrt{2}$ such that the representing matrix +for the transform is unitary. No shift is applied to the transformed +sequence (e.g. the $0$th element of the transformed sequence is the +$0$th DCT coefficient and _not_ the $N/2$th). 
+ +div class=codetabs +div data-lang=scala markdown=1 +{% highlight scala %} +import org.apache.spark.ml.feature.DCT +import org.apache.spark.mllib.linalg.Vectors + +val data = Seq( + Vectors.dense(0.0, 1.0, -2.0, 3.0), + Vectors.dense(-1.0, 2.0, 4.0, -7.0), + Vectors.dense(14.0, -2.0, -5.0, 1.0)) +val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF(features) +val dct = new DCT() + .setInputCol(features) + .setOutputCol(featuresDCT) + .setInverse(false) +val dctDf = dct.transform(df) +dctDf.select(featuresDCT).show(3) +{% endhighlight %} +/div + +div data-lang=java markdown=1 +{% highlight java %} +import java.util.Arrays; + +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.ml.feature.DCT; +import org.apache.spark.mllib.linalg.Vector; +import org.apache.spark.mllib.linalg.VectorUDT; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +JavaRDDRow data = jsc.parallelize(Arrays.asList( + RowFactory.create(Vectors.dense(0.0, 1.0, -2.0, 3.0)), + RowFactory.create(Vectors.dense(-1.0, 2.0, 4.0, -7.0)), + RowFactory.create(Vectors.dense(14.0, -2.0, -5.0, 1.0)) +)); +StructType schema = new StructType(new StructField[] { + new StructField(features, new VectorUDT(), false, Metadata.empty()), +}); +DataFrame df = jsql.createDataFrame(data, schema); +DCT dct = new DCT() + .setInputCol(features) + .setOutputCol(featuresDCT) + .setInverse(false); +DataFrame dctDf = dct.transform(df); +dctDf.select(featuresDCT).show(3); +{% endhighlight %} +/div +/div + ## StringIndexer `StringIndexer` encodes a string column of labels to a column of label indices. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9705] [DOC] fix docs about Python version
Repository: spark Updated Branches: refs/heads/branch-1.5 3c33931aa - 03a8a889a [SPARK-9705] [DOC] fix docs about Python version cc JoshRosen Author: Davies Liu dav...@databricks.com Closes #8245 from davies/python_doc. (cherry picked from commit de3223872a217c5224ba7136604f6b7753b29108) Signed-off-by: Reynold Xin r...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/03a8a889 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/03a8a889 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/03a8a889 Branch: refs/heads/branch-1.5 Commit: 03a8a889a98ab30e4d33dc1a415aa84253111ffa Parents: 3c33931 Author: Davies Liu dav...@databricks.com Authored: Tue Aug 18 22:11:27 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Tue Aug 18 22:11:32 2015 -0700 -- docs/configuration.md | 6 +- docs/programming-guide.md | 12 ++-- 2 files changed, 15 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/03a8a889/docs/configuration.md -- diff --git a/docs/configuration.md b/docs/configuration.md index 3214709..4a6e4dd 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -1561,7 +1561,11 @@ The following variables can be set in `spark-env.sh`: /tr tr tdcodePYSPARK_PYTHON/code/td -tdPython binary executable to use for PySpark./td +tdPython binary executable to use for PySpark in both driver and workers (default is `python`)./td + /tr + tr +tdcodePYSPARK_DRIVER_PYTHON/code/td +tdPython binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON)./td /tr tr tdcodeSPARK_LOCAL_IP/code/td http://git-wip-us.apache.org/repos/asf/spark/blob/03a8a889/docs/programming-guide.md -- diff --git a/docs/programming-guide.md b/docs/programming-guide.md index ae712d6..982c5ea 100644 --- a/docs/programming-guide.md +++ b/docs/programming-guide.md @@ -85,8 +85,8 @@ import org.apache.spark.SparkConf div data-lang=python markdown=1 -Spark {{site.SPARK_VERSION}} works with Python 2.6 or higher (but not Python 3). It uses the standard CPython interpreter, -so C libraries like NumPy can be used. +Spark {{site.SPARK_VERSION}} works with Python 2.6+ or Python 3.4+. It can use the standard CPython interpreter, +so C libraries like NumPy can be used. It also works with PyPy 2.3+. To run Spark applications in Python, use the `bin/spark-submit` script located in the Spark directory. This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster. @@ -104,6 +104,14 @@ Finally, you need to import some Spark classes into your program. Add the follow from pyspark import SparkContext, SparkConf {% endhighlight %} +PySpark requires the same minor version of Python in both driver and workers. It uses the default python version in PATH, +you can specify which version of Python you want to use by `PYSPARK_PYTHON`, for example: + +{% highlight bash %} +$ PYSPARK_PYTHON=python3.4 bin/pyspark +$ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/python/pi.py +{% endhighlight %} + /div /div - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9705] [DOC] fix docs about Python version
Repository: spark Updated Branches: refs/heads/master 1ff0580ed - de3223872 [SPARK-9705] [DOC] fix docs about Python version cc JoshRosen Author: Davies Liu dav...@databricks.com Closes #8245 from davies/python_doc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/de322387 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/de322387 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/de322387 Branch: refs/heads/master Commit: de3223872a217c5224ba7136604f6b7753b29108 Parents: 1ff0580 Author: Davies Liu dav...@databricks.com Authored: Tue Aug 18 22:11:27 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Tue Aug 18 22:11:27 2015 -0700 -- docs/configuration.md | 6 +- docs/programming-guide.md | 12 ++-- 2 files changed, 15 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/de322387/docs/configuration.md -- diff --git a/docs/configuration.md b/docs/configuration.md index 3214709..4a6e4dd 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -1561,7 +1561,11 @@ The following variables can be set in `spark-env.sh`: /tr tr tdcodePYSPARK_PYTHON/code/td -tdPython binary executable to use for PySpark./td +tdPython binary executable to use for PySpark in both driver and workers (default is `python`)./td + /tr + tr +tdcodePYSPARK_DRIVER_PYTHON/code/td +tdPython binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON)./td /tr tr tdcodeSPARK_LOCAL_IP/code/td http://git-wip-us.apache.org/repos/asf/spark/blob/de322387/docs/programming-guide.md -- diff --git a/docs/programming-guide.md b/docs/programming-guide.md index ae712d6..982c5ea 100644 --- a/docs/programming-guide.md +++ b/docs/programming-guide.md @@ -85,8 +85,8 @@ import org.apache.spark.SparkConf div data-lang=python markdown=1 -Spark {{site.SPARK_VERSION}} works with Python 2.6 or higher (but not Python 3). It uses the standard CPython interpreter, -so C libraries like NumPy can be used. +Spark {{site.SPARK_VERSION}} works with Python 2.6+ or Python 3.4+. It can use the standard CPython interpreter, +so C libraries like NumPy can be used. It also works with PyPy 2.3+. To run Spark applications in Python, use the `bin/spark-submit` script located in the Spark directory. This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster. @@ -104,6 +104,14 @@ Finally, you need to import some Spark classes into your program. Add the follow from pyspark import SparkContext, SparkConf {% endhighlight %} +PySpark requires the same minor version of Python in both driver and workers. It uses the default python version in PATH, +you can specify which version of Python you want to use by `PYSPARK_PYTHON`, for example: + +{% highlight bash %} +$ PYSPARK_PYTHON=python3.4 bin/pyspark +$ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/python/pi.py +{% endhighlight %} + /div /div - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs
Repository: spark Updated Branches: refs/heads/master 1c843e284 - 010b03ed5 [SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine): ```scala val numItems = 10 val s = Seq.fill(numItems)(1) for (i - 0 until numItems) s(i) ``` It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput. This patch fixes this by replacing `Seq` with `Array`. Author: Josh Rosen joshro...@databricks.com Closes #8178 from JoshRosen/dagscheduler-perf. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/010b03ed Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/010b03ed Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/010b03ed Branch: refs/heads/master Commit: 010b03ed52f35fd4d426d522f8a9927ddc579209 Parents: 1c843e2 Author: Josh Rosen joshro...@databricks.com Authored: Tue Aug 18 22:30:13 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Tue Aug 18 22:30:13 2015 -0700 -- .../apache/spark/scheduler/DAGScheduler.scala | 22 ++-- .../spark/storage/BlockManagerMaster.scala | 5 +++-- .../storage/BlockManagerMasterEndpoint.scala| 3 ++- .../spark/scheduler/DAGSchedulerSuite.scala | 4 ++-- 4 files changed, 18 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/010b03ed/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala index dadf83a..684db66 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala @@ -111,7 +111,7 @@ class DAGScheduler( * * All accesses to this map should be guarded by synchronizing on it (see SPARK-4454). */ - private val cacheLocs = new HashMap[Int, Seq[Seq[TaskLocation]]] + private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]] // For tracking failed nodes, we use the MapOutputTracker's epoch number, which is sent with // every task. When we detect a node failing, we note the current epoch number and failed @@ -205,12 +205,12 @@ class DAGScheduler( } private[scheduler] - def getCacheLocs(rdd: RDD[_]): Seq[Seq[TaskLocation]] = cacheLocs.synchronized { + def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized { // Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times if (!cacheLocs.contains(rdd.id)) { // Note: if the storage level is NONE, we don't need to get locations from block manager. 
- val locs: Seq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) { -Seq.fill(rdd.partitions.size)(Nil) + val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) { +IndexedSeq.fill(rdd.partitions.length)(Nil) } else { val blockIds = rdd.partitions.indices.map(index = RDDBlockId(rdd.id, index)).toArray[BlockId] @@ -302,12 +302,12 @@ class DAGScheduler( shuffleDep: ShuffleDependency[_, _, _], firstJobId: Int): ShuffleMapStage = { val rdd = shuffleDep.rdd -val numTasks = rdd.partitions.size +val numTasks = rdd.partitions.length val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, rdd.creationSite) if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) { val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId) val locs = MapOutputTracker.deserializeMapStatuses(serLocs) - for (i - 0 until locs.size) { + for (i - 0 until locs.length) { stage.outputLocs(i) = Option(locs(i)).toList // locs(i) will be null if missing } stage.numAvailableOutputs = locs.count(_ != null) @@ -315,7 +315,7 @@ class DAGScheduler( // Kind of ugly: need to register RDDs with the cache and map output tracker here // since we can't do it in the RDD constructor because # of partitions is unknown
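The List-versus-IndexedSeq cost difference behind this patch is easy to reproduce outside Spark. The following standalone sketch (the timing helper and the item count are illustrative choices, not taken from the patch) contrasts `Seq.fill`, which builds a `List` with O(i) indexing, against `IndexedSeq.fill`, which builds a `Vector` with effectively constant-time indexing:

```scala
object SeqIndexingDemo {
  // Illustrative timing helper, not part of Spark: run a block and print elapsed milliseconds.
  private def time(label: String)(body: => Long): Unit = {
    val start = System.nanoTime()
    val result = body
    println(f"$label: checksum=$result, elapsed=${(System.nanoTime() - start) / 1e6}%.1f ms")
  }

  def main(args: Array[String]): Unit = {
    val numItems = 100000

    // Seq.fill builds a List; list(i) walks i cells, so this indexing loop is O(N^2) overall.
    val asList: Seq[Int] = Seq.fill(numItems)(1)
    time("List indexing") {
      var sum = 0L
      var i = 0
      while (i < numItems) { sum += asList(i); i += 1 }
      sum
    }

    // IndexedSeq.fill builds a Vector; vector(i) is effectively O(1), so this loop is O(N).
    val asIndexedSeq: IndexedSeq[Int] = IndexedSeq.fill(numItems)(1)
    time("IndexedSeq indexing") {
      var sum = 0L
      var i = 0
      while (i < numItems) { sum += asIndexedSeq(i); i += 1 }
      sum
    }
  }
}
```

The patch applies the same idea to `getCacheLocs`: callers index the returned collection once per partition, so returning an `IndexedSeq` keeps the preferred-location loop linear in the number of partitions.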
spark git commit: [SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors & fix UDFs on complex types
Repository: spark Updated Branches: refs/heads/master 270ee6777 - 1ff0580ed [SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors fix UDFs on complex types This is kind of a weird case, but given a sufficiently complex query plan (in this case a TungstenProject with an Exchange underneath), we could have NPEs on the executors due to the time when we were calling transformAllExpressions In general we should ensure that all transformations occur on the driver and not on the executors. Some reasons for avoid executor side transformations include: * (this case) Some operator constructors require state such as access to the Spark/SQL conf so doing a makeCopy on the executor can fail. * (unrelated reason for avoid executor transformations) ExprIds are calculated using an atomic integer, so you can violate their uniqueness constraint by constructing them anywhere other than the driver. This subsumes #8285. Author: Reynold Xin r...@databricks.com Author: Michael Armbrust mich...@databricks.com Closes #8295 from rxin/SPARK-10096. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1ff0580e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1ff0580e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1ff0580e Branch: refs/heads/master Commit: 1ff0580eda90f9247a5233809667f5cebaea290e Parents: 270ee67 Author: Reynold Xin r...@databricks.com Authored: Tue Aug 18 22:08:15 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Tue Aug 18 22:08:15 2015 -0700 -- .../expressions/complexTypeCreator.scala| 8 +++- .../spark/sql/execution/basicOperators.scala| 12 ++--- .../spark/sql/DataFrameComplexTypeSuite.scala | 46 .../org/apache/spark/sql/DataFrameSuite.scala | 9 4 files changed, 68 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1ff0580e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala index 298aee3..1c54671 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala @@ -206,7 +206,9 @@ case class CreateStructUnsafe(children: Seq[Expression]) extends Expression { override def nullable: Boolean = false - override def eval(input: InternalRow): Any = throw new UnsupportedOperationException + override def eval(input: InternalRow): Any = { +InternalRow(children.map(_.eval(input)): _*) + } override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { val eval = GenerateUnsafeProjection.createCode(ctx, children) @@ -244,7 +246,9 @@ case class CreateNamedStructUnsafe(children: Seq[Expression]) extends Expression override def nullable: Boolean = false - override def eval(input: InternalRow): Any = throw new UnsupportedOperationException + override def eval(input: InternalRow): Any = { +InternalRow(valExprs.map(_.eval(input)): _*) + } override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { val eval = GenerateUnsafeProjection.createCode(ctx, valExprs) http://git-wip-us.apache.org/repos/asf/spark/blob/1ff0580e/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala -- diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala index 77b9806..3f68b05 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala @@ -75,14 +75,16 @@ case class TungstenProject(projectList: Seq[NamedExpression], child: SparkPlan) override def output: Seq[Attribute] = projectList.map(_.toAttribute) + /** Rewrite the project list to use unsafe expressions as needed. */ + protected val unsafeProjectList = projectList.map(_ transform { +case CreateStruct(children) = CreateStructUnsafe(children) +case CreateNamedStruct(children) = CreateNamedStructUnsafe(children) + }) + protected override def doExecute(): RDD[InternalRow] = { val numRows = longMetric(numRows) child.execute().mapPartitions { iter = - this.transformAllExpressions { -case CreateStruct(children) = CreateStructUnsafe(children) -case
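The shape of the fix above, hoisting the rewrite out of `mapPartitions` and into a field computed at construction time, generalises beyond this operator. The simplified sketch below uses stand-in `Expr`/`Operator` types rather than Spark's actual classes; the only point it illustrates is where the rewrite runs, on every executor inside the closure versus once on the driver:

```scala
import org.apache.spark.rdd.RDD

// Stand-in expression type for illustration; not Spark's Expression hierarchy.
case class Expr(name: String)

class Operator(projectList: Seq[Expr]) extends Serializable {

  // Problematic shape: the rewrite runs inside the closure, i.e. on every executor for every
  // partition. Operator/expression constructors may need driver-only state (SparkConf, SQLConf),
  // and any ids minted there are no longer globally unique.
  def executeRewriteOnExecutors(child: RDD[Long]): RDD[Long] = {
    child.mapPartitions { iter =>
      val rewritten = projectList.map(e => e.copy(name = e.name + "_unsafe")) // executor side
      iter.map(_ + rewritten.size)
    }
  }

  // Fixed shape: the rewrite happens once, eagerly, when the operator is constructed on the
  // driver; the closure only captures the already-rewritten, serializable result.
  private val rewrittenProjectList: Seq[Expr] =
    projectList.map(e => e.copy(name = e.name + "_unsafe")) // driver side, done once

  def executeRewriteOnDriver(child: RDD[Long]): RDD[Long] = {
    child.mapPartitions { iter =>
      iter.map(_ + rewrittenProjectList.size)
    }
  }
}
```

In the actual patch the precomputed field is `unsafeProjectList`, as shown in the diff above.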
spark git commit: [SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors & fix UDFs on complex types
Repository: spark Updated Branches: refs/heads/branch-1.5 11c933505 - 3c33931aa [SPARK-10093] [SPARK-10096] [SQL] Avoid transformation on executors fix UDFs on complex types This is kind of a weird case, but given a sufficiently complex query plan (in this case a TungstenProject with an Exchange underneath), we could have NPEs on the executors due to the time when we were calling transformAllExpressions In general we should ensure that all transformations occur on the driver and not on the executors. Some reasons for avoid executor side transformations include: * (this case) Some operator constructors require state such as access to the Spark/SQL conf so doing a makeCopy on the executor can fail. * (unrelated reason for avoid executor transformations) ExprIds are calculated using an atomic integer, so you can violate their uniqueness constraint by constructing them anywhere other than the driver. This subsumes #8285. Author: Reynold Xin r...@databricks.com Author: Michael Armbrust mich...@databricks.com Closes #8295 from rxin/SPARK-10096. (cherry picked from commit 1ff0580eda90f9247a5233809667f5cebaea290e) Signed-off-by: Reynold Xin r...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3c33931a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3c33931a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3c33931a Branch: refs/heads/branch-1.5 Commit: 3c33931aa58db8ccc138e7f2e3c8ee94d25c7242 Parents: 11c9335 Author: Reynold Xin r...@databricks.com Authored: Tue Aug 18 22:08:15 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Tue Aug 18 22:08:22 2015 -0700 -- .../expressions/complexTypeCreator.scala| 8 +++- .../spark/sql/execution/basicOperators.scala| 12 ++--- .../spark/sql/DataFrameComplexTypeSuite.scala | 46 .../org/apache/spark/sql/DataFrameSuite.scala | 9 4 files changed, 68 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3c33931a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala index 298aee3..1c54671 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala @@ -206,7 +206,9 @@ case class CreateStructUnsafe(children: Seq[Expression]) extends Expression { override def nullable: Boolean = false - override def eval(input: InternalRow): Any = throw new UnsupportedOperationException + override def eval(input: InternalRow): Any = { +InternalRow(children.map(_.eval(input)): _*) + } override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { val eval = GenerateUnsafeProjection.createCode(ctx, children) @@ -244,7 +246,9 @@ case class CreateNamedStructUnsafe(children: Seq[Expression]) extends Expression override def nullable: Boolean = false - override def eval(input: InternalRow): Any = throw new UnsupportedOperationException + override def eval(input: InternalRow): Any = { +InternalRow(valExprs.map(_.eval(input)): _*) + } override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { val eval = GenerateUnsafeProjection.createCode(ctx, valExprs) 
http://git-wip-us.apache.org/repos/asf/spark/blob/3c33931a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala index 77b9806..3f68b05 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicOperators.scala @@ -75,14 +75,16 @@ case class TungstenProject(projectList: Seq[NamedExpression], child: SparkPlan) override def output: Seq[Attribute] = projectList.map(_.toAttribute) + /** Rewrite the project list to use unsafe expressions as needed. */ + protected val unsafeProjectList = projectList.map(_ transform { +case CreateStruct(children) = CreateStructUnsafe(children) +case CreateNamedStruct(children) = CreateNamedStructUnsafe(children) + }) + protected override def doExecute(): RDD[InternalRow] = { val numRows = longMetric(numRows) child.execute().mapPartitions { iter
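The second reason listed in the commit message, expression IDs minted from an atomic integer, can also be made concrete. The sketch below uses a simplified stand-in for the real ID generator, but the failure mode it shows is the general one: a per-JVM counter guarantees uniqueness only within a single JVM, so constructing expressions on several executors can hand out colliding IDs:

```scala
import java.util.concurrent.atomic.AtomicLong

// Simplified stand-in for an expression ID minted from a per-JVM counter.
case class ExprId(id: Long)

object ExprIdGenerator {
  private val curId = new AtomicLong(0L)
  def newExprId(): ExprId = ExprId(curId.getAndIncrement())
}

object ExprIdUniquenessSketch {
  def main(args: Array[String]): Unit = {
    // On the driver every call goes through the same counter, so IDs stay unique: 0, 1, 2, ...
    val driverIds = Seq.fill(3)(ExprIdGenerator.newExprId())
    println(s"driver ids: $driverIds")

    // Each executor JVM would have its own counter starting at 0 (simulated here with two
    // fresh counters). Both executors mint ExprId(0), so any later step that relies on ID
    // uniqueness can silently match the wrong attributes.
    val executor1 = new AtomicLong(0L)
    val executor2 = new AtomicLong(0L)
    println(s"executor 1 mints: ${ExprId(executor1.getAndIncrement())}")
    println(s"executor 2 mints: ${ExprId(executor2.getAndIncrement())}")
  }
}
```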
spark git commit: [SPARK-9508] GraphX Pregel docs update with new Pregel code
Repository: spark Updated Branches: refs/heads/branch-1.5 03a8a889a - 416392697 [SPARK-9508] GraphX Pregel docs update with new Pregel code SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be modified accordingly since it lists the old Pregel code Author: Alexander Ulanov na...@yandex.ru Closes #7831 from avulanov/SPARK-9508-pregel-doc2. (cherry picked from commit 1c843e284818004f16c3f1101c33b510f80722e3) Signed-off-by: Reynold Xin r...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/41639269 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/41639269 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/41639269 Branch: refs/heads/branch-1.5 Commit: 416392697f20de27b37db0cf0bad15a0e5ac9ebe Parents: 03a8a88 Author: Alexander Ulanov na...@yandex.ru Authored: Tue Aug 18 22:13:52 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Tue Aug 18 22:13:57 2015 -0700 -- docs/graphx-programming-guide.md | 18 -- 1 file changed, 8 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/41639269/docs/graphx-programming-guide.md -- diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md index 99f8c82..c861a763 100644 --- a/docs/graphx-programming-guide.md +++ b/docs/graphx-programming-guide.md @@ -768,16 +768,14 @@ class GraphOps[VD, ED] { // Loop until no messages remain or maxIterations is achieved var i = 0 while (activeMessages 0 i maxIterations) { - // Receive the messages: --- - // Run the vertex program on all vertices that receive messages - val newVerts = g.vertices.innerJoin(messages)(vprog).cache() - // Merge the new vertex values back into the graph - g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) = newOpt.getOrElse(old) }.cache() - // Send Messages: -- - // Vertices that didn't receive a message above don't appear in newVerts and therefore don't - // get to send messages. More precisely the map phase of mapReduceTriplets is only invoked - // on edges in the activeDir of vertices in newVerts - messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache() + // Receive the messages and update the vertices. + g = g.joinVertices(messages)(vprog).cache() + val oldMessages = messages + // Send new messages, skipping edges where neither side received a message. We must cache + // messages so it can be materialized on the next line, allowing us to uncache the previous + // iteration. + messages = g.mapReduceTriplets( +sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache() activeMessages = messages.count() i += 1 } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9508] GraphX Pregel docs update with new Pregel code
Repository: spark Updated Branches: refs/heads/master de3223872 - 1c843e284 [SPARK-9508] GraphX Pregel docs update with new Pregel code SPARK-9436 simplifies the Pregel code. graphx-programming-guide needs to be modified accordingly since it lists the old Pregel code Author: Alexander Ulanov na...@yandex.ru Closes #7831 from avulanov/SPARK-9508-pregel-doc2. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1c843e28 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1c843e28 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1c843e28 Branch: refs/heads/master Commit: 1c843e284818004f16c3f1101c33b510f80722e3 Parents: de32238 Author: Alexander Ulanov na...@yandex.ru Authored: Tue Aug 18 22:13:52 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Tue Aug 18 22:13:52 2015 -0700 -- docs/graphx-programming-guide.md | 18 -- 1 file changed, 8 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1c843e28/docs/graphx-programming-guide.md -- diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md index 99f8c82..c861a763 100644 --- a/docs/graphx-programming-guide.md +++ b/docs/graphx-programming-guide.md @@ -768,16 +768,14 @@ class GraphOps[VD, ED] { // Loop until no messages remain or maxIterations is achieved var i = 0 while (activeMessages 0 i maxIterations) { - // Receive the messages: --- - // Run the vertex program on all vertices that receive messages - val newVerts = g.vertices.innerJoin(messages)(vprog).cache() - // Merge the new vertex values back into the graph - g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) = newOpt.getOrElse(old) }.cache() - // Send Messages: -- - // Vertices that didn't receive a message above don't appear in newVerts and therefore don't - // get to send messages. More precisely the map phase of mapReduceTriplets is only invoked - // on edges in the activeDir of vertices in newVerts - messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache() + // Receive the messages and update the vertices. + g = g.joinVertices(messages)(vprog).cache() + val oldMessages = messages + // Send new messages, skipping edges where neither side received a message. We must cache + // messages so it can be materialized on the next line, allowing us to uncache the previous + // iteration. + messages = g.mapReduceTriplets( +sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache() activeMessages = messages.count() i += 1 } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
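The guide excerpt above shows Pregel's internal loop; for the caller's view, the single-source shortest-paths example from the same GraphX guide is a compact usage sketch (the random log-normal graph and source vertex 42 are arbitrary illustration choices):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

object PregelShortestPaths {
  def run(sc: SparkContext): Unit = {
    // An arbitrary example graph whose edge attributes are reinterpreted as distances.
    val graph: Graph[Long, Double] =
      GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
    val sourceId: VertexId = 42L

    // Initialise distances: 0 at the source, infinity everywhere else.
    val initialGraph = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val sssp = initialGraph.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),   // vertex program
      triplet => {                                      // send messages along improving edges
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        } else {
          Iterator.empty
        }
      },
      (a, b) => math.min(a, b)                          // merge messages
    )

    println(sssp.vertices.collect().mkString("\n"))
  }
}
```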
spark git commit: [SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs
Repository: spark Updated Branches: refs/heads/branch-1.5 416392697 - 3ceee5572 [SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine): ```scala val numItems = 10 val s = Seq.fill(numItems)(1) for (i - 0 until numItems) s(i) ``` It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput. This patch fixes this by replacing `Seq` with `Array`. Author: Josh Rosen joshro...@databricks.com Closes #8178 from JoshRosen/dagscheduler-perf. (cherry picked from commit 010b03ed52f35fd4d426d522f8a9927ddc579209) Signed-off-by: Reynold Xin r...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3ceee557 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3ceee557 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3ceee557 Branch: refs/heads/branch-1.5 Commit: 3ceee5572fc7be690d3009f4d43af9e4611c0fa1 Parents: 4163926 Author: Josh Rosen joshro...@databricks.com Authored: Tue Aug 18 22:30:13 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Tue Aug 18 22:30:20 2015 -0700 -- .../apache/spark/scheduler/DAGScheduler.scala | 22 ++-- .../spark/storage/BlockManagerMaster.scala | 5 +++-- .../storage/BlockManagerMasterEndpoint.scala| 3 ++- .../spark/scheduler/DAGSchedulerSuite.scala | 4 ++-- 4 files changed, 18 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3ceee557/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala index dadf83a..684db66 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala @@ -111,7 +111,7 @@ class DAGScheduler( * * All accesses to this map should be guarded by synchronizing on it (see SPARK-4454). */ - private val cacheLocs = new HashMap[Int, Seq[Seq[TaskLocation]]] + private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]] // For tracking failed nodes, we use the MapOutputTracker's epoch number, which is sent with // every task. When we detect a node failing, we note the current epoch number and failed @@ -205,12 +205,12 @@ class DAGScheduler( } private[scheduler] - def getCacheLocs(rdd: RDD[_]): Seq[Seq[TaskLocation]] = cacheLocs.synchronized { + def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized { // Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times if (!cacheLocs.contains(rdd.id)) { // Note: if the storage level is NONE, we don't need to get locations from block manager. 
- val locs: Seq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) { -Seq.fill(rdd.partitions.size)(Nil) + val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) { +IndexedSeq.fill(rdd.partitions.length)(Nil) } else { val blockIds = rdd.partitions.indices.map(index = RDDBlockId(rdd.id, index)).toArray[BlockId] @@ -302,12 +302,12 @@ class DAGScheduler( shuffleDep: ShuffleDependency[_, _, _], firstJobId: Int): ShuffleMapStage = { val rdd = shuffleDep.rdd -val numTasks = rdd.partitions.size +val numTasks = rdd.partitions.length val stage = newShuffleMapStage(rdd, numTasks, shuffleDep, firstJobId, rdd.creationSite) if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) { val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId) val locs = MapOutputTracker.deserializeMapStatuses(serLocs) - for (i - 0 until locs.size) { + for (i - 0 until locs.length) { stage.outputLocs(i) = Option(locs(i)).toList // locs(i) will be null if missing } stage.numAvailableOutputs = locs.count(_ != null) @@ -315,7 +315,7 @@ class DAGScheduler( // Kind of ugly: need to register RDDs with the cache and map output
spark git commit: [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide
Repository: spark Updated Branches: refs/heads/master f4fa61eff - 747c2ba80 [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide Add Python example for mllib LDAModel user guide Author: Yanbo Liang yblia...@gmail.com Closes #8227 from yanboliang/spark-10032. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/747c2ba8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/747c2ba8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/747c2ba8 Branch: refs/heads/master Commit: 747c2ba8006d5b86f3be8dfa9ace639042a35628 Parents: f4fa61e Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 18 12:56:36 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:56:36 2015 -0700 -- docs/mllib-clustering.md | 28 1 file changed, 28 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/747c2ba8/docs/mllib-clustering.md -- diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index bb875ae..fd9ab25 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -564,6 +564,34 @@ public class JavaLDAExample { {% endhighlight %} /div +div data-lang=python markdown=1 +{% highlight python %} +from pyspark.mllib.clustering import LDA, LDAModel +from pyspark.mllib.linalg import Vectors + +# Load and parse the data +data = sc.textFile(data/mllib/sample_lda_data.txt) +parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')])) +# Index documents with unique IDs +corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache() + +# Cluster the documents into three topics using LDA +ldaModel = LDA.train(corpus, k=3) + +# Output topics. Each is a distribution over words (matching word count vectors) +print(Learned topics (as distributions over vocab of + str(ldaModel.vocabSize()) + words):) +topics = ldaModel.topicsMatrix() +for topic in range(3): +print(Topic + str(topic) + :) +for word in range(0, ldaModel.vocabSize()): +print( + str(topics[word][topic])) + +# Save and load model +model.save(sc, myModelPath) +sameModel = LDAModel.load(sc, myModelPath) +{% endhighlight %} +/div + /div ## Streaming k-means - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide
Repository: spark Updated Branches: refs/heads/branch-1.5 80debff12 - ec7079f9c [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide Add Python example for mllib LDAModel user guide Author: Yanbo Liang yblia...@gmail.com Closes #8227 from yanboliang/spark-10032. (cherry picked from commit 747c2ba8006d5b86f3be8dfa9ace639042a35628) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ec7079f9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ec7079f9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ec7079f9 Branch: refs/heads/branch-1.5 Commit: ec7079f9c94cb98efdac6f92b7c85efb0e67492e Parents: 80debff Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 18 12:56:36 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:56:43 2015 -0700 -- docs/mllib-clustering.md | 28 1 file changed, 28 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ec7079f9/docs/mllib-clustering.md -- diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index bb875ae..fd9ab25 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -564,6 +564,34 @@ public class JavaLDAExample { {% endhighlight %} /div +div data-lang=python markdown=1 +{% highlight python %} +from pyspark.mllib.clustering import LDA, LDAModel +from pyspark.mllib.linalg import Vectors + +# Load and parse the data +data = sc.textFile(data/mllib/sample_lda_data.txt) +parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')])) +# Index documents with unique IDs +corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache() + +# Cluster the documents into three topics using LDA +ldaModel = LDA.train(corpus, k=3) + +# Output topics. Each is a distribution over words (matching word count vectors) +print(Learned topics (as distributions over vocab of + str(ldaModel.vocabSize()) + words):) +topics = ldaModel.topicsMatrix() +for topic in range(3): +print(Topic + str(topic) + :) +for word in range(0, ldaModel.vocabSize()): +print( + str(topics[word][topic])) + +# Save and load model +model.save(sc, myModelPath) +sameModel = LDAModel.load(sc, myModelPath) +{% endhighlight %} +/div + /div ## Streaming k-means - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
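For readers comparing tabs, the Scala side of the same guide section follows the same flow; the sketch below is a condensed counterpart of the new Python example (intended for spark-shell, where `sc` is predefined, with the same illustrative data path and k=3):

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data (one word-count vector per line).
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))

// Index documents with unique IDs.
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

// Cluster the documents into three topics using LDA.
val ldaModel = new LDA().setK(3).run(corpus)

// Print each topic as a distribution over the vocabulary.
println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize + " words):")
val topics = ldaModel.topicsMatrix
for (topic <- Range(0, 3)) {
  print("Topic " + topic + ":")
  for (word <- Range(0, ldaModel.vocabSize)) {
    print(" " + topics(word, topic))
  }
  println()
}
```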
spark git commit: [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide
Repository: spark Updated Branches: refs/heads/branch-1.5 7ff0e5d2f - 80debff12 [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide Add Python examples for mllib IsotonicRegression user guide Author: Yanbo Liang yblia...@gmail.com Closes #8225 from yanboliang/spark-10029. (cherry picked from commit f4fa61effe34dae2f0eab0bef57b2dee220cf92f) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80debff1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80debff1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80debff1 Branch: refs/heads/branch-1.5 Commit: 80debff123e0b5dcc4e6f5899753a736de2c8e75 Parents: 7ff0e5d Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 18 12:55:36 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:55:42 2015 -0700 -- docs/mllib-isotonic-regression.md | 35 ++ 1 file changed, 35 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/80debff1/docs/mllib-isotonic-regression.md -- diff --git a/docs/mllib-isotonic-regression.md b/docs/mllib-isotonic-regression.md index 5732bc4..6aa881f 100644 --- a/docs/mllib-isotonic-regression.md +++ b/docs/mllib-isotonic-regression.md @@ -160,4 +160,39 @@ model.save(sc.sc(), myModelPath); IsotonicRegressionModel sameModel = IsotonicRegressionModel.load(sc.sc(), myModelPath); {% endhighlight %} /div + +div data-lang=python markdown=1 +Data are read from a file where each line has a format label,feature +i.e. 4710.28,500.00. The data are split to training and testing set. +Model is created using the training set and a mean squared error is calculated from the predicted +labels and real labels in the test set. + +{% highlight python %} +import math +from pyspark.mllib.regression import IsotonicRegression, IsotonicRegressionModel + +data = sc.textFile(data/mllib/sample_isotonic_regression_data.txt) + +# Create label, feature, weight tuples from input data with weight set to default value 1.0. +parsedData = data.map(lambda line: tuple([float(x) for x in line.split(',')]) + (1.0,)) + +# Split data into training (60%) and test (40%) sets. +training, test = parsedData.randomSplit([0.6, 0.4], 11) + +# Create isotonic regression model from training data. +# Isotonic parameter defaults to true so it is only shown for demonstration +model = IsotonicRegression.train(training) + +# Create tuples of predicted and real labels. +predictionAndLabel = test.map(lambda p: (model.predict(p[1]), p[0])) + +# Calculate mean squared error between predicted and real labels. +meanSquaredError = predictionAndLabel.map(lambda pl: math.pow((pl[0] - pl[1]), 2)).mean() +print(Mean Squared Error = + str(meanSquaredError)) + +# Save and load model +model.save(sc, myModelPath) +sameModel = IsotonicRegressionModel.load(sc, myModelPath) +{% endhighlight %} +/div /div - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
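The workflow described above (parse "label,feature" lines, split 60/40, fit, score by mean squared error) reads the same way in Scala; the sketch below mirrors the new Python example using the existing Scala API (meant for spark-shell, where `sc` is predefined, with a placeholder save path):

```scala
import org.apache.spark.mllib.regression.{IsotonicRegression, IsotonicRegressionModel}

// Each input line has the form "label,feature"; the weight defaults to 1.0.
val data = sc.textFile("data/mllib/sample_isotonic_regression_data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  (parts(0), parts(1), 1.0)
}

// Split data into training (60%) and test (40%) sets.
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val (training, test) = (splits(0), splits(1))

// Fit an isotonic (increasing) regression model on the training set.
val model = new IsotonicRegression().setIsotonic(true).run(training)

// Pair predicted and real labels, then compute the mean squared error.
val predictionAndLabel = test.map { case (label, feature, _) => (model.predict(feature), label) }
val meanSquaredError = predictionAndLabel.map { case (p, l) => math.pow(p - l, 2) }.mean()
println("Mean Squared Error = " + meanSquaredError)

// Save and load the model (placeholder path).
model.save(sc, "myModelPath")
val sameModel = IsotonicRegressionModel.load(sc, "myModelPath")
```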
spark git commit: [SPARK-9574] [STREAMING] Remove unnecessary contents of spark-streaming-XXX-assembly jars
Repository: spark Updated Branches: refs/heads/branch-1.5 9bd2e6f7c - 2bccd918f [SPARK-9574] [STREAMING] Remove unnecessary contents of spark-streaming-XXX-assembly jars Removed contents already included in Spark assembly jar from spark-streaming-XXX-assembly jars. Author: zsxwing zsxw...@gmail.com Closes #8069 from zsxwing/SPARK-9574. (cherry picked from commit bf1d6614dcb8f5974e62e406d9c0f8aac52556d3) Signed-off-by: Tathagata Das tathagata.das1...@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2bccd918 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2bccd918 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2bccd918 Branch: refs/heads/branch-1.5 Commit: 2bccd918fcd278f0b544e61b9675ecdf2d6974b3 Parents: 9bd2e6f Author: zsxwing zsxw...@gmail.com Authored: Tue Aug 18 13:35:45 2015 -0700 Committer: Tathagata Das tathagata.das1...@gmail.com Committed: Tue Aug 18 13:36:25 2015 -0700 -- external/flume-assembly/pom.xml | 11 + external/kafka-assembly/pom.xml | 84 external/mqtt-assembly/pom.xml | 74 extras/kinesis-asl-assembly/pom.xml | 79 ++ pom.xml | 2 +- 5 files changed, 249 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2bccd918/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index 1318959..e05e431 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -69,6 +69,11 @@ scopeprovided/scope /dependency dependency + groupIdcommons-lang/groupId + artifactIdcommons-lang/artifactId + scopeprovided/scope +/dependency +dependency groupIdcommons-net/groupId artifactIdcommons-net/artifactId scopeprovided/scope @@ -89,6 +94,12 @@ scopeprovided/scope /dependency dependency + groupIdorg.apache.avro/groupId + artifactIdavro-mapred/artifactId + classifier${avro.mapred.classifier}/classifier + scopeprovided/scope +/dependency +dependency groupIdorg.scala-lang/groupId artifactIdscala-library/artifactId scopeprovided/scope http://git-wip-us.apache.org/repos/asf/spark/blob/2bccd918/external/kafka-assembly/pom.xml -- diff --git a/external/kafka-assembly/pom.xml b/external/kafka-assembly/pom.xml index 977514f..36342f3 100644 --- a/external/kafka-assembly/pom.xml +++ b/external/kafka-assembly/pom.xml @@ -47,6 +47,90 @@ version${project.version}/version scopeprovided/scope /dependency +!-- + Demote already included in the Spark assembly. 
+-- +dependency + groupIdcommons-codec/groupId + artifactIdcommons-codec/artifactId + scopeprovided/scope +/dependency +dependency + groupIdcommons-lang/groupId + artifactIdcommons-lang/artifactId + scopeprovided/scope +/dependency +dependency + groupIdcom.google.protobuf/groupId + artifactIdprotobuf-java/artifactId + scopeprovided/scope +/dependency +dependency + groupIdcom.sun.jersey/groupId + artifactIdjersey-server/artifactId + scopeprovided/scope +/dependency +dependency + groupIdcom.sun.jersey/groupId + artifactIdjersey-core/artifactId + scopeprovided/scope +/dependency +dependency + groupIdnet.jpountz.lz4/groupId + artifactIdlz4/artifactId + scopeprovided/scope +/dependency +dependency + groupIdorg.apache.hadoop/groupId + artifactIdhadoop-client/artifactId + scopeprovided/scope +/dependency +dependency + groupIdorg.apache.avro/groupId + artifactIdavro-mapred/artifactId + classifier${avro.mapred.classifier}/classifier + scopeprovided/scope +/dependency +dependency + groupIdorg.apache.curator/groupId + artifactIdcurator-recipes/artifactId + scopeprovided/scope +/dependency +dependency + groupIdorg.apache.zookeeper/groupId + artifactIdzookeeper/artifactId + scopeprovided/scope +/dependency +dependency + groupIdlog4j/groupId + artifactIdlog4j/artifactId + scopeprovided/scope +/dependency +dependency + groupIdnet.java.dev.jets3t/groupId + artifactIdjets3t/artifactId + scopeprovided/scope +/dependency +dependency + groupIdorg.scala-lang/groupId + artifactIdscala-library/artifactId + scopeprovided/scope +/dependency +dependency + groupIdorg.slf4j/groupId +
spark git commit: [SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple parameters functions
Repository: spark Updated Branches: refs/heads/master 5723d26d7 - 1968276af [SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple parameters functions ### JIRA [[SPARK-10007] Update `NAMESPACE` file in SparkR for simple parameters functions - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10007) Author: Yuu ISHIKAWA yuu.ishik...@gmail.com Closes #8277 from yu-iskw/SPARK-10007. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1968276a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1968276a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1968276a Branch: refs/heads/master Commit: 1968276af0f681fe51328b7dd795bd21724a5441 Parents: 5723d26 Author: Yuu ISHIKAWA yuu.ishik...@gmail.com Authored: Tue Aug 18 09:10:59 2015 -0700 Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu Committed: Tue Aug 18 09:10:59 2015 -0700 -- R/pkg/NAMESPACE | 50 +++--- 1 file changed, 47 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1968276a/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index fd9dfdf..607aef2 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -87,48 +87,86 @@ exportMethods(abs, alias, approxCountDistinct, asc, + ascii, asin, atan, atan2, avg, + base64, between, + bin, + bitwiseNOT, cast, cbrt, + ceil, ceiling, + concat, contains, cos, cosh, - concat, + count, countDistinct, + crc32, + datediff, + dayofmonth, + dayofyear, desc, endsWith, exp, + explode, expm1, + factorial, + first, floor, getField, getItem, greatest, + hex, + hour, hypot, + initcap, + isNaN, isNotNull, isNull, - lit, last, + last_day, least, + length, + levenshtein, like, + lit, log, log10, log1p, + log2, lower, + ltrim, max, + md5, mean, min, + minute, + month, + months_between, n, n_distinct, + nanvl, + negate, + pmod, + quarter, + reverse, rint, rlike, + round, + rtrim, + second, + sha1, sign, + signum, sin, sinh, + size, + soundex, sqrt, startsWith, substr, @@ -138,7 +176,13 @@ exportMethods(abs, tanh, toDegrees, toRadians, - upper) + to_date, + trim, + unbase64, + unhex, + upper, + weekofyear, + year) exportClasses(GroupedData) exportMethods(agg) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple parameters functions
Repository: spark Updated Branches: refs/heads/branch-1.5 a512250cd - 20a760a00 [SPARK-10007] [SPARKR] Update `NAMESPACE` file in SparkR for simple parameters functions ### JIRA [[SPARK-10007] Update `NAMESPACE` file in SparkR for simple parameters functions - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10007) Author: Yuu ISHIKAWA yuu.ishik...@gmail.com Closes #8277 from yu-iskw/SPARK-10007. (cherry picked from commit 1968276af0f681fe51328b7dd795bd21724a5441) Signed-off-by: Shivaram Venkataraman shiva...@cs.berkeley.edu Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/20a760a0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/20a760a0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/20a760a0 Branch: refs/heads/branch-1.5 Commit: 20a760a00ae188a68b877f052842834e8b7570e6 Parents: a512250 Author: Yuu ISHIKAWA yuu.ishik...@gmail.com Authored: Tue Aug 18 09:10:59 2015 -0700 Committer: Shivaram Venkataraman shiva...@cs.berkeley.edu Committed: Tue Aug 18 09:11:22 2015 -0700 -- R/pkg/NAMESPACE | 50 +++--- 1 file changed, 47 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/20a760a0/R/pkg/NAMESPACE -- diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index fd9dfdf..607aef2 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -87,48 +87,86 @@ exportMethods(abs, alias, approxCountDistinct, asc, + ascii, asin, atan, atan2, avg, + base64, between, + bin, + bitwiseNOT, cast, cbrt, + ceil, ceiling, + concat, contains, cos, cosh, - concat, + count, countDistinct, + crc32, + datediff, + dayofmonth, + dayofyear, desc, endsWith, exp, + explode, expm1, + factorial, + first, floor, getField, getItem, greatest, + hex, + hour, hypot, + initcap, + isNaN, isNotNull, isNull, - lit, last, + last_day, least, + length, + levenshtein, like, + lit, log, log10, log1p, + log2, lower, + ltrim, max, + md5, mean, min, + minute, + month, + months_between, n, n_distinct, + nanvl, + negate, + pmod, + quarter, + reverse, rint, rlike, + round, + rtrim, + second, + sha1, sign, + signum, sin, sinh, + size, + soundex, sqrt, startsWith, substr, @@ -138,7 +176,13 @@ exportMethods(abs, tanh, toDegrees, toRadians, - upper) + to_date, + trim, + unbase64, + unhex, + upper, + weekofyear, + year) exportClasses(GroupedData) exportMethods(agg) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org