Git Push Summary
Repository: spark
Updated Tags:  refs/tags/v1.5.0-snapshot-20150803 [deleted] 35264204b
spark git commit: [SPARK-9263] Added flags to exclude dependencies when using --packages
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 73c863ac8 -> 34335719a

[SPARK-9263] Added flags to exclude dependencies when using --packages

While the functionality is there to exclude packages, there are no flags that allow users to exclude dependencies in case of dependency conflicts. We should provide users with a flag to add dependency exclusions in case the packages are not resolved properly (or are not available due to licensing). The flag I added was --packages-exclude, but I'm open to renaming it. I also added property flags in case people would like to use a conf file to provide dependencies, which is useful when there is a long list of dependencies or exclusions.

cc andrewor14 vanzin pwendell

Author: Burak Yavuz <brk...@gmail.com>

Closes #7599 from brkyvz/packages-exclusions and squashes the following commits:

636f410 [Burak Yavuz] addressed nits
6e54ede [Burak Yavuz] is this the culprit
b5e508e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into packages-exclusions
154f5db [Burak Yavuz] addressed initial comments
1536d7a [Burak Yavuz] Added flags to exclude packages using --packages-exclude

(cherry picked from commit 1633d0a2612d94151f620c919425026150e69ae1)
Signed-off-by: Marcelo Vanzin <van...@cloudera.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/34335719
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/34335719
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/34335719

Branch: refs/heads/branch-1.5
Commit: 34335719a372c1951fdb4dd25b75b086faf1076f
Parents: 73c863a
Author: Burak Yavuz <brk...@gmail.com>
Authored: Mon Aug 3 17:42:03 2015 -0700
Committer: Marcelo Vanzin <van...@cloudera.com>
Committed: Mon Aug 3 17:42:35 2015 -0700

----------------------------------------------------------------------
 .../org/apache/spark/deploy/SparkSubmit.scala   | 29 +--
 .../spark/deploy/SparkSubmitArguments.scala     | 11 +++
 .../spark/deploy/SparkSubmitUtilsSuite.scala    | 30
 .../spark/launcher/SparkSubmitOptionParser.java |  2 ++
 4 files changed, 57 insertions(+), 15 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/34335719/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
index 0b39ee8..31185c8 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@@ -24,6 +24,7 @@ import java.security.PrivilegedExceptionAction

 import scala.collection.mutable.{ArrayBuffer, HashMap, Map}

+import org.apache.commons.lang3.StringUtils
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.security.UserGroupInformation
 import org.apache.ivy.Ivy
@@ -37,6 +38,7 @@ import org.apache.ivy.core.settings.IvySettings
 import org.apache.ivy.plugins.matcher.GlobPatternMatcher
 import org.apache.ivy.plugins.repository.file.FileRepository
 import org.apache.ivy.plugins.resolver.{FileSystemResolver, ChainResolver, IBiblioResolver}
+
 import org.apache.spark.api.r.RUtils
 import org.apache.spark.SPARK_VERSION
 import org.apache.spark.deploy.rest._
@@ -275,21 +277,18 @@ object SparkSubmit {
     // Resolve maven dependencies if there are any and add classpath to jars. Add them to py-files
     // too for packages that include Python code
-    val resolvedMavenCoordinates =
-      SparkSubmitUtils.resolveMavenCoordinates(
-        args.packages, Option(args.repositories), Option(args.ivyRepoPath))
-    if (!resolvedMavenCoordinates.trim.isEmpty) {
-      if (args.jars == null || args.jars.trim.isEmpty) {
-        args.jars = resolvedMavenCoordinates
+    val exclusions: Seq[String] =
+      if (!StringUtils.isBlank(args.packagesExclusions)) {
+        args.packagesExclusions.split(",")
       } else {
-        args.jars += s",$resolvedMavenCoordinates"
+        Nil
       }
+    val resolvedMavenCoordinates = SparkSubmitUtils.resolveMavenCoordinates(args.packages,
+      Some(args.repositories), Some(args.ivyRepoPath), exclusions = exclusions)
+    if (!StringUtils.isBlank(resolvedMavenCoordinates)) {
+      args.jars = mergeFileLists(args.jars, resolvedMavenCoordinates)
       if (args.isPython) {
-        if (args.pyFiles == null || args.pyFiles.trim.isEmpty) {
-          args.pyFiles = resolvedMavenCoordinates
-        } else {
-          args.pyFiles += s",$resolvedMavenCoordinates"
-        }
+        args.pyFiles = mergeFileLists(args.pyFiles, resolvedMavenCoordinates)
       }
     }

@@ -736,7 +735,7 @@ object SparkSubmit {
    * no files, into a single comma-separated string.
    */
   private def mergeFileLists(lists: String*): String = {
-    val merged =
spark git commit: [SPARK-9263] Added flags to exclude dependencies when using --packages
Repository: spark
Updated Branches:
  refs/heads/master b79b4f5f2 -> 1633d0a26

[SPARK-9263] Added flags to exclude dependencies when using --packages

While the functionality is there to exclude packages, there are no flags that allow users to exclude dependencies in case of dependency conflicts. We should provide users with a flag to add dependency exclusions in case the packages are not resolved properly (or are not available due to licensing). The flag I added was --packages-exclude, but I'm open to renaming it. I also added property flags in case people would like to use a conf file to provide dependencies, which is useful when there is a long list of dependencies or exclusions.

cc andrewor14 vanzin pwendell

Author: Burak Yavuz <brk...@gmail.com>

Closes #7599 from brkyvz/packages-exclusions and squashes the following commits:

636f410 [Burak Yavuz] addressed nits
6e54ede [Burak Yavuz] is this the culprit
b5e508e [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into packages-exclusions
154f5db [Burak Yavuz] addressed initial comments
1536d7a [Burak Yavuz] Added flags to exclude packages using --packages-exclude

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1633d0a2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1633d0a2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1633d0a2

Branch: refs/heads/master
Commit: 1633d0a2612d94151f620c919425026150e69ae1
Parents: b79b4f5
Author: Burak Yavuz <brk...@gmail.com>
Authored: Mon Aug 3 17:42:03 2015 -0700
Committer: Marcelo Vanzin <van...@cloudera.com>
Committed: Mon Aug 3 17:42:03 2015 -0700

----------------------------------------------------------------------
 .../org/apache/spark/deploy/SparkSubmit.scala   | 29 +--
 .../spark/deploy/SparkSubmitArguments.scala     | 11 +++
 .../spark/deploy/SparkSubmitUtilsSuite.scala    | 30
 .../spark/launcher/SparkSubmitOptionParser.java |  2 ++
 4 files changed, 57 insertions(+), 15 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/1633d0a2/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
index 0b39ee8..31185c8 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@@ -24,6 +24,7 @@ import java.security.PrivilegedExceptionAction

 import scala.collection.mutable.{ArrayBuffer, HashMap, Map}

+import org.apache.commons.lang3.StringUtils
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.security.UserGroupInformation
 import org.apache.ivy.Ivy
@@ -37,6 +38,7 @@ import org.apache.ivy.core.settings.IvySettings
 import org.apache.ivy.plugins.matcher.GlobPatternMatcher
 import org.apache.ivy.plugins.repository.file.FileRepository
 import org.apache.ivy.plugins.resolver.{FileSystemResolver, ChainResolver, IBiblioResolver}
+
 import org.apache.spark.api.r.RUtils
 import org.apache.spark.SPARK_VERSION
 import org.apache.spark.deploy.rest._
@@ -275,21 +277,18 @@ object SparkSubmit {
     // Resolve maven dependencies if there are any and add classpath to jars. Add them to py-files
     // too for packages that include Python code
-    val resolvedMavenCoordinates =
-      SparkSubmitUtils.resolveMavenCoordinates(
-        args.packages, Option(args.repositories), Option(args.ivyRepoPath))
-    if (!resolvedMavenCoordinates.trim.isEmpty) {
-      if (args.jars == null || args.jars.trim.isEmpty) {
-        args.jars = resolvedMavenCoordinates
+    val exclusions: Seq[String] =
+      if (!StringUtils.isBlank(args.packagesExclusions)) {
+        args.packagesExclusions.split(",")
       } else {
-        args.jars += s",$resolvedMavenCoordinates"
+        Nil
       }
+    val resolvedMavenCoordinates = SparkSubmitUtils.resolveMavenCoordinates(args.packages,
+      Some(args.repositories), Some(args.ivyRepoPath), exclusions = exclusions)
+    if (!StringUtils.isBlank(resolvedMavenCoordinates)) {
+      args.jars = mergeFileLists(args.jars, resolvedMavenCoordinates)
       if (args.isPython) {
-        if (args.pyFiles == null || args.pyFiles.trim.isEmpty) {
-          args.pyFiles = resolvedMavenCoordinates
-        } else {
-          args.pyFiles += s",$resolvedMavenCoordinates"
-        }
+        args.pyFiles = mergeFileLists(args.pyFiles, resolvedMavenCoordinates)
       }
     }

@@ -736,7 +735,7 @@ object SparkSubmit {
    * no files, into a single comma-separated string.
    */
   private def mergeFileLists(lists: String*): String = {
-    val merged = lists.filter(_ != null)
+    val merged = lists.filterNot(StringUtils.isBlank)
       .flatMap(_.split(","))
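For readers skimming the diff: the new behavior boils down to splitting the comma-separated exclusion list and threading it through to the Ivy resolver, with the jar/py-files merging factored into mergeFileLists. A minimal standalone sketch follows; parseExclusions is a hypothetical helper name, and since the digest truncates the tail of mergeFileLists, joining the pieces with "," is an assumed completion rather than a quote of the patch.

    import org.apache.commons.lang3.StringUtils

    object PackagesExclusionSketch {
      // Split a comma-separated exclusion list such as the value of the new
      // --packages-exclude flag; blank input means no exclusions.
      def parseExclusions(packagesExclusions: String): Seq[String] =
        if (!StringUtils.isBlank(packagesExclusions)) {
          packagesExclusions.split(",").toSeq
        } else {
          Nil
        }

      // Mirrors the mergeFileLists helper in the diff: merge possibly-null
      // comma-separated lists into one comma-separated string, dropping blanks.
      // (Joining with "," is an assumption; the digest cuts the method off.)
      def mergeFileLists(lists: String*): String =
        lists.filterNot(StringUtils.isBlank)
          .flatMap(_.split(","))
          .mkString(",")

      def main(args: Array[String]): Unit = {
        println(parseExclusions("org.slf4j:slf4j-api,joda-time:joda-time"))
        println(mergeFileLists(null, "a.jar", "", "b.jar,c.jar")) // a.jar,b.jar,c.jar
      }
    }

At the command line this corresponds to something like `spark-submit --packages com.databricks:spark-csv_2.10:1.2.0 --packages-exclude org.slf4j:slf4j-api ...` (coordinates illustrative; --packages-exclude is the flag name as added in this patch, before any renaming).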
[2/3] spark git commit: [SPARK-8064] [SQL] Build against Hive 1.2.1
http://git-wip-us.apache.org/repos/asf/spark/blob/6bd12e81/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
----------------------------------------------------------------------
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
index 110f51a..567d7fa 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
@@ -20,15 +20,18 @@ package org.apache.spark.sql.hive

 import java.io.File
 import java.net.{URL, URLClassLoader}
 import java.sql.Timestamp
+import java.util.concurrent.TimeUnit

 import scala.collection.JavaConversions._
 import scala.collection.mutable.HashMap
 import scala.language.implicitConversions
+import scala.concurrent.duration._

 import org.apache.hadoop.fs.{FileSystem, Path}
 import org.apache.hadoop.hive.common.StatsSetupConst
 import org.apache.hadoop.hive.common.`type`.HiveDecimal
 import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hadoop.hive.conf.HiveConf.ConfVars
 import org.apache.hadoop.hive.ql.metadata.Table
 import org.apache.hadoop.hive.ql.parse.VariableSubstitution
 import org.apache.hadoop.hive.ql.session.SessionState
@@ -165,6 +168,16 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) with Logging {
   SessionState.setCurrentSessionState(executionHive.state)

   /**
+   * Overrides default Hive configurations to avoid breaking changes to Spark SQL users.
+   *  - allow SQL11 keywords to be used as identifiers
+   */
+  private[sql] def defaultOverides() = {
+    setConf(ConfVars.HIVE_SUPPORT_SQL11_RESERVED_KEYWORDS.varname, "false")
+  }
+
+  defaultOverides()
+
+  /**
    * The copy of the Hive client that is used to retrieve metadata from the Hive MetaStore.
    * The version of the Hive client that is used here must match the metastore that is configured
    * in the hive-site.xml file.
@@ -252,6 +265,10 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) with Logging {
   }

   protected[sql] override def parseSql(sql: String): LogicalPlan = {
+    var state = SessionState.get()
+    if (state == null) {
+      SessionState.setCurrentSessionState(tlSession.get().asInstanceOf[SQLSession].sessionState)
+    }
     super.parseSql(substitutor.substitute(hiveconf, sql))
   }
@@ -298,10 +315,21 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) with Logging {
     // Can we use fs.getContentSummary in future?
     // Seems fs.getContentSummary returns wrong table size on Jenkins. So we use
     // countFileSize to count the table size.
+    val stagingDir = metadataHive.getConf(HiveConf.ConfVars.STAGINGDIR.varname,
+      HiveConf.ConfVars.STAGINGDIR.defaultStrVal)
+
     def calculateTableSize(fs: FileSystem, path: Path): Long = {
       val fileStatus = fs.getFileStatus(path)
       val size = if (fileStatus.isDir) {
-        fs.listStatus(path).map(status => calculateTableSize(fs, status.getPath)).sum
+        fs.listStatus(path)
+          .map { status =>
+            if (!status.getPath().getName().startsWith(stagingDir)) {
+              calculateTableSize(fs, status.getPath)
+            } else {
+              0L
+            }
+          }
+          .sum
       } else {
         fileStatus.getLen
       }
@@ -398,7 +426,58 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) with Logging {
   }

   /** Overridden by child classes that need to set configuration before the client init. */
-  protected def configure(): Map[String, String] = Map.empty
+  protected def configure(): Map[String, String] = {
+    // Hive 0.14.0 introduces timeout operations in HiveConf, and changes default values of a bunch
+    // of time `ConfVar`s by adding time suffixes (`s`, `ms`, and `d` etc.). This breaks backwards-
+    // compatibility when users are trying to connect to a Hive metastore of lower version,
+    // because these options are expected to be integral values in lower versions of Hive.
+    //
+    // Here we enumerate all time `ConfVar`s and convert their values to numeric strings according
+    // to their output time units.
+    Seq(
+      ConfVars.METASTORE_CLIENT_CONNECT_RETRY_DELAY -> TimeUnit.SECONDS,
+      ConfVars.METASTORE_CLIENT_SOCKET_TIMEOUT -> TimeUnit.SECONDS,
+      ConfVars.METASTORE_CLIENT_SOCKET_LIFETIME -> TimeUnit.SECONDS,
+      ConfVars.HMSHANDLERINTERVAL -> TimeUnit.MILLISECONDS,
+      ConfVars.METASTORE_EVENT_DB_LISTENER_TTL -> TimeUnit.SECONDS,
+      ConfVars.METASTORE_EVENT_CLEAN_FREQ -> TimeUnit.SECONDS,
+      ConfVars.METASTORE_EVENT_EXPIRY_DURATION -> TimeUnit.SECONDS,
+      ConfVars.METASTORE_AGGREGATE_STATS_CACHE_TTL -> TimeUnit.SECONDS,
+      ConfVars.METASTORE_AGGREGATE_STATS_CACHE_MAX_WRITER_WAIT -> TimeUnit.MILLISECONDS,
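The heart of this hunk is the time-unit normalization described in the inline comment: newer Hive renders time options with suffixes ("600s"), which older metastores cannot parse. A minimal sketch of the idea, assuming a HiveConf instance in scope and using only a subset of the ConfVars pairs enumerated by the patch:

    import java.util.concurrent.TimeUnit

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.conf.HiveConf.ConfVars

    object TimeVarSketch {
      // Render each time-typed ConfVar as a plain numeric string in its output
      // unit so that older metastores, which expect bare integers rather than
      // values like "600s", can still parse the configuration.
      def timeVarsAsNumericStrings(hiveconf: HiveConf): Map[String, String] =
        Seq(
          // a subset of the pairs enumerated in the patch
          ConfVars.METASTORE_CLIENT_CONNECT_RETRY_DELAY -> TimeUnit.SECONDS,
          ConfVars.METASTORE_CLIENT_SOCKET_TIMEOUT -> TimeUnit.SECONDS
        ).map { case (confVar, unit) =>
          // getTimeVar parses suffixed values ("5s", "600000ms") into the requested unit
          confVar.varname -> hiveconf.getTimeVar(confVar, unit).toString
        }.toMap
    }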
[2/3] spark git commit: [SPARK-8064] [SQL] Build against Hive 1.2.1
http://git-wip-us.apache.org/repos/asf/spark/blob/a2409d1c/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
----------------------------------------------------------------------
diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
index 110f51a..567d7fa 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
@@ -20,15 +20,18 @@ package org.apache.spark.sql.hive

 import java.io.File
 import java.net.{URL, URLClassLoader}
 import java.sql.Timestamp
+import java.util.concurrent.TimeUnit

 import scala.collection.JavaConversions._
 import scala.collection.mutable.HashMap
 import scala.language.implicitConversions
+import scala.concurrent.duration._

 import org.apache.hadoop.fs.{FileSystem, Path}
 import org.apache.hadoop.hive.common.StatsSetupConst
 import org.apache.hadoop.hive.common.`type`.HiveDecimal
 import org.apache.hadoop.hive.conf.HiveConf
+import org.apache.hadoop.hive.conf.HiveConf.ConfVars
 import org.apache.hadoop.hive.ql.metadata.Table
 import org.apache.hadoop.hive.ql.parse.VariableSubstitution
 import org.apache.hadoop.hive.ql.session.SessionState
@@ -165,6 +168,16 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) with Logging {
   SessionState.setCurrentSessionState(executionHive.state)

   /**
+   * Overrides default Hive configurations to avoid breaking changes to Spark SQL users.
+   *  - allow SQL11 keywords to be used as identifiers
+   */
+  private[sql] def defaultOverides() = {
+    setConf(ConfVars.HIVE_SUPPORT_SQL11_RESERVED_KEYWORDS.varname, "false")
+  }
+
+  defaultOverides()
+
+  /**
    * The copy of the Hive client that is used to retrieve metadata from the Hive MetaStore.
    * The version of the Hive client that is used here must match the metastore that is configured
    * in the hive-site.xml file.
@@ -252,6 +265,10 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) with Logging {
   }

   protected[sql] override def parseSql(sql: String): LogicalPlan = {
+    var state = SessionState.get()
+    if (state == null) {
+      SessionState.setCurrentSessionState(tlSession.get().asInstanceOf[SQLSession].sessionState)
+    }
     super.parseSql(substitutor.substitute(hiveconf, sql))
   }
@@ -298,10 +315,21 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) with Logging {
     // Can we use fs.getContentSummary in future?
     // Seems fs.getContentSummary returns wrong table size on Jenkins. So we use
     // countFileSize to count the table size.
+    val stagingDir = metadataHive.getConf(HiveConf.ConfVars.STAGINGDIR.varname,
+      HiveConf.ConfVars.STAGINGDIR.defaultStrVal)
+
     def calculateTableSize(fs: FileSystem, path: Path): Long = {
       val fileStatus = fs.getFileStatus(path)
       val size = if (fileStatus.isDir) {
-        fs.listStatus(path).map(status => calculateTableSize(fs, status.getPath)).sum
+        fs.listStatus(path)
+          .map { status =>
+            if (!status.getPath().getName().startsWith(stagingDir)) {
+              calculateTableSize(fs, status.getPath)
+            } else {
+              0L
+            }
+          }
+          .sum
       } else {
         fileStatus.getLen
       }
@@ -398,7 +426,58 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) with Logging {
   }

   /** Overridden by child classes that need to set configuration before the client init. */
-  protected def configure(): Map[String, String] = Map.empty
+  protected def configure(): Map[String, String] = {
+    // Hive 0.14.0 introduces timeout operations in HiveConf, and changes default values of a bunch
+    // of time `ConfVar`s by adding time suffixes (`s`, `ms`, and `d` etc.). This breaks backwards-
+    // compatibility when users are trying to connect to a Hive metastore of lower version,
+    // because these options are expected to be integral values in lower versions of Hive.
+    //
+    // Here we enumerate all time `ConfVar`s and convert their values to numeric strings according
+    // to their output time units.
+    Seq(
+      ConfVars.METASTORE_CLIENT_CONNECT_RETRY_DELAY -> TimeUnit.SECONDS,
+      ConfVars.METASTORE_CLIENT_SOCKET_TIMEOUT -> TimeUnit.SECONDS,
+      ConfVars.METASTORE_CLIENT_SOCKET_LIFETIME -> TimeUnit.SECONDS,
+      ConfVars.HMSHANDLERINTERVAL -> TimeUnit.MILLISECONDS,
+      ConfVars.METASTORE_EVENT_DB_LISTENER_TTL -> TimeUnit.SECONDS,
+      ConfVars.METASTORE_EVENT_CLEAN_FREQ -> TimeUnit.SECONDS,
+      ConfVars.METASTORE_EVENT_EXPIRY_DURATION -> TimeUnit.SECONDS,
+      ConfVars.METASTORE_AGGREGATE_STATS_CACHE_TTL -> TimeUnit.SECONDS,
+      ConfVars.METASTORE_AGGREGATE_STATS_CACHE_MAX_WRITER_WAIT -> TimeUnit.MILLISECONDS,
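The other behavioral change in this hunk is the staging-directory filter in calculateTableSize: in-flight query output under Hive's staging directories must not count toward a table's size. A minimal sketch, assuming stagingDir is the value of hive.exec.stagingdir (e.g. ".hive-staging") and using the same Hadoop FileSystem/Path types as the diff:

    import org.apache.hadoop.fs.{FileSystem, Path}

    object TableSizeSketch {
      // Recursively sum file sizes under a table path while skipping any
      // entry whose name starts with the Hive staging-directory prefix.
      def calculateTableSize(fs: FileSystem, path: Path, stagingDir: String): Long = {
        val fileStatus = fs.getFileStatus(path)
        if (fileStatus.isDirectory) {
          fs.listStatus(path)
            .map { status =>
              // staging output being written by in-flight queries must not count
              if (!status.getPath.getName.startsWith(stagingDir)) {
                calculateTableSize(fs, status.getPath, stagingDir)
              } else {
                0L
              }
            }
            .sum
        } else {
          fileStatus.getLen
        }
      }
    }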
Git Push Summary
Repository: spark
Updated Tags:  refs/tags/v1.5.0-snapshot-20150803 [deleted] 4c4f638c7
Git Push Summary
Repository: spark
Updated Tags:  refs/tags/v1.5.0-snapshot-20150803 [created] 7e7147f3b
Git Push Summary
Repository: spark
Updated Tags:  refs/tags/v1.5.0-snapshot-20150803 [created] 35264204b
[2/2] spark git commit: Preparing development version 1.5.0-SNAPSHOT
Preparing development version 1.5.0-SNAPSHOT

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/73fab884
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/73fab884
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/73fab884

Branch: refs/heads/branch-1.5
Commit: 73fab8849f6288f36101f52d663a6e7339b6576e
Parents: 3526420
Author: Patrick Wendell <pwend...@gmail.com>
Authored: Mon Aug 3 16:37:34 2015 -0700
Committer: Patrick Wendell <pwend...@gmail.com>
Committed: Mon Aug 3 16:37:34 2015 -0700

----------------------------------------------------------------------
 assembly/pom.xml                    | 2 +-
 bagel/pom.xml                       | 2 +-
 core/pom.xml                        | 2 +-
 examples/pom.xml                    | 2 +-
 external/flume-assembly/pom.xml     | 2 +-
 external/flume-sink/pom.xml         | 2 +-
 external/flume/pom.xml              | 2 +-
 external/kafka-assembly/pom.xml     | 2 +-
 external/kafka/pom.xml              | 2 +-
 external/mqtt/pom.xml               | 2 +-
 external/twitter/pom.xml            | 2 +-
 external/zeromq/pom.xml             | 2 +-
 extras/java8-tests/pom.xml          | 2 +-
 extras/kinesis-asl-assembly/pom.xml | 2 +-
 extras/kinesis-asl/pom.xml          | 2 +-
 extras/spark-ganglia-lgpl/pom.xml   | 2 +-
 graphx/pom.xml                      | 2 +-
 launcher/pom.xml                    | 2 +-
 mllib/pom.xml                       | 2 +-
 network/common/pom.xml              | 2 +-
 network/shuffle/pom.xml             | 2 +-
 network/yarn/pom.xml                | 2 +-
 pom.xml                             | 2 +-
 repl/pom.xml                        | 2 +-
 sql/catalyst/pom.xml                | 2 +-
 sql/core/pom.xml                    | 2 +-
 sql/hive-thriftserver/pom.xml       | 2 +-
 sql/hive/pom.xml                    | 2 +-
 streaming/pom.xml                   | 2 +-
 tools/pom.xml                       | 2 +-
 unsafe/pom.xml                      | 2 +-
 yarn/pom.xml                        | 2 +-
 32 files changed, 32 insertions(+), 32 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/73fab884/assembly/pom.xml
----------------------------------------------------------------------
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 3ef7d6f..e9c6d26 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/73fab884/bagel/pom.xml
----------------------------------------------------------------------
diff --git a/bagel/pom.xml b/bagel/pom.xml
index 684e07b..ed5c37e 100644
--- a/bagel/pom.xml
+++ b/bagel/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/73fab884/core/pom.xml
----------------------------------------------------------------------
diff --git a/core/pom.xml b/core/pom.xml
index bb25652..0e53a79 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/73fab884/examples/pom.xml
----------------------------------------------------------------------
diff --git a/examples/pom.xml b/examples/pom.xml
index 9ef1eda..e6884b0 100644
--- a/examples/pom.xml
+++ b/examples/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/73fab884/external/flume-assembly/pom.xml
----------------------------------------------------------------------
diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml
index 6377c3e..1318959 100644
--- a/external/flume-assembly/pom.xml
+++ b/external/flume-assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/73fab884/external/flume-sink/pom.xml
----------------------------------------------------------------------
diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml
index 7d72f78..0664cfb 100644
--- a/external/flume-sink/pom.xml
[1/2] spark git commit: Preparing Spark release v1.5.0-snapshot-20150803
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 6bd12e819 -> 73fab8849

Preparing Spark release v1.5.0-snapshot-20150803

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/35264204
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/35264204
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/35264204

Branch: refs/heads/branch-1.5
Commit: 35264204b8e06c37ca99dd5c769aac20bdab161b
Parents: 6bd12e8
Author: Patrick Wendell <pwend...@gmail.com>
Authored: Mon Aug 3 16:37:27 2015 -0700
Committer: Patrick Wendell <pwend...@gmail.com>
Committed: Mon Aug 3 16:37:27 2015 -0700

----------------------------------------------------------------------
 assembly/pom.xml                    | 2 +-
 bagel/pom.xml                       | 2 +-
 core/pom.xml                        | 2 +-
 examples/pom.xml                    | 2 +-
 external/flume-assembly/pom.xml     | 2 +-
 external/flume-sink/pom.xml         | 2 +-
 external/flume/pom.xml              | 2 +-
 external/kafka-assembly/pom.xml     | 2 +-
 external/kafka/pom.xml              | 2 +-
 external/mqtt/pom.xml               | 2 +-
 external/twitter/pom.xml            | 2 +-
 external/zeromq/pom.xml             | 2 +-
 extras/java8-tests/pom.xml          | 2 +-
 extras/kinesis-asl-assembly/pom.xml | 2 +-
 extras/kinesis-asl/pom.xml          | 2 +-
 extras/spark-ganglia-lgpl/pom.xml   | 2 +-
 graphx/pom.xml                      | 2 +-
 launcher/pom.xml                    | 2 +-
 mllib/pom.xml                       | 2 +-
 network/common/pom.xml              | 2 +-
 network/shuffle/pom.xml             | 2 +-
 network/yarn/pom.xml                | 2 +-
 pom.xml                             | 2 +-
 repl/pom.xml                        | 2 +-
 sql/catalyst/pom.xml                | 2 +-
 sql/core/pom.xml                    | 2 +-
 sql/hive-thriftserver/pom.xml       | 2 +-
 sql/hive/pom.xml                    | 2 +-
 streaming/pom.xml                   | 2 +-
 tools/pom.xml                       | 2 +-
 unsafe/pom.xml                      | 2 +-
 yarn/pom.xml                        | 2 +-
 32 files changed, 32 insertions(+), 32 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/35264204/assembly/pom.xml
----------------------------------------------------------------------
diff --git a/assembly/pom.xml b/assembly/pom.xml
index e9c6d26..3ef7d6f 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0-SNAPSHOT</version>
+    <version>1.5.0</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/35264204/bagel/pom.xml
----------------------------------------------------------------------
diff --git a/bagel/pom.xml b/bagel/pom.xml
index ed5c37e..684e07b 100644
--- a/bagel/pom.xml
+++ b/bagel/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0-SNAPSHOT</version>
+    <version>1.5.0</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/35264204/core/pom.xml
----------------------------------------------------------------------
diff --git a/core/pom.xml b/core/pom.xml
index 0e53a79..bb25652 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0-SNAPSHOT</version>
+    <version>1.5.0</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/35264204/examples/pom.xml
----------------------------------------------------------------------
diff --git a/examples/pom.xml b/examples/pom.xml
index e6884b0..9ef1eda 100644
--- a/examples/pom.xml
+++ b/examples/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0-SNAPSHOT</version>
+    <version>1.5.0</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/35264204/external/flume-assembly/pom.xml
----------------------------------------------------------------------
diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml
index 1318959..6377c3e 100644
--- a/external/flume-assembly/pom.xml
+++ b/external/flume-assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0-SNAPSHOT</version>
+    <version>1.5.0</version>
     <relativePath>../../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/35264204/external/flume-sink/pom.xml
----------------------------------------------------------------------
diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml
spark git commit: [SPARK-8416] highlight and topping the executor threads in thread dumping page
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 34335719a -> 93076ae39

[SPARK-8416] highlight and topping the executor threads in thread dumping page

https://issues.apache.org/jira/browse/SPARK-8416

To facilitate debugging, I made this patch with three changes:

* render the executor-thread and non executor-thread entries with different background colors
* put the executor threads on the top of the list
* sort the threads alphabetically

Author: CodingCat <zhunans...@gmail.com>

Closes #7808 from CodingCat/SPARK-8416 and squashes the following commits:

34fc708 [CodingCat] fix className
d7b79dd [CodingCat] lowercase threadName
d032882 [CodingCat] sort alphabetically and change the css class name
f0513b1 [CodingCat] change the color group threads by name
2da6e06 [CodingCat] small fix
3fc9f36 [CodingCat] define classes in webui.css
8ee125e [CodingCat] highlight and put on top the executor threads in thread dumping page

(cherry picked from commit 3b0e44490aebfba30afc147e4a34a63439d985c6)
Signed-off-by: Josh Rosen <joshro...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/93076ae3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/93076ae3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/93076ae3

Branch: refs/heads/branch-1.5
Commit: 93076ae39b58ba8c4a459f2b3a8590c492dc5c4e
Parents: 3433571
Author: CodingCat <zhunans...@gmail.com>
Authored: Mon Aug 3 18:20:40 2015 -0700
Committer: Josh Rosen <joshro...@databricks.com>
Committed: Mon Aug 3 18:20:56 2015 -0700

----------------------------------------------------------------------
 .../org/apache/spark/ui/static/webui.css       |  8 +++
 .../spark/ui/exec/ExecutorThreadDumpPage.scala | 24 +---
 2 files changed, 29 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/93076ae3/core/src/main/resources/org/apache/spark/ui/static/webui.css
----------------------------------------------------------------------
diff --git a/core/src/main/resources/org/apache/spark/ui/static/webui.css b/core/src/main/resources/org/apache/spark/ui/static/webui.css
index 648cd1b..04f3070 100644
--- a/core/src/main/resources/org/apache/spark/ui/static/webui.css
+++ b/core/src/main/resources/org/apache/spark/ui/static/webui.css
@@ -224,3 +224,11 @@ span.additional-metric-title {
 a.expandbutton {
   cursor: pointer;
 }
+
+.executor-thread {
+  background: #E6E6E6;
+}
+
+.non-executor-thread {
+  background: #FAFAFA;
+}
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/spark/blob/93076ae3/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala b/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala
index f0ae95b..b0a2cb4 100644
--- a/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala
+++ b/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala
@@ -49,11 +49,29 @@ private[ui] class ExecutorThreadDumpPage(parent: ExecutorsTab) extends WebUIPage
     val maybeThreadDump = sc.get.getExecutorThreadDump(executorId)

     val content = maybeThreadDump.map { threadDump =>
-      val dumpRows = threadDump.map { thread =>
+      val dumpRows = threadDump.sortWith {
+        case (threadTrace1, threadTrace2) => {
+          val v1 = if (threadTrace1.threadName.contains("Executor task launch")) 1 else 0
+          val v2 = if (threadTrace2.threadName.contains("Executor task launch")) 1 else 0
+          if (v1 == v2) {
+            threadTrace1.threadName.toLowerCase < threadTrace2.threadName.toLowerCase
+          } else {
+            v1 > v2
+          }
+        }
+      }.map { thread =>
+        val threadName = thread.threadName
+        val className = "accordion-heading " + {
+          if (threadName.contains("Executor task launch")) {
+            "executor-thread"
+          } else {
+            "non-executor-thread"
+          }
+        }
         <div class="accordion-group">
-          <div class="accordion-heading" onclick="$(this).next().toggleClass('hidden')">
+          <div class={className} onclick="$(this).next().toggleClass('hidden')">
             <a class="accordion-toggle">
-              Thread {thread.threadId}: {thread.threadName} ({thread.threadState})
+              Thread {thread.threadId}: {threadName} ({thread.threadState})
             </a>
           </div>
           <div class="accordion-body hidden">
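Extracted from the diff, the new row ordering is just a two-key comparator: executor task-launch threads first, then case-insensitive thread-name order. A self-contained sketch follows; the case class is a hypothetical stand-in for Spark's internal thread-dump type, with only the fields the page renders.

    case class ThreadTrace(threadId: Long, threadName: String, threadState: String)

    object ThreadDumpOrderingSketch {
      def orderRows(threadDump: Seq[ThreadTrace]): Seq[ThreadTrace] =
        threadDump.sortWith { (t1, t2) =>
          // 1 marks an executor thread; these sort ahead of everything else
          val v1 = if (t1.threadName.contains("Executor task launch")) 1 else 0
          val v2 = if (t2.threadName.contains("Executor task launch")) 1 else 0
          if (v1 == v2) t1.threadName.toLowerCase < t2.threadName.toLowerCase
          else v1 > v2
        }

      def main(args: Array[String]): Unit = {
        val dump = Seq(
          ThreadTrace(1, "RMI TCP Connection", "RUNNABLE"),
          ThreadTrace(2, "Executor task launch worker-0", "RUNNABLE"),
          ThreadTrace(3, "dispatcher-event-loop-1", "WAITING"))
        // Prints the executor thread first, then the rest alphabetically
        orderRows(dump).foreach(t => println(t.threadName))
      }
    }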
spark git commit: [SPARK-8416] highlight and topping the executor threads in thread dumping page
Repository: spark
Updated Branches:
  refs/heads/master 1633d0a26 -> 3b0e44490

[SPARK-8416] highlight and topping the executor threads in thread dumping page

https://issues.apache.org/jira/browse/SPARK-8416

To facilitate debugging, I made this patch with three changes:

* render the executor-thread and non executor-thread entries with different background colors
* put the executor threads on the top of the list
* sort the threads alphabetically

Author: CodingCat <zhunans...@gmail.com>

Closes #7808 from CodingCat/SPARK-8416 and squashes the following commits:

34fc708 [CodingCat] fix className
d7b79dd [CodingCat] lowercase threadName
d032882 [CodingCat] sort alphabetically and change the css class name
f0513b1 [CodingCat] change the color group threads by name
2da6e06 [CodingCat] small fix
3fc9f36 [CodingCat] define classes in webui.css
8ee125e [CodingCat] highlight and put on top the executor threads in thread dumping page

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3b0e4449
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3b0e4449
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3b0e4449

Branch: refs/heads/master
Commit: 3b0e44490aebfba30afc147e4a34a63439d985c6
Parents: 1633d0a
Author: CodingCat <zhunans...@gmail.com>
Authored: Mon Aug 3 18:20:40 2015 -0700
Committer: Josh Rosen <joshro...@databricks.com>
Committed: Mon Aug 3 18:20:40 2015 -0700

----------------------------------------------------------------------
 .../org/apache/spark/ui/static/webui.css       |  8 +++
 .../spark/ui/exec/ExecutorThreadDumpPage.scala | 24 +---
 2 files changed, 29 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/3b0e4449/core/src/main/resources/org/apache/spark/ui/static/webui.css
----------------------------------------------------------------------
diff --git a/core/src/main/resources/org/apache/spark/ui/static/webui.css b/core/src/main/resources/org/apache/spark/ui/static/webui.css
index 648cd1b..04f3070 100644
--- a/core/src/main/resources/org/apache/spark/ui/static/webui.css
+++ b/core/src/main/resources/org/apache/spark/ui/static/webui.css
@@ -224,3 +224,11 @@ span.additional-metric-title {
 a.expandbutton {
   cursor: pointer;
 }
+
+.executor-thread {
+  background: #E6E6E6;
+}
+
+.non-executor-thread {
+  background: #FAFAFA;
+}
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/spark/blob/3b0e4449/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala b/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala
index f0ae95b..b0a2cb4 100644
--- a/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala
+++ b/core/src/main/scala/org/apache/spark/ui/exec/ExecutorThreadDumpPage.scala
@@ -49,11 +49,29 @@ private[ui] class ExecutorThreadDumpPage(parent: ExecutorsTab) extends WebUIPage
     val maybeThreadDump = sc.get.getExecutorThreadDump(executorId)

     val content = maybeThreadDump.map { threadDump =>
-      val dumpRows = threadDump.map { thread =>
+      val dumpRows = threadDump.sortWith {
+        case (threadTrace1, threadTrace2) => {
+          val v1 = if (threadTrace1.threadName.contains("Executor task launch")) 1 else 0
+          val v2 = if (threadTrace2.threadName.contains("Executor task launch")) 1 else 0
+          if (v1 == v2) {
+            threadTrace1.threadName.toLowerCase < threadTrace2.threadName.toLowerCase
+          } else {
+            v1 > v2
+          }
+        }
+      }.map { thread =>
+        val threadName = thread.threadName
+        val className = "accordion-heading " + {
+          if (threadName.contains("Executor task launch")) {
+            "executor-thread"
+          } else {
+            "non-executor-thread"
+          }
+        }
         <div class="accordion-group">
-          <div class="accordion-heading" onclick="$(this).next().toggleClass('hidden')">
+          <div class={className} onclick="$(this).next().toggleClass('hidden')">
             <a class="accordion-toggle">
-              Thread {thread.threadId}: {thread.threadName} ({thread.threadState})
+              Thread {thread.threadId}: {threadName} ({thread.threadState})
             </a>
           </div>
           <div class="accordion-body hidden">
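The className computation pairs with the two CSS rules added to webui.css above. A minimal sketch of that selection logic on its own:

    object RowCssClassSketch {
      // Executor task-launch threads get the darker .executor-thread background
      // from webui.css; everything else gets .non-executor-thread.
      def rowCssClass(threadName: String): String =
        "accordion-heading " + {
          if (threadName.contains("Executor task launch")) "executor-thread"
          else "non-executor-thread"
        }

      def main(args: Array[String]): Unit = {
        println(rowCssClass("Executor task launch worker-0")) // accordion-heading executor-thread
        println(rowCssClass("dispatcher-event-loop-1"))       // accordion-heading non-executor-thread
      }
    }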
[1/3] spark git commit: [SPARK-8064] [SQL] Build against Hive 1.2.1
Repository: spark
Updated Branches:
  refs/heads/master b2e4b85d2 -> a2409d1c8

http://git-wip-us.apache.org/repos/asf/spark/blob/a2409d1c/sql/hive/src/test/resources/golden/parenthesis_star_by-5-6888c7f7894910538d82eefa23443189
----------------------------------------------------------------------
diff --git a/sql/hive/src/test/resources/golden/parenthesis_star_by-5-6888c7f7894910538d82eefa23443189 b/sql/hive/src/test/resources/golden/parenthesis_star_by-5-6888c7f7894910538d82eefa23443189
deleted file mode 100644
index b70e127e..000
--- a/sql/hive/src/test/resources/golden/parenthesis_star_by-5-6888c7f7894910538d82eefa23443189
+++ /dev/null
@@ -1,500 +0,0 @@
-0	val_0
-0	val_0
-0	val_0
-2	val_2
-4	val_4
[The rest of this message is the remainder of the deleted golden file: 500 tab-separated key/value rows of the same "N	val_N" form; the digest truncates the listing around "281	val_281".]
[3/3] spark git commit: [SPARK-8064] [SQL] Build against Hive 1.2.1
[SPARK-8064] [SQL] Build against Hive 1.2.1

Cherry picked the parts of the initial SPARK-8064 WiP branch needed to get sql/hive to compile against Hive 1.2.1. That's the ASF release packaged under org.apache.hive, not any fork. Tests not run yet: that's what the machines are for.

Author: Steve Loughran <ste...@hortonworks.com>
Author: Cheng Lian <l...@databricks.com>
Author: Michael Armbrust <mich...@databricks.com>
Author: Patrick Wendell <patr...@databricks.com>

Closes #7191 from steveloughran/stevel/feature/SPARK-8064-hive-1.2-002 and squashes the following commits:

7556d85 [Cheng Lian] Updates .q files and corresponding golden files
ef4af62 [Steve Loughran] Merge commit '6a92bb09f46a04d6cd8c41bdba3ecb727ebb9030' into stevel/feature/SPARK-8064-hive-1.2-002
6a92bb0 [Cheng Lian] Overrides HiveConf time vars
dcbb391 [Cheng Lian] Adds com.twitter:parquet-hadoop-bundle:1.6.0 for Hive Parquet SerDe
0bbe475 [Steve Loughran] SPARK-8064 scalastyle rejects the standard Hadoop ASF license header...
fdf759b [Steve Loughran] SPARK-8064 classpath dependency suite to be in sync with shading in final (?) hive-exec spark
7a6c727 [Steve Loughran] SPARK-8064 switch to second staging repo of the spark-hive artifacts. This one has the protobuf-shaded hive-exec jar
376c003 [Steve Loughran] SPARK-8064 purge duplicate protobuf declaration
2c74697 [Steve Loughran] SPARK-8064 switch to the protobuf-shaded hive-exec jar, with tests to chase it down
cc44020 [Steve Loughran] SPARK-8064 remove hadoop.version from runtest.py, as the profile will fix that automatically
6901fa9 [Steve Loughran] SPARK-8064 explicit protobuf import
da310dc [Michael Armbrust] Fixes for Hive tests.
a775a75 [Steve Loughran] SPARK-8064 cherry-pick-incomplete
7404f34 [Patrick Wendell] Add spark-hive staging repo
832c164 [Steve Loughran] SPARK-8064 try to suppress compiler warnings on Complex.java pasted-thrift-code
312c0d4 [Steve Loughran] SPARK-8064 maven/ivy dependency purge; calcite declaration needed
fa5ae7b [Steve Loughran] HIVE-8064 fix up hive-thriftserver dependencies and cut back on evicted references in the hive- packages; this keeps mvn and ivy resolution compatible, as the reconciliation policy is by hand
c188048 [Steve Loughran] SPARK-8064 manage the Hive dependencies so that
 - things that aren't needed are excluded
 - sql/hive built with ivy is in sync with the maven reconciliation policy, rather than latest-first
4c8be8d [Cheng Lian] WIP: Partial fix for Thrift server and CLI tests
314eb3c [Steve Loughran] SPARK-8064 deprecation warning noise in one of the tests
17b0341 [Steve Loughran] SPARK-8064 IDE-hinted cleanups of Complex.java to reduce compiler warnings. It's all autogenerated code, so still ugly.
d029b92 [Steve Loughran] SPARK-8064 rely on unescaping to have already taken place, so go straight to map of serde options
23eca7e [Steve Loughran] HIVE-8064 handle raw and escaped property tokens
54d9b06 [Steve Loughran] SPARK-8064 fix compilation regression surfacing from rebase
0b12d5f [Steve Loughran] HIVE-8064 use subset of hive complex type whose types deserialize
fce73b6 [Steve Loughran] SPARK-8064 poms rely implicitly on the version of kryo that chill provides
fd3aa5d [Steve Loughran] SPARK-8064 version of hive to d/l from ivy is 1.2.1
dc73ece [Steve Loughran] SPARK-8064 revert to master's deterministic pushdown strategy
d3c1e4a [Steve Loughran] SPARK-8064 purge UnionType
051cc21 [Steve Loughran] SPARK-8064 switch to an unshaded version of hive-exec-core, which must have been built with Kryo 2.21. This currently looks for a (locally built) version 1.2.1.spark
6684c60 [Steve Loughran] SPARK-8064 ignore RTE raised in blocking process.exitValue() call
e6121e5 [Steve Loughran] SPARK-8064 address review comments
aa43dc6 [Steve Loughran] SPARK-8064 more robust teardown on JavaMetastoreDatasourcesSuite
f2bff01 [Steve Loughran] SPARK-8064 better takeup of asynchronously caught error text
8b1ef38 [Steve Loughran] SPARK-8064: on failures executing spark-submit in HiveSparkSubmitSuite, print command line and all logged output.
5a9ce6b [Steve Loughran] SPARK-8064 add explicit reason for kv split failure, rather than array OOB. *does not address the issue*
642b63a [Steve Loughran] SPARK-8064 reinstate something cut briefly during rebasing
97194dc [Steve Loughran] SPARK-8064 add extra logging to the YarnClusterSuite classpath test. There should be no reason why this is failing on jenkins, but as it is (and presumably it's CP-related), improve the logging including any exception raised.
335357f [Steve Loughran] SPARK-8064 fail fast on thrift process spawning tests on exit codes and/or error string patterns seen in log.
3ed872f [Steve Loughran] SPARK-8064 rename field "double" to "dbl"
bca55e5 [Steve Loughran] SPARK-8064 missed one of the `date` escapes
41d6479 [Steve Loughran] SPARK-8064 wrap tests with withTable() calls to avoid table-exists exceptions
2bc29a4 [Steve Loughran] SPARK-8064 ParquetSuites to escape `date` field name
[1/3] spark git commit: [SPARK-8064] [SQL] Build against Hive 1.2.1
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 db5832708 -> 6bd12e819

http://git-wip-us.apache.org/repos/asf/spark/blob/6bd12e81/sql/hive/src/test/resources/golden/parenthesis_star_by-5-6888c7f7894910538d82eefa23443189
----------------------------------------------------------------------
diff --git a/sql/hive/src/test/resources/golden/parenthesis_star_by-5-6888c7f7894910538d82eefa23443189 b/sql/hive/src/test/resources/golden/parenthesis_star_by-5-6888c7f7894910538d82eefa23443189
deleted file mode 100644
index b70e127e..000
--- a/sql/hive/src/test/resources/golden/parenthesis_star_by-5-6888c7f7894910538d82eefa23443189
+++ /dev/null
@@ -1,500 +0,0 @@
-0	val_0
-0	val_0
-0	val_0
-2	val_2
-4	val_4
[The rest of this message is the remainder of the deleted golden file: 500 tab-separated key/value rows of the same "N	val_N" form; the digest truncates the listing around "281".]
[3/3] spark git commit: [SPARK-8064] [SQL] Build against Hive 1.2.1
[SPARK-8064] [SQL] Build against Hive 1.2.1

Cherry picked the parts of the initial SPARK-8064 WiP branch needed to get sql/hive to compile against Hive 1.2.1. That's the ASF release packaged under org.apache.hive, not any fork. Tests not run yet: that's what the machines are for.

Author: Steve Loughran <ste...@hortonworks.com>
Author: Cheng Lian <l...@databricks.com>
Author: Michael Armbrust <mich...@databricks.com>
Author: Patrick Wendell <patr...@databricks.com>

Closes #7191 from steveloughran/stevel/feature/SPARK-8064-hive-1.2-002 and squashes the following commits:

7556d85 [Cheng Lian] Updates .q files and corresponding golden files
ef4af62 [Steve Loughran] Merge commit '6a92bb09f46a04d6cd8c41bdba3ecb727ebb9030' into stevel/feature/SPARK-8064-hive-1.2-002
6a92bb0 [Cheng Lian] Overrides HiveConf time vars
dcbb391 [Cheng Lian] Adds com.twitter:parquet-hadoop-bundle:1.6.0 for Hive Parquet SerDe
0bbe475 [Steve Loughran] SPARK-8064 scalastyle rejects the standard Hadoop ASF license header...
fdf759b [Steve Loughran] SPARK-8064 classpath dependency suite to be in sync with shading in final (?) hive-exec spark
7a6c727 [Steve Loughran] SPARK-8064 switch to second staging repo of the spark-hive artifacts. This one has the protobuf-shaded hive-exec jar
376c003 [Steve Loughran] SPARK-8064 purge duplicate protobuf declaration
2c74697 [Steve Loughran] SPARK-8064 switch to the protobuf-shaded hive-exec jar, with tests to chase it down
cc44020 [Steve Loughran] SPARK-8064 remove hadoop.version from runtest.py, as the profile will fix that automatically
6901fa9 [Steve Loughran] SPARK-8064 explicit protobuf import
da310dc [Michael Armbrust] Fixes for Hive tests.
a775a75 [Steve Loughran] SPARK-8064 cherry-pick-incomplete
7404f34 [Patrick Wendell] Add spark-hive staging repo
832c164 [Steve Loughran] SPARK-8064 try to suppress compiler warnings on Complex.java pasted-thrift-code
312c0d4 [Steve Loughran] SPARK-8064 maven/ivy dependency purge; calcite declaration needed
fa5ae7b [Steve Loughran] HIVE-8064 fix up hive-thriftserver dependencies and cut back on evicted references in the hive- packages; this keeps mvn and ivy resolution compatible, as the reconciliation policy is by hand
c188048 [Steve Loughran] SPARK-8064 manage the Hive dependencies so that
 - things that aren't needed are excluded
 - sql/hive built with ivy is in sync with the maven reconciliation policy, rather than latest-first
4c8be8d [Cheng Lian] WIP: Partial fix for Thrift server and CLI tests
314eb3c [Steve Loughran] SPARK-8064 deprecation warning noise in one of the tests
17b0341 [Steve Loughran] SPARK-8064 IDE-hinted cleanups of Complex.java to reduce compiler warnings. It's all autogenerated code, so still ugly.
d029b92 [Steve Loughran] SPARK-8064 rely on unescaping to have already taken place, so go straight to map of serde options
23eca7e [Steve Loughran] HIVE-8064 handle raw and escaped property tokens
54d9b06 [Steve Loughran] SPARK-8064 fix compilation regression surfacing from rebase
0b12d5f [Steve Loughran] HIVE-8064 use subset of hive complex type whose types deserialize
fce73b6 [Steve Loughran] SPARK-8064 poms rely implicitly on the version of kryo that chill provides
fd3aa5d [Steve Loughran] SPARK-8064 version of hive to d/l from ivy is 1.2.1
dc73ece [Steve Loughran] SPARK-8064 revert to master's deterministic pushdown strategy
d3c1e4a [Steve Loughran] SPARK-8064 purge UnionType
051cc21 [Steve Loughran] SPARK-8064 switch to an unshaded version of hive-exec-core, which must have been built with Kryo 2.21. This currently looks for a (locally built) version 1.2.1.spark
6684c60 [Steve Loughran] SPARK-8064 ignore RTE raised in blocking process.exitValue() call
e6121e5 [Steve Loughran] SPARK-8064 address review comments
aa43dc6 [Steve Loughran] SPARK-8064 more robust teardown on JavaMetastoreDatasourcesSuite
f2bff01 [Steve Loughran] SPARK-8064 better takeup of asynchronously caught error text
8b1ef38 [Steve Loughran] SPARK-8064: on failures executing spark-submit in HiveSparkSubmitSuite, print command line and all logged output.
5a9ce6b [Steve Loughran] SPARK-8064 add explicit reason for kv split failure, rather than array OOB. *does not address the issue*
642b63a [Steve Loughran] SPARK-8064 reinstate something cut briefly during rebasing
97194dc [Steve Loughran] SPARK-8064 add extra logging to the YarnClusterSuite classpath test. There should be no reason why this is failing on jenkins, but as it is (and presumably it's CP-related), improve the logging including any exception raised.
335357f [Steve Loughran] SPARK-8064 fail fast on thrift process spawning tests on exit codes and/or error string patterns seen in log.
3ed872f [Steve Loughran] SPARK-8064 rename field "double" to "dbl"
bca55e5 [Steve Loughran] SPARK-8064 missed one of the `date` escapes
41d6479 [Steve Loughran] SPARK-8064 wrap tests with withTable() calls to avoid table-exists exceptions
2bc29a4 [Steve Loughran] SPARK-8064 ParquetSuites to escape `date` field name
Git Push Summary
Repository: spark
Updated Tags:  refs/tags/v1.5.0-snapshot-20150803 [created] 4c4f638c7
[2/2] spark git commit: Preparing development version 1.5.0-SNAPSHOT
Preparing development version 1.5.0-SNAPSHOT

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bc49ca46
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bc49ca46
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bc49ca46

Branch: refs/heads/branch-1.5
Commit: bc49ca468d3abe4949382a32de92f963f454d36a
Parents: 4c4f638
Author: Patrick Wendell <pwend...@gmail.com>
Authored: Mon Aug 3 16:54:56 2015 -0700
Committer: Patrick Wendell <pwend...@gmail.com>
Committed: Mon Aug 3 16:54:56 2015 -0700

----------------------------------------------------------------------
 assembly/pom.xml                    | 2 +-
 bagel/pom.xml                       | 2 +-
 core/pom.xml                        | 2 +-
 examples/pom.xml                    | 2 +-
 external/flume-assembly/pom.xml     | 2 +-
 external/flume-sink/pom.xml         | 2 +-
 external/flume/pom.xml              | 2 +-
 external/kafka-assembly/pom.xml     | 2 +-
 external/kafka/pom.xml              | 2 +-
 external/mqtt/pom.xml               | 2 +-
 external/twitter/pom.xml            | 2 +-
 external/zeromq/pom.xml             | 2 +-
 extras/java8-tests/pom.xml          | 2 +-
 extras/kinesis-asl-assembly/pom.xml | 2 +-
 extras/kinesis-asl/pom.xml          | 2 +-
 extras/spark-ganglia-lgpl/pom.xml   | 2 +-
 graphx/pom.xml                      | 2 +-
 launcher/pom.xml                    | 2 +-
 mllib/pom.xml                       | 2 +-
 network/common/pom.xml              | 2 +-
 network/shuffle/pom.xml             | 2 +-
 network/yarn/pom.xml                | 2 +-
 pom.xml                             | 2 +-
 repl/pom.xml                        | 2 +-
 sql/catalyst/pom.xml                | 2 +-
 sql/core/pom.xml                    | 2 +-
 sql/hive-thriftserver/pom.xml       | 2 +-
 sql/hive/pom.xml                    | 2 +-
 streaming/pom.xml                   | 2 +-
 tools/pom.xml                       | 2 +-
 unsafe/pom.xml                      | 2 +-
 yarn/pom.xml                        | 2 +-
 32 files changed, 32 insertions(+), 32 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/bc49ca46/assembly/pom.xml
----------------------------------------------------------------------
diff --git a/assembly/pom.xml b/assembly/pom.xml
index 3ef7d6f..e9c6d26 100644
--- a/assembly/pom.xml
+++ b/assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/bc49ca46/bagel/pom.xml
----------------------------------------------------------------------
diff --git a/bagel/pom.xml b/bagel/pom.xml
index 684e07b..ed5c37e 100644
--- a/bagel/pom.xml
+++ b/bagel/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/bc49ca46/core/pom.xml
----------------------------------------------------------------------
diff --git a/core/pom.xml b/core/pom.xml
index bb25652..0e53a79 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/bc49ca46/examples/pom.xml
----------------------------------------------------------------------
diff --git a/examples/pom.xml b/examples/pom.xml
index 9ef1eda..e6884b0 100644
--- a/examples/pom.xml
+++ b/examples/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/bc49ca46/external/flume-assembly/pom.xml
----------------------------------------------------------------------
diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml
index 6377c3e..1318959 100644
--- a/external/flume-assembly/pom.xml
+++ b/external/flume-assembly/pom.xml
@@ -21,7 +21,7 @@
   <parent>
     <groupId>org.apache.spark</groupId>
     <artifactId>spark-parent_2.10</artifactId>
-    <version>1.5.0</version>
+    <version>1.5.0-SNAPSHOT</version>
     <relativePath>../../pom.xml</relativePath>
   </parent>

http://git-wip-us.apache.org/repos/asf/spark/blob/bc49ca46/external/flume-sink/pom.xml
----------------------------------------------------------------------
diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml
index 7d72f78..0664cfb 100644
--- a/external/flume-sink/pom.xml
[1/2] spark git commit: Preparing Spark release v1.5.0-snapshot-20150803
Repository: spark Updated Branches: refs/heads/branch-1.5 acda9d954 - bc49ca468 Preparing Spark release v1.5.0-snapshot-20150803 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4c4f638c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4c4f638c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4c4f638c Branch: refs/heads/branch-1.5 Commit: 4c4f638c7333b44049c75ae34486148ab74db333 Parents: acda9d9 Author: Patrick Wendell pwend...@gmail.com Authored: Mon Aug 3 16:54:50 2015 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Mon Aug 3 16:54:50 2015 -0700 -- assembly/pom.xml | 2 +- bagel/pom.xml | 2 +- core/pom.xml | 2 +- examples/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml | 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml | 2 +- pom.xml | 2 +- repl/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml | 2 +- 32 files changed, 32 insertions(+), 32 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4c4f638c/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index e9c6d26..3ef7d6f 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0-SNAPSHOT</version> +<version>1.5.0</version> <relativePath>../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/4c4f638c/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index ed5c37e..684e07b 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0-SNAPSHOT</version> +<version>1.5.0</version> <relativePath>../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/4c4f638c/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 0e53a79..bb25652 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0-SNAPSHOT</version> +<version>1.5.0</version> <relativePath>../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/4c4f638c/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index e6884b0..9ef1eda 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0-SNAPSHOT</version> +<version>1.5.0</version> <relativePath>../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/4c4f638c/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index 1318959..6377c3e 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId>
-<version>1.5.0-SNAPSHOT</version> +<version>1.5.0</version> <relativePath>../../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/4c4f638c/external/flume-sink/pom.xml -- diff --git a/external/flume-sink/pom.xml b
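These two release-prep commits flip every module's parent version between the release string and its -SNAPSHOT successor. As a toy illustration of that transformation only (a hypothetical helper, not Spark's actual release tooling), in Scala:

object VersionBump {
  // Toggle a "-SNAPSHOT" qualifier on the parent version inside a pom string.
  // Naive replacement for illustration; it assumes the version appears once.
  def toSnapshot(pom: String, release: String): String =
    pom.replace(s"<version>$release</version>", s"<version>$release-SNAPSHOT</version>")

  def toRelease(pom: String, release: String): String =
    pom.replace(s"<version>$release-SNAPSHOT</version>", s"<version>$release</version>")

  def main(args: Array[String]): Unit = {
    val pom = "<parent><version>1.5.0</version></parent>"
    println(toSnapshot(pom, "1.5.0")) // <parent><version>1.5.0-SNAPSHOT</version></parent>
  }
}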
spark git commit: [SPARK-9577][SQL] Surface concrete iterator types in various sort classes.
Repository: spark Updated Branches: refs/heads/master 3b0e44490 - 5eb89f67e [SPARK-9577][SQL] Surface concrete iterator types in various sort classes. We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the callsite uses that type and JIT can inline the iterator calls. Author: Reynold Xin r...@databricks.com Closes #7911 from rxin/surface-concrete-type and squashes the following commits: 0422add [Reynold Xin] [SPARK-9577][SQL] Surface concrete iterator types in various sort classes. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5eb89f67 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5eb89f67 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5eb89f67 Branch: refs/heads/master Commit: 5eb89f67e323dcf9fa3d5b30f9b5cb8f10ca1e8c Parents: 3b0e444 Author: Reynold Xin r...@databricks.com Authored: Mon Aug 3 18:47:02 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 18:47:02 2015 -0700 -- .../unsafe/sort/UnsafeExternalSorter.java | 2 +- .../unsafe/sort/UnsafeInMemorySorter.java | 6 +- .../sql/execution/UnsafeKVExternalSorter.java | 112 ++- .../UnsafeHybridAggregationIterator.scala | 30 + 4 files changed, 65 insertions(+), 85 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5eb89f67/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java -- diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java index bf5f965..dec7fcf 100644 --- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java +++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java @@ -428,7 +428,7 @@ public final class UnsafeExternalSorter { public UnsafeSorterIterator getSortedIterator() throws IOException { assert(inMemSorter != null); -final UnsafeSorterIterator inMemoryIterator = inMemSorter.getSortedIterator(); +final UnsafeInMemorySorter.SortedIterator inMemoryIterator = inMemSorter.getSortedIterator(); int numIteratorsToMerge = spillWriters.size() + (inMemoryIterator.hasNext() ? 
1 : 0); if (spillWriters.isEmpty()) { return inMemoryIterator; http://git-wip-us.apache.org/repos/asf/spark/blob/5eb89f67/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java -- diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java index 3131465..1e4b8a1 100644 --- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java +++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java @@ -133,7 +133,7 @@ public final class UnsafeInMemorySorter { pointerArrayInsertPosition++; } - private static final class SortedIterator extends UnsafeSorterIterator { + public static final class SortedIterator extends UnsafeSorterIterator { private final TaskMemoryManager memoryManager; private final int sortBufferInsertPosition; @@ -144,7 +144,7 @@ public final class UnsafeInMemorySorter { private long keyPrefix; private int recordLength; -SortedIterator( +private SortedIterator( TaskMemoryManager memoryManager, int sortBufferInsertPosition, long[] sortBuffer) { @@ -186,7 +186,7 @@ public final class UnsafeInMemorySorter { * Return an iterator over record pointers in sorted order. For efficiency, all calls to * {@code next()} will return the same mutable object. */ - public UnsafeSorterIterator getSortedIterator() { + public SortedIterator getSortedIterator() { sorter.sort(pointerArray, 0, pointerArrayInsertPosition / 2, sortComparator); return new SortedIterator(memoryManager, pointerArrayInsertPosition, pointerArray); } http://git-wip-us.apache.org/repos/asf/spark/blob/5eb89f67/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java b/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java index f6b0176..312ec8e 100644 ---
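The rationale above is a general JVM point: when a method's declared return type is abstract, the callsite only sees virtual calls, while a concrete (ideally final) return type lets the JIT bind and inline them. A minimal standalone sketch of the same change, with hypothetical names rather than Spark's classes:

abstract class RecordIterator {
  def hasNext: Boolean
  def next(): Long
}

final class SortedRecordIterator(buf: Array[Long]) extends RecordIterator {
  private var pos = 0
  override def hasNext: Boolean = pos < buf.length
  override def next(): Long = { val v = buf(pos); pos += 1; v }
}

object Sorter {
  // Before: abstract return type hides the concrete class from callers,
  // so hasNext()/next() stay virtual calls.
  def sortedAbstract(buf: Array[Long]): RecordIterator = sortedConcrete(buf)

  // After: concrete final return type, mirroring what this commit does for
  // UnsafeInMemorySorter.getSortedIterator, so the callsite binds statically.
  def sortedConcrete(buf: Array[Long]): SortedRecordIterator =
    new SortedRecordIterator(buf.sorted)
}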
spark git commit: [SPARK-9577][SQL] Surface concrete iterator types in various sort classes.
Repository: spark Updated Branches: refs/heads/branch-1.5 93076ae39 - ebe42b98c [SPARK-9577][SQL] Surface concrete iterator types in various sort classes. We often return abstract iterator types in various sort-related classes (e.g. UnsafeKVExternalSorter). It is actually better to return a more concrete type, so the callsite uses that type and JIT can inline the iterator calls. Author: Reynold Xin r...@databricks.com Closes #7911 from rxin/surface-concrete-type and squashes the following commits: 0422add [Reynold Xin] [SPARK-9577][SQL] Surface concrete iterator types in various sort classes. (cherry picked from commit 5eb89f67e323dcf9fa3d5b30f9b5cb8f10ca1e8c) Signed-off-by: Reynold Xin r...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ebe42b98 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ebe42b98 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ebe42b98 Branch: refs/heads/branch-1.5 Commit: ebe42b98c8fa0cac6ec267e895402cebe8a670a9 Parents: 93076ae Author: Reynold Xin r...@databricks.com Authored: Mon Aug 3 18:47:02 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 18:47:14 2015 -0700 -- .../unsafe/sort/UnsafeExternalSorter.java | 2 +- .../unsafe/sort/UnsafeInMemorySorter.java | 6 +- .../sql/execution/UnsafeKVExternalSorter.java | 112 ++- .../UnsafeHybridAggregationIterator.scala | 30 + 4 files changed, 65 insertions(+), 85 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ebe42b98/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java -- diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java index bf5f965..dec7fcf 100644 --- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java +++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java @@ -428,7 +428,7 @@ public final class UnsafeExternalSorter { public UnsafeSorterIterator getSortedIterator() throws IOException { assert(inMemSorter != null); -final UnsafeSorterIterator inMemoryIterator = inMemSorter.getSortedIterator(); +final UnsafeInMemorySorter.SortedIterator inMemoryIterator = inMemSorter.getSortedIterator(); int numIteratorsToMerge = spillWriters.size() + (inMemoryIterator.hasNext() ? 
1 : 0); if (spillWriters.isEmpty()) { return inMemoryIterator; http://git-wip-us.apache.org/repos/asf/spark/blob/ebe42b98/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java -- diff --git a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java index 3131465..1e4b8a1 100644 --- a/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java +++ b/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java @@ -133,7 +133,7 @@ public final class UnsafeInMemorySorter { pointerArrayInsertPosition++; } - private static final class SortedIterator extends UnsafeSorterIterator { + public static final class SortedIterator extends UnsafeSorterIterator { private final TaskMemoryManager memoryManager; private final int sortBufferInsertPosition; @@ -144,7 +144,7 @@ public final class UnsafeInMemorySorter { private long keyPrefix; private int recordLength; -SortedIterator( +private SortedIterator( TaskMemoryManager memoryManager, int sortBufferInsertPosition, long[] sortBuffer) { @@ -186,7 +186,7 @@ public final class UnsafeInMemorySorter { * Return an iterator over record pointers in sorted order. For efficiency, all calls to * {@code next()} will return the same mutable object. */ - public UnsafeSorterIterator getSortedIterator() { + public SortedIterator getSortedIterator() { sorter.sort(pointerArray, 0, pointerArrayInsertPosition / 2, sortComparator); return new SortedIterator(memoryManager, pointerArrayInsertPosition, pointerArray); } http://git-wip-us.apache.org/repos/asf/spark/blob/ebe42b98/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java
spark git commit: [SPARK-8874] [ML] Add missing methods in Word2Vec
Repository: spark Updated Branches: refs/heads/master a2409d1c8 - 13675c742 [SPARK-8874] [ML] Add missing methods in Word2Vec Add missing methods 1. getVectors 2. findSynonyms to W2Vec scala and python API mengxr Author: MechCoder manojkumarsivaraj...@gmail.com Closes #7263 from MechCoder/missing_methods_w2vec and squashes the following commits: 149d5ca [MechCoder] minor doc 69d91b7 [MechCoder] [SPARK-8874] [ML] Add missing methods in Word2Vec Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/13675c74 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/13675c74 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/13675c74 Branch: refs/heads/master Commit: 13675c742a71cbdc8324701c3694775ce1dd5c62 Parents: a2409d1 Author: MechCoder manojkumarsivaraj...@gmail.com Authored: Mon Aug 3 16:44:25 2015 -0700 Committer: Joseph K. Bradley jos...@databricks.com Committed: Mon Aug 3 16:44:25 2015 -0700 -- .../org/apache/spark/ml/feature/Word2Vec.scala | 38 +++- .../apache/spark/ml/feature/Word2VecSuite.scala | 62 2 files changed, 99 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/13675c74/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala index 6ea6590..b4f46ce 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala @@ -18,15 +18,17 @@ package org.apache.spark.ml.feature import org.apache.spark.annotation.Experimental +import org.apache.spark.SparkContext import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util.{Identifiable, SchemaUtils} import org.apache.spark.mllib.feature -import org.apache.spark.mllib.linalg.{VectorUDT, Vectors} +import org.apache.spark.mllib.linalg.{VectorUDT, Vector, Vectors} import org.apache.spark.mllib.linalg.BLAS._ import org.apache.spark.sql.DataFrame import org.apache.spark.sql.functions._ +import org.apache.spark.sql.SQLContext import org.apache.spark.sql.types._ /** @@ -146,6 +148,40 @@ class Word2VecModel private[ml] ( wordVectors: feature.Word2VecModel) extends Model[Word2VecModel] with Word2VecBase { + + /** + * Returns a dataframe with two fields, word and vector, with word being a String and + * the vector the DenseVector that it is mapped to. + */ + val getVectors: DataFrame = { +val sc = SparkContext.getOrCreate() +val sqlContext = SQLContext.getOrCreate(sc) +import sqlContext.implicits._ +val wordVec = wordVectors.getVectors.mapValues(vec => Vectors.dense(vec.map(_.toDouble))) +sc.parallelize(wordVec.toSeq).toDF("word", "vector") + } + + /** + * Find num number of words closest in similarity to the given word. + * Returns a dataframe with the words and the cosine similarities between the + * synonyms and the given word. + */ + def findSynonyms(word: String, num: Int): DataFrame = { +findSynonyms(wordVectors.transform(word), num) + } + + /** + * Find num number of words closest in similarity to the given vector representation + * of the word. Returns a dataframe with the words and the cosine similarities between the + * synonyms and the given word vector.
+ */ + def findSynonyms(word: Vector, num: Int): DataFrame = { +val sc = SparkContext.getOrCreate() +val sqlContext = SQLContext.getOrCreate(sc) +import sqlContext.implicits._ +sc.parallelize(wordVectors.findSynonyms(word, num)).toDF("word", "similarity") + } + /** @group setParam */ def setInputCol(value: String): this.type = set(inputCol, value) http://git-wip-us.apache.org/repos/asf/spark/blob/13675c74/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala index aa6ce53..adcda0e 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala @@ -67,5 +67,67 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext { assert(vector1 ~== vector2 absTol 1E-5, "Transformed vector is different with expected.") } } + + test("getVectors") { + +val sqlContext = new SQLContext(sc) +import sqlContext.implicits._ + +val sentence = "a b " * 100 + "a c " * 10 +val doc
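For context, a small usage sketch of the two new model methods (a 1.5-era local setup; the corpus, app name, and parameter values are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SQLContext

object Word2VecApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("w2v-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // Tiny toy corpus: one array of tokens per document.
    val docs = sqlContext.createDataFrame(Seq(
      "spark is a fast engine".split(" "),
      "spark runs on the jvm".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    val model = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("features")
      .setVectorSize(3)
      .setMinCount(0)
      .fit(docs)

    model.getVectors.show()               // one (word, vector) row per vocabulary entry
    model.findSynonyms("spark", 2).show() // (word, similarity) rows, ranked by cosine similarity
    sc.stop()
  }
}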
[2/2] spark git commit: Preparing development version 1.5.0-SNAPSHOT
Preparing development version 1.5.0-SNAPSHOT Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/74792e71 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/74792e71 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/74792e71 Branch: refs/heads/branch-1.5 Commit: 74792e71cb0584637041cb81660ec3ac4ea10c0b Parents: 7e7147f Author: Patrick Wendell pwend...@gmail.com Authored: Mon Aug 3 16:59:19 2015 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Mon Aug 3 16:59:19 2015 -0700 -- assembly/pom.xml | 2 +- bagel/pom.xml | 2 +- core/pom.xml | 2 +- examples/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml | 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml | 2 +- pom.xml | 2 +- repl/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml | 2 +- 32 files changed, 32 insertions(+), 32 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/74792e71/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 3ef7d6f..e9c6d26 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0</version> +<version>1.5.0-SNAPSHOT</version> <relativePath>../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/74792e71/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index 684e07b..ed5c37e 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0</version> +<version>1.5.0-SNAPSHOT</version> <relativePath>../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/74792e71/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index bb25652..0e53a79 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0</version> +<version>1.5.0-SNAPSHOT</version> <relativePath>../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/74792e71/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index 9ef1eda..e6884b0 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0</version> +<version>1.5.0-SNAPSHOT</version> <relativePath>../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/74792e71/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index 6377c3e..1318959 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0</version> +<version>1.5.0-SNAPSHOT</version> <relativePath>../../pom.xml</relativePath> </parent>
http://git-wip-us.apache.org/repos/asf/spark/blob/74792e71/external/flume-sink/pom.xml -- diff --git a/external/flume-sink/pom.xml b/external/flume-sink/pom.xml index 7d72f78..0664cfb 100644 --- a/external/flume-sink/pom.xml
[1/2] spark git commit: Preparing Spark release v1.5.0-snapshot-20150803
Repository: spark Updated Branches: refs/heads/branch-1.5 bc49ca468 - 74792e71c Preparing Spark release v1.5.0-snapshot-20150803 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7e7147f3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7e7147f3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7e7147f3 Branch: refs/heads/branch-1.5 Commit: 7e7147f3b8fee3ac4f2f1d14c3e6776a4d76bb3a Parents: bc49ca4 Author: Patrick Wendell pwend...@gmail.com Authored: Mon Aug 3 16:59:13 2015 -0700 Committer: Patrick Wendell pwend...@gmail.com Committed: Mon Aug 3 16:59:13 2015 -0700 -- assembly/pom.xml | 2 +- bagel/pom.xml | 2 +- core/pom.xml | 2 +- examples/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml | 2 +- external/kafka-assembly/pom.xml | 2 +- external/kafka/pom.xml | 2 +- external/mqtt/pom.xml | 2 +- external/twitter/pom.xml | 2 +- external/zeromq/pom.xml | 2 +- extras/java8-tests/pom.xml | 2 +- extras/kinesis-asl-assembly/pom.xml | 2 +- extras/kinesis-asl/pom.xml | 2 +- extras/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml | 2 +- launcher/pom.xml | 2 +- mllib/pom.xml | 2 +- network/common/pom.xml | 2 +- network/shuffle/pom.xml | 2 +- network/yarn/pom.xml | 2 +- pom.xml | 2 +- repl/pom.xml | 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- unsafe/pom.xml | 2 +- yarn/pom.xml | 2 +- 32 files changed, 32 insertions(+), 32 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7e7147f3/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index e9c6d26..3ef7d6f 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0-SNAPSHOT</version> +<version>1.5.0</version> <relativePath>../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/7e7147f3/bagel/pom.xml -- diff --git a/bagel/pom.xml b/bagel/pom.xml index ed5c37e..684e07b 100644 --- a/bagel/pom.xml +++ b/bagel/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0-SNAPSHOT</version> +<version>1.5.0</version> <relativePath>../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/7e7147f3/core/pom.xml -- diff --git a/core/pom.xml b/core/pom.xml index 0e53a79..bb25652 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0-SNAPSHOT</version> +<version>1.5.0</version> <relativePath>../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/7e7147f3/examples/pom.xml -- diff --git a/examples/pom.xml b/examples/pom.xml index e6884b0..9ef1eda 100644 --- a/examples/pom.xml +++ b/examples/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId> -<version>1.5.0-SNAPSHOT</version> +<version>1.5.0</version> <relativePath>../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/7e7147f3/external/flume-assembly/pom.xml -- diff --git a/external/flume-assembly/pom.xml b/external/flume-assembly/pom.xml index 1318959..6377c3e 100644 --- a/external/flume-assembly/pom.xml +++ b/external/flume-assembly/pom.xml @@ -21,7 +21,7 @@ <parent> <groupId>org.apache.spark</groupId> <artifactId>spark-parent_2.10</artifactId>
-<version>1.5.0-SNAPSHOT</version> +<version>1.5.0</version> <relativePath>../../pom.xml</relativePath> </parent> http://git-wip-us.apache.org/repos/asf/spark/blob/7e7147f3/external/flume-sink/pom.xml -- diff --git a/external/flume-sink/pom.xml b
spark git commit: Add a prerequisites section for building docs
Repository: spark Updated Branches: refs/heads/master 13675c742 - 7abaaad5b Add a prerequisites section for building docs This puts all the install commands that need to be run in one section instead of being spread over many paragraphs cc rxin Author: Shivaram Venkataraman shiva...@cs.berkeley.edu Closes #7912 from shivaram/docs-setup-readme and squashes the following commits: cf7a204 [Shivaram Venkataraman] Add a prerequisites section for building docs Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7abaaad5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7abaaad5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7abaaad5 Branch: refs/heads/master Commit: 7abaaad5b169520fbf7299808b2bafde089a16a2 Parents: 13675c7 Author: Shivaram Venkataraman shiva...@cs.berkeley.edu Authored: Mon Aug 3 17:00:59 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 17:00:59 2015 -0700 -- docs/README.md | 10 ++ 1 file changed, 10 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7abaaad5/docs/README.md -- diff --git a/docs/README.md b/docs/README.md index d7652e9..5020989 100644 --- a/docs/README.md +++ b/docs/README.md @@ -8,6 +8,16 @@ Read on to learn more about viewing documentation in plain text (i.e., markdown) documentation yourself. Why build it yourself? So that you have the docs that correspond to whichever version of Spark you currently have checked out of revision control. +## Prerequisites +The Spark documentation build uses a number of tools to build HTML docs and API docs in Scala, Python +and R. To get started you can run the following commands: + +$ sudo gem install jekyll +$ sudo gem install jekyll-redirect-from +$ sudo pip install Pygments +$ Rscript -e 'install.packages(c("knitr", "devtools"), repos="http://cran.stat.ucla.edu/")' + + ## Generating the Documentation HTML We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9483] Fix UTF8String.getPrefix for big-endian.
Repository: spark Updated Branches: refs/heads/branch-1.5 74792e71c - 73c863ac8 [SPARK-9483] Fix UTF8String.getPrefix for big-endian. Previous code assumed little-endian. Author: Matthew Brandyberry mbra...@us.ibm.com Closes #7902 from mtbrandy/SPARK-9483 and squashes the following commits: ec31df8 [Matthew Brandyberry] [SPARK-9483] Changes from review comments. 17d54c6 [Matthew Brandyberry] [SPARK-9483] Fix UTF8String.getPrefix for big-endian. (cherry picked from commit b79b4f5f2251ed7efeec1f4b26e45a8ea6b85a6a) Signed-off-by: Davies Liu davies@gmail.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/73c863ac Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/73c863ac Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/73c863ac Branch: refs/heads/branch-1.5 Commit: 73c863ac8e8f6cf664f51c64da1da695f341b273 Parents: 74792e7 Author: Matthew Brandyberry mbra...@us.ibm.com Authored: Mon Aug 3 17:36:56 2015 -0700 Committer: Davies Liu davies@gmail.com Committed: Mon Aug 3 17:37:47 2015 -0700 -- .../apache/spark/unsafe/types/UTF8String.java | 40 +++- 1 file changed, 30 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/73c863ac/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java -- diff --git a/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java b/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java index f6c9b87..d80bd57 100644 --- a/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java +++ b/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java @@ -20,6 +20,7 @@ package org.apache.spark.unsafe.types; import javax.annotation.Nonnull; import java.io.Serializable; import java.io.UnsupportedEncodingException; +import java.nio.ByteOrder; import java.util.Arrays; import org.apache.spark.unsafe.PlatformDependent; @@ -53,6 +54,8 @@ public final class UTF8String implements Comparable<UTF8String>, Serializable { 5, 5, 5, 5, 6, 6}; + private static ByteOrder byteOrder = ByteOrder.nativeOrder(); + public static final UTF8String EMPTY_UTF8 = UTF8String.fromString(""); /** @@ -175,18 +178,35 @@ public final class UTF8String implements Comparable<UTF8String>, Serializable { // If size is greater than 4, assume we have at least 8 bytes of data to fetch. // After getting the data, we use a mask to mask out data that is not part of the string.
long p; -if (numBytes >= 8) { - p = PlatformDependent.UNSAFE.getLong(base, offset); -} else if (numBytes > 4) { - p = PlatformDependent.UNSAFE.getLong(base, offset); - p = p & ((1L << numBytes * 8) - 1); -} else if (numBytes > 0) { - p = (long) PlatformDependent.UNSAFE.getInt(base, offset); - p = p & ((1L << numBytes * 8) - 1); +long mask = 0; +if (byteOrder == ByteOrder.LITTLE_ENDIAN) { + if (numBytes >= 8) { +p = PlatformDependent.UNSAFE.getLong(base, offset); + } else if (numBytes > 4) { +p = PlatformDependent.UNSAFE.getLong(base, offset); +mask = (1L << (8 - numBytes) * 8) - 1; + } else if (numBytes > 0) { +p = (long) PlatformDependent.UNSAFE.getInt(base, offset); +mask = (1L << (8 - numBytes) * 8) - 1; + } else { +p = 0; + } + p = java.lang.Long.reverseBytes(p); } else { - p = 0; + // byteOrder == ByteOrder.BIG_ENDIAN + if (numBytes >= 8) { +p = PlatformDependent.UNSAFE.getLong(base, offset); + } else if (numBytes > 4) { +p = PlatformDependent.UNSAFE.getLong(base, offset); +mask = (1L << (8 - numBytes) * 8) - 1; + } else if (numBytes > 0) { +p = ((long) PlatformDependent.UNSAFE.getInt(base, offset)) << 32; +mask = (1L << (8 - numBytes) * 8) - 1; + } else { +p = 0; + } } -p = java.lang.Long.reverseBytes(p); +p &= ~mask; return p; } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
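The patch's idea, restated outside Unsafe: build an 8-byte big-endian-ordered prefix of the string, with any bytes past numBytes masked to zero, so that prefix comparison agrees with byte-wise comparison on both little- and big-endian hosts. A minimal sketch in Scala using ByteBuffer instead of PlatformDependent.UNSAFE (note the real code compares these prefixes as unsigned values, which this toy comparison glosses over):

import java.nio.{ByteBuffer, ByteOrder}

object PrefixSketch {
  // Build an 8-byte big-endian prefix; the zero-filled tail bytes play the
  // role of the patch's `p &= ~mask` masking step.
  def getPrefix(bytes: Array[Byte]): Long = {
    val buf = ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN)
    buf.put(bytes, 0, math.min(bytes.length, 8))
    buf.rewind()
    buf.getLong()
  }

  def main(args: Array[String]): Unit = {
    val a = getPrefix("abc".getBytes("UTF-8"))
    val b = getPrefix("abd".getBytes("UTF-8"))
    println(a < b) // true on both byte orders, since the first byte dominates
  }
}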
spark git commit: [SPARK-9483] Fix UTF8String.getPrefix for big-endian.
Repository: spark Updated Branches: refs/heads/master 7abaaad5b - b79b4f5f2 [SPARK-9483] Fix UTF8String.getPrefix for big-endian. Previous code assumed little-endian. Author: Matthew Brandyberry mbra...@us.ibm.com Closes #7902 from mtbrandy/SPARK-9483 and squashes the following commits: ec31df8 [Matthew Brandyberry] [SPARK-9483] Changes from review comments. 17d54c6 [Matthew Brandyberry] [SPARK-9483] Fix UTF8String.getPrefix for big-endian. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b79b4f5f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b79b4f5f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b79b4f5f Branch: refs/heads/master Commit: b79b4f5f2251ed7efeec1f4b26e45a8ea6b85a6a Parents: 7abaaad Author: Matthew Brandyberry mbra...@us.ibm.com Authored: Mon Aug 3 17:36:56 2015 -0700 Committer: Davies Liu davies@gmail.com Committed: Mon Aug 3 17:36:56 2015 -0700 -- .../apache/spark/unsafe/types/UTF8String.java | 40 +++- 1 file changed, 30 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b79b4f5f/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java -- diff --git a/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java b/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java index f6c9b87..d80bd57 100644 --- a/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java +++ b/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java @@ -20,6 +20,7 @@ package org.apache.spark.unsafe.types; import javax.annotation.Nonnull; import java.io.Serializable; import java.io.UnsupportedEncodingException; +import java.nio.ByteOrder; import java.util.Arrays; import org.apache.spark.unsafe.PlatformDependent; @@ -53,6 +54,8 @@ public final class UTF8String implements Comparable<UTF8String>, Serializable { 5, 5, 5, 5, 6, 6}; + private static ByteOrder byteOrder = ByteOrder.nativeOrder(); + public static final UTF8String EMPTY_UTF8 = UTF8String.fromString(""); /** @@ -175,18 +178,35 @@ public final class UTF8String implements Comparable<UTF8String>, Serializable { // If size is greater than 4, assume we have at least 8 bytes of data to fetch. // After getting the data, we use a mask to mask out data that is not part of the string.
long p; -if (numBytes >= 8) { - p = PlatformDependent.UNSAFE.getLong(base, offset); -} else if (numBytes > 4) { - p = PlatformDependent.UNSAFE.getLong(base, offset); - p = p & ((1L << numBytes * 8) - 1); -} else if (numBytes > 0) { - p = (long) PlatformDependent.UNSAFE.getInt(base, offset); - p = p & ((1L << numBytes * 8) - 1); +long mask = 0; +if (byteOrder == ByteOrder.LITTLE_ENDIAN) { + if (numBytes >= 8) { +p = PlatformDependent.UNSAFE.getLong(base, offset); + } else if (numBytes > 4) { +p = PlatformDependent.UNSAFE.getLong(base, offset); +mask = (1L << (8 - numBytes) * 8) - 1; + } else if (numBytes > 0) { +p = (long) PlatformDependent.UNSAFE.getInt(base, offset); +mask = (1L << (8 - numBytes) * 8) - 1; + } else { +p = 0; + } + p = java.lang.Long.reverseBytes(p); } else { - p = 0; + // byteOrder == ByteOrder.BIG_ENDIAN + if (numBytes >= 8) { +p = PlatformDependent.UNSAFE.getLong(base, offset); + } else if (numBytes > 4) { +p = PlatformDependent.UNSAFE.getLong(base, offset); +mask = (1L << (8 - numBytes) * 8) - 1; + } else if (numBytes > 0) { +p = ((long) PlatformDependent.UNSAFE.getInt(base, offset)) << 32; +mask = (1L << (8 - numBytes) * 8) - 1; + } else { +p = 0; + } } -p = java.lang.Long.reverseBytes(p); +p &= ~mask; return p; } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: Revert [SPARK-9372] [SQL] Filter nulls in join keys
Repository: spark Updated Branches: refs/heads/branch-1.5 29756ff11 - db5832708 Revert [SPARK-9372] [SQL] Filter nulls in join keys This reverts commit 687c8c37150f4c93f8e57d86bb56321a4891286b. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/db583270 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/db583270 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/db583270 Branch: refs/heads/branch-1.5 Commit: db5832708267f4a8413b0ad19c6a454c93f7800e Parents: 29756ff Author: Reynold Xin r...@databricks.com Authored: Mon Aug 3 14:51:36 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 14:51:36 2015 -0700 -- .../catalyst/expressions/nullFunctions.scala | 48 +--- .../sql/catalyst/optimizer/Optimizer.scala | 64 ++--- .../catalyst/plans/logical/basicOperators.scala | 32 +-- .../expressions/ExpressionEvalHelper.scala | 4 +- .../expressions/MathFunctionsSuite.scala | 3 +- .../expressions/NullFunctionsSuite.scala | 49 +--- .../apache/spark/sql/DataFrameNaFunctions.scala | 2 +- .../scala/org/apache/spark/sql/SQLConf.scala | 6 - .../scala/org/apache/spark/sql/SQLContext.scala | 5 +- .../extendedOperatorOptimizations.scala | 160 - .../optimizer/FilterNullsInJoinKeySuite.scala | 236 --- 11 files changed, 37 insertions(+), 572 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/db583270/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala index d58c475..287718f 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala @@ -210,58 +210,14 @@ case class IsNotNull(child: Expression) extends UnaryExpression with Predicate { } } -/** - * A predicate that is evaluated to be true if there are at least `n` null values. - */ -case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate { - override def nullable: Boolean = false - override def foldable: Boolean = children.forall(_.foldable) - override def toString: String = s"AtLeastNNulls($n, ${children.mkString(",")})" - - private[this] val childrenArray = children.toArray - - override def eval(input: InternalRow): Boolean = { -var numNulls = 0 -var i = 0 -while (i < childrenArray.length && numNulls < n) { - val evalC = childrenArray(i).eval(input) - if (evalC == null) { -numNulls += 1 - } - i += 1 -} -numNulls >= n - } - - override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { -val numNulls = ctx.freshName("numNulls") -val code = children.map { e => - val eval = e.gen(ctx) - s""" -if ($numNulls < $n) { - ${eval.code} - if (${eval.isNull}) { -$numNulls += 1; - } -} - """ -}.mkString("\n") -s""" - int $numNulls = 0; - $code - boolean ${ev.isNull} = false; - boolean ${ev.primitive} = $numNulls >= $n; - """ - } -} /** * A predicate that is evaluated to be true if there are at least `n` non-null and non-NaN values.
*/ -case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate { +case class AtLeastNNonNulls(n: Int, children: Seq[Expression]) extends Predicate { override def nullable: Boolean = false override def foldable: Boolean = children.forall(_.foldable) - override def toString: String = s"AtLeastNNonNullNans($n, ${children.mkString(",")})" + override def toString: String = s"AtLeastNNulls(n, ${children.mkString(",")})" private[this] val childrenArray = children.toArray http://git-wip-us.apache.org/repos/asf/spark/blob/db583270/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index e4b6294..29d706d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -31,14 +31,8 @@ import org.apache.spark.sql.types._ abstract class Optimizer extends RuleExecutor[LogicalPlan] -class DefaultOptimizer extends Optimizer { - - /** - * Override
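For reference, the short-circuiting null count that the reverted AtLeastNNulls predicate implemented, as a standalone Scala sketch (plain values stand in for Catalyst expressions):

def atLeastNNulls(n: Int, values: Array[Any]): Boolean = {
  var numNulls = 0
  var i = 0
  // Stop scanning as soon as n nulls have been seen.
  while (i < values.length && numNulls < n) {
    if (values(i) == null) numNulls += 1
    i += 1
  }
  numNulls >= n
}

// atLeastNNulls(2, Array("a", null, null, 1)) == true
// atLeastNNulls(3, Array("a", null, null, 1)) == false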
spark git commit: Revert [SPARK-9372] [SQL] Filter nulls in join keys
Repository: spark Updated Branches: refs/heads/master 702aa9d7f - b2e4b85d2 Revert [SPARK-9372] [SQL] Filter nulls in join keys This reverts commit 687c8c37150f4c93f8e57d86bb56321a4891286b. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b2e4b85d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b2e4b85d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b2e4b85d Branch: refs/heads/master Commit: b2e4b85d2db0320e9cbfaf5a5542f749f1f11cf4 Parents: 702aa9d Author: Reynold Xin r...@databricks.com Authored: Mon Aug 3 14:51:15 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 14:51:15 2015 -0700 -- .../catalyst/expressions/nullFunctions.scala | 48 +--- .../sql/catalyst/optimizer/Optimizer.scala | 64 ++--- .../catalyst/plans/logical/basicOperators.scala | 32 +-- .../expressions/ExpressionEvalHelper.scala | 4 +- .../expressions/MathFunctionsSuite.scala | 3 +- .../expressions/NullFunctionsSuite.scala | 49 +--- .../apache/spark/sql/DataFrameNaFunctions.scala | 2 +- .../scala/org/apache/spark/sql/SQLConf.scala | 6 - .../scala/org/apache/spark/sql/SQLContext.scala | 5 +- .../extendedOperatorOptimizations.scala | 160 - .../optimizer/FilterNullsInJoinKeySuite.scala | 236 --- 11 files changed, 37 insertions(+), 572 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b2e4b85d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala index d58c475..287718f 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala @@ -210,58 +210,14 @@ case class IsNotNull(child: Expression) extends UnaryExpression with Predicate { } } -/** - * A predicate that is evaluated to be true if there are at least `n` null values. - */ -case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate { - override def nullable: Boolean = false - override def foldable: Boolean = children.forall(_.foldable) - override def toString: String = s"AtLeastNNulls($n, ${children.mkString(",")})" - - private[this] val childrenArray = children.toArray - - override def eval(input: InternalRow): Boolean = { -var numNulls = 0 -var i = 0 -while (i < childrenArray.length && numNulls < n) { - val evalC = childrenArray(i).eval(input) - if (evalC == null) { -numNulls += 1 - } - i += 1 -} -numNulls >= n - } - - override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { -val numNulls = ctx.freshName("numNulls") -val code = children.map { e => - val eval = e.gen(ctx) - s""" -if ($numNulls < $n) { - ${eval.code} - if (${eval.isNull}) { -$numNulls += 1; - } -} - """ -}.mkString("\n") -s""" - int $numNulls = 0; - $code - boolean ${ev.isNull} = false; - boolean ${ev.primitive} = $numNulls >= $n; - """ - } -} /** * A predicate that is evaluated to be true if there are at least `n` non-null and non-NaN values.
*/ -case class AtLeastNNonNullNans(n: Int, children: Seq[Expression]) extends Predicate { +case class AtLeastNNonNulls(n: Int, children: Seq[Expression]) extends Predicate { override def nullable: Boolean = false override def foldable: Boolean = children.forall(_.foldable) - override def toString: String = s"AtLeastNNonNullNans($n, ${children.mkString(",")})" + override def toString: String = s"AtLeastNNulls(n, ${children.mkString(",")})" private[this] val childrenArray = children.toArray http://git-wip-us.apache.org/repos/asf/spark/blob/b2e4b85d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala index e4b6294..29d706d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala @@ -31,14 +31,8 @@ import org.apache.spark.sql.types._ abstract class Optimizer extends RuleExecutor[LogicalPlan] -class DefaultOptimizer extends Optimizer { - - /** - * Override to
spark git commit: [SPARK-8874] [ML] Add missing methods in Word2Vec
Repository: spark Updated Branches: refs/heads/branch-1.5 73fab8849 - acda9d954 [SPARK-8874] [ML] Add missing methods in Word2Vec Add missing methods 1. getVectors 2. findSynonyms to W2Vec scala and python API mengxr Author: MechCoder manojkumarsivaraj...@gmail.com Closes #7263 from MechCoder/missing_methods_w2vec and squashes the following commits: 149d5ca [MechCoder] minor doc 69d91b7 [MechCoder] [SPARK-8874] [ML] Add missing methods in Word2Vec (cherry picked from commit 13675c742a71cbdc8324701c3694775ce1dd5c62) Signed-off-by: Joseph K. Bradley jos...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/acda9d95 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/acda9d95 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/acda9d95 Branch: refs/heads/branch-1.5 Commit: acda9d9546fa3f54676e48d76a2b66016d204074 Parents: 73fab88 Author: MechCoder manojkumarsivaraj...@gmail.com Authored: Mon Aug 3 16:44:25 2015 -0700 Committer: Joseph K. Bradley jos...@databricks.com Committed: Mon Aug 3 16:46:00 2015 -0700 -- .../org/apache/spark/ml/feature/Word2Vec.scala | 38 +++- .../apache/spark/ml/feature/Word2VecSuite.scala | 62 2 files changed, 99 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/acda9d95/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala index 6ea6590..b4f46ce 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala @@ -18,15 +18,17 @@ package org.apache.spark.ml.feature import org.apache.spark.annotation.Experimental +import org.apache.spark.SparkContext import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util.{Identifiable, SchemaUtils} import org.apache.spark.mllib.feature -import org.apache.spark.mllib.linalg.{VectorUDT, Vectors} +import org.apache.spark.mllib.linalg.{VectorUDT, Vector, Vectors} import org.apache.spark.mllib.linalg.BLAS._ import org.apache.spark.sql.DataFrame import org.apache.spark.sql.functions._ +import org.apache.spark.sql.SQLContext import org.apache.spark.sql.types._ /** @@ -146,6 +148,40 @@ class Word2VecModel private[ml] ( wordVectors: feature.Word2VecModel) extends Model[Word2VecModel] with Word2VecBase { + + /** + * Returns a dataframe with two fields, word and vector, with word being a String and + * the vector the DenseVector that it is mapped to. + */ + val getVectors: DataFrame = { +val sc = SparkContext.getOrCreate() +val sqlContext = SQLContext.getOrCreate(sc) +import sqlContext.implicits._ +val wordVec = wordVectors.getVectors.mapValues(vec => Vectors.dense(vec.map(_.toDouble))) +sc.parallelize(wordVec.toSeq).toDF("word", "vector") + } + + /** + * Find num number of words closest in similarity to the given word. + * Returns a dataframe with the words and the cosine similarities between the + * synonyms and the given word. + */ + def findSynonyms(word: String, num: Int): DataFrame = { +findSynonyms(wordVectors.transform(word), num) + } + + /** + * Find num number of words closest in similarity to the given vector representation + * of the word. Returns a dataframe with the words and the cosine similarities between the + * synonyms and the given word vector.
+ */ + def findSynonyms(word: Vector, num: Int): DataFrame = { +val sc = SparkContext.getOrCreate() +val sqlContext = SQLContext.getOrCreate(sc) +import sqlContext.implicits._ +sc.parallelize(wordVectors.findSynonyms(word, num)).toDF("word", "similarity") + } + /** @group setParam */ def setInputCol(value: String): this.type = set(inputCol, value) http://git-wip-us.apache.org/repos/asf/spark/blob/acda9d95/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala index aa6ce53..adcda0e 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/Word2VecSuite.scala @@ -67,5 +67,67 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext { assert(vector1 ~== vector2 absTol 1E-5, "Transformed vector is different with expected.") } } + + test("getVectors") { + +val sqlContext = new SQLContext(sc) +import sqlContext.implicits._ + +val sentence = "a b " * 100 + "a c " * 10 +val doc
spark git commit: [SPARK-9521] [DOCS] Addendum. Require Maven 3.3.3+ in the build
Repository: spark Updated Branches: refs/heads/branch-1.5 ebe42b98c - 1f7dbcd6f [SPARK-9521] [DOCS] Addendum. Require Maven 3.3.3+ in the build Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too Author: Sean Owen so...@cloudera.com Closes #7905 from srowen/SPARK-9521.2 and squashes the following commits: 73285df [Sean Owen] Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too (cherry picked from commit 0afa6fbf525723e97c6dacfdba3ad1762637ffa9) Signed-off-by: Sean Owen so...@cloudera.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1f7dbcd6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1f7dbcd6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1f7dbcd6 Branch: refs/heads/branch-1.5 Commit: 1f7dbcd6fdeee22c7b670ea98dcb4e794f84a8cd Parents: ebe42b9 Author: Sean Owen so...@cloudera.com Authored: Tue Aug 4 13:48:22 2015 +0900 Committer: Sean Owen so...@cloudera.com Committed: Tue Aug 4 06:56:06 2015 +0100 -- docs/building-spark.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1f7dbcd6/docs/building-spark.md -- diff --git a/docs/building-spark.md b/docs/building-spark.md index a5da3b3..f133eb9 100644 --- a/docs/building-spark.md +++ b/docs/building-spark.md @@ -7,7 +7,8 @@ redirect_from: building-with-maven.html * This will become a table of contents (this text will be scraped). {:toc} -Building Spark using Maven requires Maven 3.0.4 or newer and Java 7+. +Building Spark using Maven requires Maven 3.3.3 or newer and Java 7+. +The Spark build can supply a suitable Maven binary; see below. # Building with `build/mvn` - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9521] [DOCS] Addendum. Require Maven 3.3.3+ in the build
Repository: spark Updated Branches: refs/heads/master 5eb89f67e - 0afa6fbf5 [SPARK-9521] [DOCS] Addendum. Require Maven 3.3.3+ in the build Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too Author: Sean Owen so...@cloudera.com Closes #7905 from srowen/SPARK-9521.2 and squashes the following commits: 73285df [Sean Owen] Follow on for #7852: Building Spark doc needs to refer to new Maven requirement too Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0afa6fbf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0afa6fbf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0afa6fbf Branch: refs/heads/master Commit: 0afa6fbf525723e97c6dacfdba3ad1762637ffa9 Parents: 5eb89f6 Author: Sean Owen so...@cloudera.com Authored: Tue Aug 4 13:48:22 2015 +0900 Committer: Kousuke Saruta saru...@oss.nttdata.co.jp Committed: Tue Aug 4 13:48:22 2015 +0900 -- docs/building-spark.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0afa6fbf/docs/building-spark.md -- diff --git a/docs/building-spark.md b/docs/building-spark.md index a5da3b3..f133eb9 100644 --- a/docs/building-spark.md +++ b/docs/building-spark.md @@ -7,7 +7,8 @@ redirect_from: building-with-maven.html * This will become a table of contents (this text will be scraped). {:toc} -Building Spark using Maven requires Maven 3.0.4 or newer and Java 7+. +Building Spark using Maven requires Maven 3.3.3 or newer and Java 7+. +The Spark build can supply a suitable Maven binary; see below. # Building with `build/mvn` - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[1/2] spark git commit: [SPARK-8735] [SQL] Expose memory usage for shuffles, joins and aggregations
Repository: spark Updated Branches: refs/heads/branch-1.5 e7329ab31 - 29756ff11 http://git-wip-us.apache.org/repos/asf/spark/blob/29756ff1/core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala index 9201d1e..450ab7b 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala @@ -57,8 +57,9 @@ class TaskContextSuite extends SparkFunSuite with BeforeAndAfter with LocalSpark } val closureSerializer = SparkEnv.get.closureSerializer.newInstance() val func = (c: TaskContext, i: Iterator[String]) => i.next() -val task = new ResultTask[String, String](0, 0, - sc.broadcast(closureSerializer.serialize((rdd, func)).array), rdd.partitions(0), Seq(), 0) +val taskBinary = sc.broadcast(closureSerializer.serialize((rdd, func)).array) +val task = new ResultTask[String, String]( + 0, 0, taskBinary, rdd.partitions(0), Seq.empty, 0, Seq.empty) intercept[RuntimeException] { task.run(0, 0, null) } @@ -66,7 +67,7 @@ } test("all TaskCompletionListeners should be called even if some fail") { -val context = new TaskContextImpl(0, 0, 0, 0, null, null) +val context = TaskContext.empty() val listener = mock(classOf[TaskCompletionListener]) context.addTaskCompletionListener(_ => throw new Exception("blah")) context.addTaskCompletionListener(listener) http://git-wip-us.apache.org/repos/asf/spark/blob/29756ff1/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala index 3abb99c..f7cc4bb 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala @@ -136,7 +136,7 @@ class FakeTaskScheduler(sc: SparkContext, liveExecutors: (String, String)* /* ex /** * A Task implementation that results in a large serialized task.
*/ -class LargeTask(stageId: Int) extends Task[Array[Byte]](stageId, 0, 0) { +class LargeTask(stageId: Int) extends Task[Array[Byte]](stageId, 0, 0, Seq.empty) { val randomBuffer = new Array[Byte](TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024) val random = new Random(0) random.nextBytes(randomBuffer) http://git-wip-us.apache.org/repos/asf/spark/blob/29756ff1/core/src/test/scala/org/apache/spark/shuffle/hash/HashShuffleReaderSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/shuffle/hash/HashShuffleReaderSuite.scala b/core/src/test/scala/org/apache/spark/shuffle/hash/HashShuffleReaderSuite.scala index db718ec..05b3afe 100644 --- a/core/src/test/scala/org/apache/spark/shuffle/hash/HashShuffleReaderSuite.scala +++ b/core/src/test/scala/org/apache/spark/shuffle/hash/HashShuffleReaderSuite.scala @@ -138,7 +138,7 @@ class HashShuffleReaderSuite extends SparkFunSuite with LocalSparkContext { shuffleHandle, reduceId, reduceId + 1, - new TaskContextImpl(0, 0, 0, 0, null, null), + TaskContext.empty(), blockManager, mapOutputTracker) http://git-wip-us.apache.org/repos/asf/spark/blob/29756ff1/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala b/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala index cf8bd8a..828153b 100644 --- a/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala +++ b/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala @@ -29,7 +29,7 @@ import org.mockito.invocation.InvocationOnMock import org.mockito.stubbing.Answer import org.scalatest.PrivateMethodTester -import org.apache.spark.{SparkFunSuite, TaskContextImpl} +import org.apache.spark.{SparkFunSuite, TaskContext} import org.apache.spark.network._ import org.apache.spark.network.buffer.ManagedBuffer import org.apache.spark.network.shuffle.BlockFetchingListener @@ -95,7 +95,7 @@ class ShuffleBlockFetcherIteratorSuite extends SparkFunSuite with PrivateMethodT ) val iterator = new ShuffleBlockFetcherIterator( - new TaskContextImpl(0, 0, 0, 0, null, null), + TaskContext.empty(), transfer, blockManager, blocksByAddress, @@
[1/2] spark git commit: [SPARK-8735] [SQL] Expose memory usage for shuffles, joins and aggregations
Repository: spark Updated Branches: refs/heads/master e4765a468 - 702aa9d7f http://git-wip-us.apache.org/repos/asf/spark/blob/702aa9d7/core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala index 9201d1e..450ab7b 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/TaskContextSuite.scala @@ -57,8 +57,9 @@ class TaskContextSuite extends SparkFunSuite with BeforeAndAfter with LocalSpark } val closureSerializer = SparkEnv.get.closureSerializer.newInstance() val func = (c: TaskContext, i: Iterator[String]) => i.next() -val task = new ResultTask[String, String](0, 0, - sc.broadcast(closureSerializer.serialize((rdd, func)).array), rdd.partitions(0), Seq(), 0) +val taskBinary = sc.broadcast(closureSerializer.serialize((rdd, func)).array) +val task = new ResultTask[String, String]( + 0, 0, taskBinary, rdd.partitions(0), Seq.empty, 0, Seq.empty) intercept[RuntimeException] { task.run(0, 0, null) } @@ -66,7 +67,7 @@ } test("all TaskCompletionListeners should be called even if some fail") { -val context = new TaskContextImpl(0, 0, 0, 0, null, null) +val context = TaskContext.empty() val listener = mock(classOf[TaskCompletionListener]) context.addTaskCompletionListener(_ => throw new Exception("blah")) context.addTaskCompletionListener(listener) http://git-wip-us.apache.org/repos/asf/spark/blob/702aa9d7/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala index 3abb99c..f7cc4bb 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala @@ -136,7 +136,7 @@ class FakeTaskScheduler(sc: SparkContext, liveExecutors: (String, String)* /* ex /** * A Task implementation that results in a large serialized task.
*/ -class LargeTask(stageId: Int) extends Task[Array[Byte]](stageId, 0, 0) { +class LargeTask(stageId: Int) extends Task[Array[Byte]](stageId, 0, 0, Seq.empty) { val randomBuffer = new Array[Byte](TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024) val random = new Random(0) random.nextBytes(randomBuffer) http://git-wip-us.apache.org/repos/asf/spark/blob/702aa9d7/core/src/test/scala/org/apache/spark/shuffle/hash/HashShuffleReaderSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/shuffle/hash/HashShuffleReaderSuite.scala b/core/src/test/scala/org/apache/spark/shuffle/hash/HashShuffleReaderSuite.scala index db718ec..05b3afe 100644 --- a/core/src/test/scala/org/apache/spark/shuffle/hash/HashShuffleReaderSuite.scala +++ b/core/src/test/scala/org/apache/spark/shuffle/hash/HashShuffleReaderSuite.scala @@ -138,7 +138,7 @@ class HashShuffleReaderSuite extends SparkFunSuite with LocalSparkContext { shuffleHandle, reduceId, reduceId + 1, - new TaskContextImpl(0, 0, 0, 0, null, null), + TaskContext.empty(), blockManager, mapOutputTracker) http://git-wip-us.apache.org/repos/asf/spark/blob/702aa9d7/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala b/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala index cf8bd8a..828153b 100644 --- a/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala +++ b/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala @@ -29,7 +29,7 @@ import org.mockito.invocation.InvocationOnMock import org.mockito.stubbing.Answer import org.scalatest.PrivateMethodTester -import org.apache.spark.{SparkFunSuite, TaskContextImpl} +import org.apache.spark.{SparkFunSuite, TaskContext} import org.apache.spark.network._ import org.apache.spark.network.buffer.ManagedBuffer import org.apache.spark.network.shuffle.BlockFetchingListener @@ -95,7 +95,7 @@ class ShuffleBlockFetcherIteratorSuite extends SparkFunSuite with PrivateMethodT ) val iterator = new ShuffleBlockFetcherIterator( - new TaskContextImpl(0, 0, 0, 0, null, null), + TaskContext.empty(), transfer, blockManager, blocksByAddress, @@ -165,7
[2/2] spark git commit: [SPARK-8735] [SQL] Expose memory usage for shuffles, joins and aggregations
[SPARK-8735] [SQL] Expose memory usage for shuffles, joins and aggregations This patch exposes the memory used by internal data structures on the SparkUI. This tracks memory used by all spilling operations and SQL operators backed by Tungsten, e.g. `BroadcastHashJoin`, `ExternalSort`, `GeneratedAggregate` etc. The metric exposed is peak execution memory, which broadly refers to the peak in-memory sizes of each of these data structures. A separate patch will extend this by linking the new information to the SQL operators themselves. [two SparkUI screenshots from the PR omitted] Author: Andrew Or and...@databricks.com Closes #7770 from andrewor14/expose-memory-metrics and squashes the following commits: 9abecb9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics f5b0d68 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics d7df332 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 8eefbc5 [Andrew Or] Fix non-failing tests 9de2a12 [Andrew Or] Fix tests due to another logical merge conflict 876bfa4 [Andrew Or] Fix failing test after logical merge conflict 361a359 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 40b4802 [Andrew Or] Fix style? d0fef87 [Andrew Or] Fix tests? b3b92f6 [Andrew Or] Address comments 0625d73 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics c00a197 [Andrew Or] Fix potential NPEs 10da1cd [Andrew Or] Fix compile 17f4c2d [Andrew Or] Fix compile? a87b4d0 [Andrew Or] Fix compile?
d70874d [Andrew Or] Fix test compile + address comments 2840b7d [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 6aa2f7a [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics b889a68 [Andrew Or] Minor changes: comments, spacing, style 663a303 [Andrew Or] UnsafeShuffleWriter: update peak memory before close d090a94 [Andrew Or] Fix style 2480d84 [Andrew Or] Expand test coverage 5f1235b [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 1ecf678 [Andrew Or] Minor changes: comments, style, unused imports 0b6926c [Andrew Or] Oops 111a05e [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics a7a39a5 [Andrew Or] Strengthen presence check for accumulator a919eb7 [Andrew Or] Add tests for unsafe shuffle writer 23c845d [Andrew Or] Add tests for SQL operators a757550 [Andrew Or] Address comments b5c51c1 [Andrew Or] Re-enable test in JavaAPISuite 5107691 [Andrew Or] Add tests for internal accumulators 59231e4 [Andrew Or] Fix tests 9528d09 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics 5b5e6f3 [Andrew Or] Add peak execution memory to summary table + tooltip 92b4b6b [Andrew Or] Display peak execution memory on the UI eee5437 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics d9b9015 [Andrew Or] Track execution memory in unsafe shuffles 770ee54 [Andrew Or] Track execution memory in broadcast joins 9c605a4 [Andrew Or] Track execution memory in GeneratedAggregate 9e824f2 [Andrew Or] Add back execution memory tracking for *ExternalSort 4ef4cb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into expose-memory-metrics e6c3e2f [Andrew Or] Move internal accumulators creation to Stage a417592 [Andrew Or] Expose memory metrics in UnsafeExternalSorter 3c4f042 [Andrew Or] Track memory usage in ExternalAppendOnlyMap / ExternalSorter bd7ab3f [Andrew Or] Add internal accumulators to TaskContext (cherry picked from commit 702aa9d7fb16c98a50e046edfd76b8a7861d0391) Signed-off-by: Josh Rosen joshro...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/29756ff1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/29756ff1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/29756ff1 Branch: refs/heads/branch-1.5 Commit: 29756ff11c7bea73436153f37af631cbe5e58250 Parents: e7329ab Author: Andrew Or and...@databricks.com Authored: Mon Aug 3 14:22:07 2015 -0700 Committer: Josh Rosen joshro...@databricks.com Committed: Mon Aug 3 14:22:25 2015 -0700 -- .../unsafe/UnsafeShuffleExternalSorter.java | 27 ++- .../shuffle/unsafe/UnsafeShuffleWriter.java | 38 +++- .../spark/unsafe/map/BytesToBytesMap.java
spark git commit: [SPARK-8873] [MESOS] Clean up shuffle files if external shuffle service is used
Repository: spark Updated Branches: refs/heads/master 1ebd41b14 - 95dccc633 [SPARK-8873] [MESOS] Clean up shuffle files if external shuffle service is used This patch builds directly on #7820, which is largely written by tnachen. The only addition is one commit for cleaning up the code. There should be no functional differences between this and #7820. Author: Timothy Chen tnac...@gmail.com Author: Andrew Or and...@databricks.com Closes #7881 from andrewor14/tim-cleanup-mesos-shuffle and squashes the following commits: 8894f7d [Andrew Or] Clean up code 2a5fa10 [Andrew Or] Merge branch 'mesos_shuffle_clean' of github.com:tnachen/spark into tim-cleanup-mesos-shuffle fadff89 [Timothy Chen] Address comments. e4d0f1d [Timothy Chen] Clean up external shuffle data on driver exit with Mesos. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/95dccc63 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/95dccc63 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/95dccc63 Branch: refs/heads/master Commit: 95dccc63350c45045f038bab9f8a5080b4e1f8cc Parents: 1ebd41b Author: Timothy Chen tnac...@gmail.com Authored: Mon Aug 3 01:55:58 2015 -0700 Committer: Andrew Or and...@databricks.com Committed: Mon Aug 3 01:55:58 2015 -0700 -- .../scala/org/apache/spark/SparkContext.scala | 2 +- .../spark/deploy/ExternalShuffleService.scala | 17 ++- .../mesos/MesosExternalShuffleService.scala | 107 +++ .../org/apache/spark/rpc/RpcEndpoint.scala | 6 +- .../mesos/CoarseMesosSchedulerBackend.scala | 52 - .../CoarseMesosSchedulerBackendSuite.scala | 5 +- .../launcher/SparkClassCommandBuilder.java | 3 +- .../spark/network/client/TransportClient.java | 5 + .../shuffle/ExternalShuffleBlockHandler.java| 6 ++ .../network/shuffle/ExternalShuffleClient.java | 12 ++- .../mesos/MesosExternalShuffleClient.java | 72 + .../shuffle/protocol/BlockTransferMessage.java | 4 +- .../shuffle/protocol/mesos/RegisterDriver.java | 60 +++ sbin/start-mesos-shuffle-service.sh | 35 ++ sbin/stop-mesos-shuffle-service.sh | 25 + 15 files changed, 394 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/95dccc63/core/src/main/scala/org/apache/spark/SparkContext.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkContext.scala b/core/src/main/scala/org/apache/spark/SparkContext.scala index a1c66ef..6f336a7 100644 --- a/core/src/main/scala/org/apache/spark/SparkContext.scala +++ b/core/src/main/scala/org/apache/spark/SparkContext.scala @@ -2658,7 +2658,7 @@ object SparkContext extends Logging { val coarseGrained = sc.conf.getBoolean("spark.mesos.coarse", false) val url = mesosUrl.stripPrefix("mesos://") // strip scheme from raw Mesos URLs val backend = if (coarseGrained) { - new CoarseMesosSchedulerBackend(scheduler, sc, url) + new CoarseMesosSchedulerBackend(scheduler, sc, url, sc.env.securityManager) } else { new MesosSchedulerBackend(scheduler, sc, url) } http://git-wip-us.apache.org/repos/asf/spark/blob/95dccc63/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala -- diff --git a/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala b/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala index 4089c3e..20a9faa 100644 --- a/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala +++ b/core/src/main/scala/org/apache/spark/deploy/ExternalShuffleService.scala @@ -27,6 +27,7 @@ import org.apache.spark.network.netty.SparkTransportConf import
org.apache.spark.network.sasl.SaslServerBootstrap import org.apache.spark.network.server.TransportServer import org.apache.spark.network.shuffle.ExternalShuffleBlockHandler +import org.apache.spark.network.util.TransportConf import org.apache.spark.util.Utils /** @@ -45,11 +46,16 @@ class ExternalShuffleService(sparkConf: SparkConf, securityManager: SecurityMana private val useSasl: Boolean = securityManager.isAuthenticationEnabled() private val transportConf = SparkTransportConf.fromSparkConf(sparkConf, numUsableCores = 0) - private val blockHandler = new ExternalShuffleBlockHandler(transportConf) + private val blockHandler = newShuffleBlockHandler(transportConf) private val transportContext: TransportContext = new TransportContext(transportConf, blockHandler) private var server: TransportServer = _ + /** Create a new shuffle block handler. Factored out for subclasses to override. */ + protected def
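To see how a deployment would tie these pieces together: the commit adds sbin/start-mesos-shuffle-service.sh and sbin/stop-mesos-shuffle-service.sh for the agents, and the driver side is plain SparkConf wiring. A minimal sketch, assuming a made-up master URL and the standard external-shuffle flag (neither taken from this diff):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical driver-side setup; only "spark.mesos.coarse" appears in the
    // diff above, the remaining keys and values are illustrative assumptions.
    val conf = new SparkConf()
      .setAppName("shuffle-service-demo")
      .setMaster("mesos://mesos-master:5050")          // made-up Mesos master URL
      .set("spark.mesos.coarse", "true")               // coarse-grained backend, as in the diff
      .set("spark.shuffle.service.enabled", "true")    // executors serve shuffle data externally
    val sc = new SparkContext(conf)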
spark git commit: [SPARK-9404][SPARK-9542][SQL] unsafe array data and map data
Repository: spark Updated Branches: refs/heads/master 687c8c371 - 608353c8e [SPARK-9404][SPARK-9542][SQL] unsafe array data and map data This PR adds a UnsafeArrayData; currently we encode it in this way: the first 4 bytes are the # of elements, then 4 bytes per element give the start offset of the element, unless the offset is negative, in which case the element is null, followed by the elements themselves. an example: [10, 11, 12, 13, null, 14] will be encoded as: 6, 28, 32, 36, 40, -44, 44, 10, 11, 12, 13, 14 Note that, when we read a UnsafeArrayData from bytes, we can read the first 4 bytes as numElements and take the rest (first 4 bytes skipped) as the value region. unsafe map data just uses 2 unsafe array datas: the first 4 bytes are the # of elements, the second 4 bytes are the numBytes of the key array, then follow the key array data and the value array data. Author: Wenchen Fan cloud0...@outlook.com Closes #7752 from cloud-fan/unsafe-array and squashes the following commits: 3269bd7 [Wenchen Fan] fix a bug 6445289 [Wenchen Fan] add unit tests 49adf26 [Wenchen Fan] add unsafe map 20d1039 [Wenchen Fan] add comments and unsafe converter 821b8db [Wenchen Fan] add unsafe array Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/608353c8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/608353c8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/608353c8 Branch: refs/heads/master Commit: 608353c8e8e50461fafff91a2c885dca8af3aaa8 Parents: 687c8c3 Author: Wenchen Fan cloud0...@outlook.com Authored: Sun Aug 2 23:41:16 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Sun Aug 2 23:41:16 2015 -0700 -- .../catalyst/expressions/UnsafeArrayData.java | 333 +++ .../sql/catalyst/expressions/UnsafeMapData.java | 66  .../sql/catalyst/expressions/UnsafeReaders.java | 48 +++ .../sql/catalyst/expressions/UnsafeRow.java | 34 +- .../catalyst/expressions/UnsafeRowWriters.java | 71  .../sql/catalyst/expressions/UnsafeWriters.java | 208  .../sql/catalyst/expressions/FromUnsafe.scala | 67  .../sql/catalyst/expressions/Projection.scala | 10 +- .../expressions/codegen/CodeGenerator.scala | 4 +- .../codegen/GenerateUnsafeProjection.scala | 327 +- .../spark/sql/types/ArrayBasedMapData.scala | 15 +- .../org/apache/spark/sql/types/ArrayData.scala | 14 +- .../spark/sql/types/GenericArrayData.scala | 10 +- .../org/apache/spark/sql/types/MapData.scala| 2 + .../expressions/UnsafeRowConverterSuite.scala | 114 ++- .../apache/spark/unsafe/types/UTF8String.java | 3 + 16 files changed, 1295 insertions(+), 31 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/608353c8/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java -- diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java new file mode 100644 index 000..0374846 --- /dev/null +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java @@ -0,0 +1,333 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License.
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions; + +import java.math.BigDecimal; +import java.math.BigInteger; + +import org.apache.spark.sql.catalyst.InternalRow; +import org.apache.spark.sql.types.*; +import org.apache.spark.unsafe.PlatformDependent; +import org.apache.spark.unsafe.array.ByteArrayMethods; +import org.apache.spark.unsafe.hash.Murmur3_x86_32; +import org.apache.spark.unsafe.types.CalendarInterval; +import org.apache.spark.unsafe.types.UTF8String; + +/** + * An Unsafe implementation of Array which is backed by raw memory instead of Java objects. + * + * Each tuple has two parts: [offsets] [values] + * + * In the `offsets` region, we store 4 bytes per element, represents the start address of this + * element in
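To make the offset arithmetic in this commit concrete, here is a throwaway Scala model of the layout (not the real UnsafeArrayData writer), assuming 4-byte int elements:

    // Toy model of the byte layout only; the real class writes into raw memory.
    def layout(elements: Seq[Option[Int]]): Seq[Int] = {
      val n = elements.length
      var cursor = 4 + 4 * n                 // numElements word plus one offset word per element
      val offsets = elements.map {
        case Some(_) => val off = cursor; cursor += 4; off   // element occupies 4 bytes here
        case None    => -cursor                              // negative offset marks a null
      }
      n +: (offsets ++ elements.flatten)
    }

    layout(Seq(Some(10), Some(11), Some(12), Some(13), None, Some(14)))
    // => Seq(6, 28, 32, 36, 40, -44, 44, 10, 11, 12, 13, 14), matching the example above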
spark git commit: [SPARK-9549][SQL] fix bugs in expressions
Repository: spark Updated Branches: refs/heads/master 608353c8e - 98d6d9c7a [SPARK-9549][SQL] fix bugs in expressions JIRA: https://issues.apache.org/jira/browse/SPARK-9549 This PR fixes the following bugs: 1. `UnaryMinus`'s codegen version would fail to compile when the input is `Long.MinValue` 2. `BinaryComparison` would fail to compile in codegen mode when comparing Boolean types. 3. `AddMonth` would fail if passed a huge negative month, which would lead to accessing a negative index of the `monthDays` array. 4. `NaNvl` with different type operands. Author: Yijie Shen henry.yijies...@gmail.com Closes #7882 from yjshen/minor_bug_fix and squashes the following commits: 41bbd2c [Yijie Shen] fix bug in Nanvl type coercion 3dee204 [Yijie Shen] address comments 4fa5de0 [Yijie Shen] fix bugs in expressions Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/98d6d9c7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/98d6d9c7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/98d6d9c7 Branch: refs/heads/master Commit: 98d6d9c7a996f5456eb2653bb96985a1a05f4ce1 Parents: 608353c Author: Yijie Shen henry.yijies...@gmail.com Authored: Mon Aug 3 00:15:24 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 00:15:24 2015 -0700 -- .../catalyst/analysis/HiveTypeCoercion.scala| 5 ++ .../sql/catalyst/expressions/arithmetic.scala | 9 ++- .../sql/catalyst/expressions/predicates.scala | 1 + .../spark/sql/catalyst/util/DateTimeUtils.scala | 7 ++- .../analysis/HiveTypeCoercionSuite.scala| 12  .../expressions/ArithmeticExpressionSuite.scala | 6 +- .../expressions/DateExpressionsSuite.scala | 2 + .../catalyst/expressions/PredicateSuite.scala | 62 ++-- .../spark/sql/ColumnExpressionSuite.scala | 18 +++--- 9 files changed, 79 insertions(+), 43 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/98d6d9c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala index 603afc4..422d423 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala @@ -562,6 +562,11 @@ object HiveTypeCoercion { case Some(finalDataType) => Coalesce(es.map(Cast(_, finalDataType))) case None => c } + + case NaNvl(l, r) if l.dataType == DoubleType && r.dataType == FloatType => +NaNvl(l, Cast(r, DoubleType)) + case NaNvl(l, r) if l.dataType == FloatType && r.dataType == DoubleType => +NaNvl(Cast(l, DoubleType), r) } } http://git-wip-us.apache.org/repos/asf/spark/blob/98d6d9c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala index 6f8f4dd..0891b55 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala @@ -36,7 +36,14 @@ case class UnaryMinus(child: Expression) extends UnaryExpression with ExpectsInp override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = dataType match { case dt: DecimalType =>
defineCodeGen(ctx, ev, c => s"$c.unary_$$minus()") -case dt: NumericType => defineCodeGen(ctx, ev, c => s"(${ctx.javaType(dt)})(-($c))") +case dt: NumericType => nullSafeCodeGen(ctx, ev, eval => { + val originValue = ctx.freshName("origin") + // codegen would fail to compile if we just write (-($c)) + // for example, we could not write --9223372036854775808L in code + s""" +${ctx.javaType(dt)} $originValue = (${ctx.javaType(dt)})($eval); +${ev.primitive} = (${ctx.javaType(dt)})(-($originValue)); + """ + }) case dt: CalendarIntervalType => defineCodeGen(ctx, ev, c => s"$c.negate()") } http://git-wip-us.apache.org/repos/asf/spark/blob/98d6d9c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala
spark git commit: [SPARK-9372] [SQL] Filter nulls in join keys
Repository: spark Updated Branches: refs/heads/master 4cdd8ecd6 - 687c8c371 [SPARK-9372] [SQL] Filter nulls in join keys This PR adds an optimization rule, `FilterNullsInJoinKey`, to add `Filter` before join operators to filter out rows having null values for join keys. This optimization is guarded by a new SQL conf, `spark.sql.advancedOptimization`. The code in this PR was authored by yhuai; I'm opening this PR to factor out this change from #7685, a larger pull request which contains two other optimizations. Author: Yin Huai yh...@databricks.com Author: Josh Rosen joshro...@databricks.com Closes #7768 from JoshRosen/filter-nulls-in-join-key and squashes the following commits: c02fc3f [Yin Huai] Address Josh's comments. 0a8e096 [Yin Huai] Update comments. ea7d5a6 [Yin Huai] Make sure we do not keep adding filters. be88760 [Yin Huai] Make it clear that FilterNullsInJoinKeySuite.scala is used to test FilterNullsInJoinKey. 8bb39ad [Yin Huai] Fix non-deterministic tests. 303236b [Josh Rosen] Revert changes that are unrelated to null join key filtering 40eeece [Josh Rosen] Merge remote-tracking branch 'origin/master' into filter-nulls-in-join-key c57a954 [Yin Huai] Bug fix. d3d2e64 [Yin Huai] First round of cleanup. f9516b0 [Yin Huai] Style c6667e7 [Yin Huai] Add PartitioningCollection. e616d3b [Yin Huai] wip 7c2d2d8 [Yin Huai] Bug fix and refactoring. 69bb072 [Yin Huai] Introduce NullSafeHashPartitioning and NullUnsafePartitioning. d5b84c3 [Yin Huai] Do not add unnessary filters. 2201129 [Yin Huai] Filter out rows that will not be joined in equal joins early. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/687c8c37 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/687c8c37 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/687c8c37 Branch: refs/heads/master Commit: 687c8c37150f4c93f8e57d86bb56321a4891286b Parents: 4cdd8ec Author: Yin Huai yh...@databricks.com Authored: Sun Aug 2 23:32:09 2015 -0700 Committer: Josh Rosen joshro...@databricks.com Committed: Sun Aug 2 23:32:09 2015 -0700 -- .../catalyst/expressions/nullFunctions.scala| 48 +++- .../sql/catalyst/optimizer/Optimizer.scala | 64 +++-- .../catalyst/plans/logical/basicOperators.scala | 32 ++- .../expressions/ExpressionEvalHelper.scala | 4 +- .../expressions/MathFunctionsSuite.scala| 3 +- .../expressions/NullFunctionsSuite.scala| 49 +++- .../apache/spark/sql/DataFrameNaFunctions.scala | 2 +- .../scala/org/apache/spark/sql/SQLConf.scala| 6 + .../scala/org/apache/spark/sql/SQLContext.scala | 5 +- .../extendedOperatorOptimizations.scala | 160 + .../optimizer/FilterNullsInJoinKeySuite.scala | 236 +++ 11 files changed, 572 insertions(+), 37 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/687c8c37/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala index 287718f..d58c475 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/nullFunctions.scala @@ -210,14 +210,58 @@ case class IsNotNull(child: Expression) extends UnaryExpression with Predicate { } } +/** + * A predicate that is evaluated to be true if there are at least `n` null values. 
+ */ +case class AtLeastNNulls(n: Int, children: Seq[Expression]) extends Predicate { + override def nullable: Boolean = false + override def foldable: Boolean = children.forall(_.foldable) + override def toString: String = s"AtLeastNNulls($n, ${children.mkString(",")})" + + private[this] val childrenArray = children.toArray + + override def eval(input: InternalRow): Boolean = { +var numNulls = 0 +var i = 0 +while (i < childrenArray.length && numNulls < n) { + val evalC = childrenArray(i).eval(input) + if (evalC == null) { +numNulls += 1 + } + i += 1 +} +numNulls >= n + } + + override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = { +val numNulls = ctx.freshName("numNulls") +val code = children.map { e => + val eval = e.gen(ctx) + s""" +if ($numNulls < $n) { + ${eval.code} + if (${eval.isNull}) { +$numNulls += 1; + } +} + """ +}.mkString("\n") +s""" + int $numNulls = 0; + $code + boolean ${ev.isNull} = false; + boolean ${ev.primitive} = $numNulls >= $n; + """ + } +} /** * A
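A plain-Scala paraphrase of the predicate's semantics, handy for sanity-checking the loop above; the helper below is hypothetical and only mimics eval():

    // Row values are simulated with a Seq; null counting mirrors eval() above.
    def atLeastNNulls(n: Int, row: Seq[Any]): Boolean = row.count(_ == null) >= n

    atLeastNNulls(1, Seq(1, null, 3))   // true:  a row like this would be dropped before the join
    atLeastNNulls(1, Seq(1, 2, 3))      // false: this row survives the inserted Filter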
[2/2] spark git commit: [SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row
[SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row This PR adds a base aggregation iterator `AggregationIterator`, which is used to create `SortBasedAggregationIterator` (for sort-based aggregation) and `UnsafeHybridAggregationIterator` (first it tries hash-based aggregation and falls back to the sort-based aggregation (using external sorter) if we cannot allocate memory for the map). With these two iterators, we will not need existing iterators and I am removing those. Also, we can use a single physical `Aggregate` operator and it internally determines what iterators to use. https://issues.apache.org/jira/browse/SPARK-9240 Author: Yin Huai yh...@databricks.com Closes #7813 from yhuai/AggregateOperator and squashes the following commits: e317e2b [Yin Huai] Remove unnecessary change. 74d93c5 [Yin Huai] Merge remote-tracking branch 'upstream/master' into AggregateOperator ba6afbc [Yin Huai] Add a little bit more comments. c9cf3b6 [Yin Huai] update 0f1b06f [Yin Huai] Remove unnecessary code. 21fd15f [Yin Huai] Remove unnecessary change. 964f88b [Yin Huai] Implement fallback strategy. b1ea5cf [Yin Huai] wip 7fcbd87 [Yin Huai] Add a flag to control what iterator to use. 533d5b2 [Yin Huai] Prepare for fallback! 33b7022 [Yin Huai] wip bd9282b [Yin Huai] UDAFs now supports UnsafeRow. f52ee53 [Yin Huai] wip 3171f44 [Yin Huai] wip d2c45a0 [Yin Huai] wip f60cc83 [Yin Huai] Also check input schema. af32210 [Yin Huai] Check iter.hasNext before we create an iterator because the constructor of the iterator will read at least one row from a non-empty input iter. 299008c [Yin Huai] First round cleanup. 3915bac [Yin Huai] Create a base iterator class for aggregation iterators and add the initial version of the hybrid iterator. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1ebd41b1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1ebd41b1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1ebd41b1 Branch: refs/heads/master Commit: 1ebd41b141a95ec264bd2dd50f0fe24cd459035d Parents: 98d6d9c Author: Yin Huai yh...@databricks.com Authored: Mon Aug 3 00:23:08 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 00:23:08 2015 -0700 -- .../expressions/aggregate/interfaces.scala | 19 +- .../sql/execution/aggregate/Aggregate.scala | 182 + .../aggregate/AggregationIterator.scala | 490 ++ .../SortBasedAggregationIterator.scala | 236 +++ .../UnsafeHybridAggregationIterator.scala | 398 +++ .../aggregate/aggregateOperators.scala | 175 - .../aggregate/sortBasedIterators.scala | 664 --- .../spark/sql/execution/aggregate/udaf.scala| 269 +++- .../spark/sql/execution/aggregate/utils.scala | 99 +-- .../spark/sql/execution/basicOperators.scala| 1 - .../org/apache/spark/sql/SQLQuerySuite.scala| 10 +- .../execution/SparkSqlSerializer2Suite.scala| 9 +- .../hive/execution/AggregationQuerySuite.scala | 118 ++-- 13 files changed, 1697 insertions(+), 973 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1ebd41b1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala index d08f553..4abfdfe 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala +++
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala @@ -110,7 +110,11 @@ abstract class AggregateFunction2 * buffer value of `avg(x)` will be 0 and the position of the first buffer value of `avg(y)` * will be 2. */ - var mutableBufferOffset: Int = 0 + protected var mutableBufferOffset: Int = 0 + + def withNewMutableBufferOffset(newMutableBufferOffset: Int): Unit = { +mutableBufferOffset = newMutableBufferOffset + } /** * The offset of this function's start buffer value in the @@ -126,7 +130,11 @@ abstract class AggregateFunction2 * buffer value of `avg(x)` will be 1 and the position of the first buffer value of `avg(y)` * will be 3 (position 0 is used for the value of `key`). */ - var inputBufferOffset: Int = 0 + protected var inputBufferOffset: Int = 0 + + def withNewInputBufferOffset(newInputBufferOffset: Int): Unit = { +inputBufferOffset = newInputBufferOffset + } /** The schema of the aggregation buffer. */ def bufferSchema: StructType @@ -195,11 +203,8 @@ abstract class AlgebraicAggregate extends
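The hash-then-sort fallback described in this commit message can be pictured with a self-contained toy version: build a bounded in-memory map first, then switch to sort-then-merge once a (simulated) budget is exceeded. This is an illustrative sketch under those assumptions, not the actual UnsafeHybridAggregationIterator:

    import scala.collection.mutable

    // maxMapSize stands in for the memory check the real operator performs;
    // merge is assumed commutative and associative, as for aggregate buffers.
    def hybridAggregate[K: Ordering, V](rows: Seq[(K, V)], maxMapSize: Int)
        (merge: (V, V) => V): Seq[(K, V)] = {
      val map = mutable.Map.empty[K, V]
      val fits = rows.forall { case (k, v) =>
        map(k) = map.get(k).map(merge(_, v)).getOrElse(v)
        map.size <= maxMapSize            // abort hash aggregation once over budget
      }
      if (fits) map.toSeq
      else rows.sortBy(_._1).foldLeft(List.empty[(K, V)]) {   // sort-based fallback:
        case ((k0, v0) :: tail, (k, v)) if k0 == k => (k0, merge(v0, v)) :: tail
        case (acc, kv) => kv :: acc                            // merge adjacent equal keys
      }.reverse
    }

    // hybridAggregate(Seq("a" -> 1, "b" -> 1, "a" -> 2), maxMapSize = 1)(_ + _)
    // => List((a,3), (b,1)), computed via the sort-based path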
[1/2] spark git commit: [SPARK-9240] [SQL] Hybrid aggregate operator using unsafe row
Repository: spark Updated Branches: refs/heads/master 98d6d9c7a - 1ebd41b14 http://git-wip-us.apache.org/repos/asf/spark/blob/1ebd41b1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/sortBasedIterators.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/sortBasedIterators.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/sortBasedIterators.scala deleted file mode 100644 index 2ca0cb8..000 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/sortBasedIterators.scala +++ /dev/null @@ -1,664 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - *http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.spark.sql.execution.aggregate - -import org.apache.spark.sql.catalyst.InternalRow -import org.apache.spark.sql.catalyst.expressions._ -import org.apache.spark.sql.catalyst.expressions.aggregate._ -import org.apache.spark.sql.types.NullType - -import scala.collection.mutable.ArrayBuffer - -/** - * An iterator used to evaluate aggregate functions. It assumes that input rows - * are already grouped by values of `groupingExpressions`. - */ -private[sql] abstract class SortAggregationIterator( -groupingExpressions: Seq[NamedExpression], -aggregateExpressions: Seq[AggregateExpression2], -newMutableProjection: (Seq[Expression], Seq[Attribute]) => (() => MutableProjection), -inputAttributes: Seq[Attribute], -inputIter: Iterator[InternalRow]) - extends Iterator[InternalRow] { - - /// - // Static fields for this iterator - /// - - protected val aggregateFunctions: Array[AggregateFunction2] = { -var mutableBufferOffset = 0 -var inputBufferOffset: Int = initialInputBufferOffset -val functions = new Array[AggregateFunction2](aggregateExpressions.length) -var i = 0 -while (i < aggregateExpressions.length) { - val func = aggregateExpressions(i).aggregateFunction - val funcWithBoundReferences = aggregateExpressions(i).mode match { -case Partial | Complete if !func.isInstanceOf[AlgebraicAggregate] => - // We need to create BoundReferences if the function is not an - // AlgebraicAggregate (it does not support code-gen) and the mode of - // this function is Partial or Complete because we will call eval of this - // function's children in the update method of this aggregate function. - // Those eval calls require BoundReferences to work. - BindReferences.bindReference(func, inputAttributes) -case _ => - // We only need to set inputBufferOffset for aggregate functions with mode - // PartialMerge and Final. - func.inputBufferOffset = inputBufferOffset - inputBufferOffset += func.bufferSchema.length - func - } - // Set mutableBufferOffset for this function.
It is important that setting - // mutableBufferOffset happens after all potential bindReference operations - // because bindReference will create a new instance of the function. - funcWithBoundReferences.mutableBufferOffset = mutableBufferOffset - mutableBufferOffset += funcWithBoundReferences.bufferSchema.length - functions(i) = funcWithBoundReferences - i += 1 -} -functions - } - - // Positions of those non-algebraic aggregate functions in aggregateFunctions. - // For example, we have func1, func2, func3, func4 in aggregateFunctions, and - // func2 and func3 are non-algebraic aggregate functions. - // nonAlgebraicAggregateFunctionPositions will be [1, 2]. - protected val nonAlgebraicAggregateFunctionPositions: Array[Int] = { -val positions = new ArrayBuffer[Int]() -var i = 0 -while (i < aggregateFunctions.length) { - aggregateFunctions(i) match { -case agg: AlgebraicAggregate => -case _ => positions += i - } - i += 1 -} -positions.toArray - } - - // All non-algebraic aggregate functions. - protected val nonAlgebraicAggregateFunctions: Array[AggregateFunction2] = -
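The position bookkeeping in the removed code boils down to a collect over an is-algebraic flag per function; a two-line paraphrase with made-up flags matching the func1..func4 example in the comment above:

    // func1..func4, with func2 and func3 non-algebraic, as in the comment.
    val isAlgebraic = Seq(true, false, false, true)
    val positions = isAlgebraic.zipWithIndex.collect { case (false, i) => i }
    // positions == Seq(1, 2)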
spark git commit: [SPARK-5133] [ML] Added featureImportance to RandomForestClassifier and Regressor
Repository: spark Updated Branches: refs/heads/master 703e44bff - ff9169a00 [SPARK-5133] [ML] Added featureImportance to RandomForestClassifier and Regressor Added featureImportance to RandomForestClassifier and Regressor. This follows the scikit-learn implementation here: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3341] CC: yanboliang Would you mind taking a look? Thanks! Author: Joseph K. Bradley jos...@databricks.com Author: Feynman Liang fli...@databricks.com Closes #7838 from jkbradley/dt-feature-importance and squashes the following commits: 72a167a [Joseph K. Bradley] fixed unit test 86cea5f [Joseph K. Bradley] Modified RF featuresImportances to return Vector instead of Map 5aa74f0 [Joseph K. Bradley] finally fixed unit test for real 33df5db [Joseph K. Bradley] fix unit test 42a2d3b [Joseph K. Bradley] fix unit test fe94e72 [Joseph K. Bradley] modified feature importance unit tests cc693ee [Feynman Liang] Add classifier tests 79a6f87 [Feynman Liang] Compare dense vectors in test 21d01fc [Feynman Liang] Added failing SKLearn test ac0b254 [Joseph K. Bradley] Added featureImportance to RandomForestClassifier/Regressor. Need to add unit tests Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ff9169a0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ff9169a0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ff9169a0 Branch: refs/heads/master Commit: ff9169a002f1b75231fd25b7d04157a912503038 Parents: 703e44b Author: Joseph K. Bradley jos...@databricks.com Authored: Mon Aug 3 12:17:46 2015 -0700 Committer: Joseph K. Bradley jos...@databricks.com Committed: Mon Aug 3 12:17:46 2015 -0700 -- .../classification/RandomForestClassifier.scala | 30 +- .../ml/regression/RandomForestRegressor.scala | 33 -- .../scala/org/apache/spark/ml/tree/Node.scala | 19 +++- .../spark/ml/tree/impl/RandomForest.scala | 92 .../org/apache/spark/ml/tree/treeModels.scala | 6 ++ .../JavaRandomForestClassifierSuite.java| 2 + .../JavaRandomForestRegressorSuite.java | 2 + .../RandomForestClassifierSuite.scala | 31 +- .../org/apache/spark/ml/impl/TreeTests.scala| 18 .../regression/RandomForestRegressorSuite.scala | 27 - .../spark/ml/tree/impl/RandomForestSuite.scala | 107 +++ 11 files changed, 351 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ff9169a0/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala index 56e80cc..b59826a 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala @@ -95,7 +95,8 @@ final class RandomForestClassifier(override val uid: String) val trees = RandomForest.run(oldDataset, strategy, getNumTrees, getFeatureSubsetStrategy, getSeed) .map(_.asInstanceOf[DecisionTreeClassificationModel]) -new RandomForestClassificationModel(trees, numClasses) +val numFeatures = oldDataset.first().features.size +new RandomForestClassificationModel(trees, numFeatures, numClasses) } override def copy(extra: ParamMap): RandomForestClassifier = defaultCopy(extra) @@ -118,11 +119,13 @@ object RandomForestClassifier { * features. * @param _trees Decision trees in the ensemble. 
* Warning: These have null parents. + * @param numFeatures Number of features used by this model */ @Experimental final class RandomForestClassificationModel private[ml] ( override val uid: String, private val _trees: Array[DecisionTreeClassificationModel], +val numFeatures: Int, override val numClasses: Int) extends ProbabilisticClassificationModel[Vector, RandomForestClassificationModel] with TreeEnsembleModel with Serializable { @@ -133,8 +136,8 @@ final class RandomForestClassificationModel private[ml] ( * Construct a random forest classification model, with all trees weighted equally. * @param trees Component trees */ - def this(trees: Array[DecisionTreeClassificationModel], numClasses: Int) = -this(Identifiable.randomUID("rfc"), trees, numClasses) + def this(trees: Array[DecisionTreeClassificationModel], numFeatures: Int, numClasses: Int) = +this(Identifiable.randomUID("rfc"), trees, numFeatures, numClasses) override def trees: Array[DecisionTreeModel] =
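Typical use of the new member would look like the sketch below; `training` is a hypothetical DataFrame with the usual label/features columns, and the setNumTrees value is arbitrary:

    import org.apache.spark.ml.classification.RandomForestClassifier

    val rf = new RandomForestClassifier().setNumTrees(20)
    val model = rf.fit(training)         // training: DataFrame("label", "features"), assumed
    println(model.featureImportances)    // a Vector of per-feature importances, normalized
                                         // following the scikit-learn approach cited above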
spark git commit: [SQL][minor] Simplify UnsafeRow.calculateBitSetWidthInBytes.
Repository: spark Updated Branches: refs/heads/branch-1.5 4de833e9e - 5452e93f0 [SQL][minor] Simplify UnsafeRow.calculateBitSetWidthInBytes. Author: Reynold Xin r...@databricks.com Closes #7897 from rxin/calculateBitSetWidthInBytes and squashes the following commits: 2e73b3a [Reynold Xin] [SQL][minor] Simplify UnsafeRow.calculateBitSetWidthInBytes. (cherry picked from commit 7a9d09f0bb472a1671d3457e1f7108f4c2eb4121) Signed-off-by: Reynold Xin r...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5452e93f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5452e93f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5452e93f Branch: refs/heads/branch-1.5 Commit: 5452e93f03bc308282cb8f189f65bb1b258d8813 Parents: 4de833e Author: Reynold Xin r...@databricks.com Authored: Mon Aug 3 11:22:02 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 11:22:10 2015 -0700 -- .../apache/spark/sql/catalyst/expressions/UnsafeRow.java | 2 +- .../test/scala/org/apache/spark/sql/UnsafeRowSuite.scala | 10 ++ 2 files changed, 11 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5452e93f/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java -- diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java index f4230cf..e6750fc 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java @@ -59,7 +59,7 @@ public final class UnsafeRow extends MutableRow { // public static int calculateBitSetWidthInBytes(int numFields) { -return ((numFields / 64) + (numFields % 64 == 0 ? 0 : 1)) * 8; +return ((numFields + 63)/ 64) * 8; } /** http://git-wip-us.apache.org/repos/asf/spark/blob/5452e93f/sql/core/src/test/scala/org/apache/spark/sql/UnsafeRowSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/UnsafeRowSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/UnsafeRowSuite.scala index c5faaa6..89bad1b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/UnsafeRowSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/UnsafeRowSuite.scala @@ -28,6 +28,16 @@ import org.apache.spark.unsafe.memory.MemoryAllocator import org.apache.spark.unsafe.types.UTF8String class UnsafeRowSuite extends SparkFunSuite { + + test("bitset width calculation") { +assert(UnsafeRow.calculateBitSetWidthInBytes(0) === 0) +assert(UnsafeRow.calculateBitSetWidthInBytes(1) === 8) +assert(UnsafeRow.calculateBitSetWidthInBytes(32) === 8) +assert(UnsafeRow.calculateBitSetWidthInBytes(64) === 8) +assert(UnsafeRow.calculateBitSetWidthInBytes(65) === 16) +assert(UnsafeRow.calculateBitSetWidthInBytes(128) === 16) + } + test("writeToStream") { val row = InternalRow.apply(UTF8String.fromString("hello"), UTF8String.fromString("world"), 123) val arrayBackedUnsafeRow: UnsafeRow =
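The simplification leans on the ceiling-division identity ceil(n/64) == (n + 63)/64 for non-negative n; a quick sketch to confirm the old and new forms agree over the range the tests above exercise:

    def oldWidth(numFields: Int) = ((numFields / 64) + (if (numFields % 64 == 0) 0 else 1)) * 8
    def newWidth(numFields: Int) = ((numFields + 63) / 64) * 8

    assert((0 to 256).forall(n => oldWidth(n) == newWidth(n)))   // matches the test cases above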
spark git commit: [SPARK-9554] [SQL] Enables in-memory partition pruning by default
Repository: spark Updated Branches: refs/heads/branch-1.5 5452e93f0 - 6d46e9b7c [SPARK-9554] [SQL] Enables in-memory partition pruning by default Author: Cheng Lian l...@databricks.com Closes #7895 from liancheng/spark-9554/enable-in-memory-partition-pruning and squashes the following commits: 67c403e [Cheng Lian] Enables in-memory partition pruning by default (cherry picked from commit 703e44bff19f4c394f6f9bff1ce9152cdc68c51e) Signed-off-by: Reynold Xin r...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6d46e9b7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6d46e9b7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6d46e9b7 Branch: refs/heads/branch-1.5 Commit: 6d46e9b7c8ffde5d3cc3d86b005c40c51934e56b Parents: 5452e93 Author: Cheng Lian l...@databricks.com Authored: Mon Aug 3 12:06:58 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 12:07:09 2015 -0700 -- sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6d46e9b7/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala index 387960c..41ba1c7 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala @@ -200,7 +200,7 @@ private[spark] object SQLConf { val IN_MEMORY_PARTITION_PRUNING = booleanConf("spark.sql.inMemoryColumnarStorage.partitionPruning", - defaultValue = Some(false), + defaultValue = Some(true), doc = "When true, enable partition pruning for in-memory columnar tables.", isPublic = false)
spark git commit: [SPARK-9554] [SQL] Enables in-memory partition pruning by default
Repository: spark Updated Branches: refs/heads/master 7a9d09f0b - 703e44bff [SPARK-9554] [SQL] Enables in-memory partition pruning by default Author: Cheng Lian l...@databricks.com Closes #7895 from liancheng/spark-9554/enable-in-memory-partition-pruning and squashes the following commits: 67c403e [Cheng Lian] Enables in-memory partition pruning by default Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/703e44bf Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/703e44bf Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/703e44bf Branch: refs/heads/master Commit: 703e44bff19f4c394f6f9bff1ce9152cdc68c51e Parents: 7a9d09f Author: Cheng Lian l...@databricks.com Authored: Mon Aug 3 12:06:58 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 12:06:58 2015 -0700 -- sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/703e44bf/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala b/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala index 387960c..41ba1c7 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala @@ -200,7 +200,7 @@ private[spark] object SQLConf { val IN_MEMORY_PARTITION_PRUNING = booleanConf("spark.sql.inMemoryColumnarStorage.partitionPruning", - defaultValue = Some(false), + defaultValue = Some(true), doc = "When true, enable partition pruning for in-memory columnar tables.", isPublic = false)
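With the default flipped, opting back out stays a one-liner per session; `sqlContext` below is the usual SQLContext handle and is assumed to exist:

    // Disable in-memory partition pruning again for this session if needed.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.partitionPruning", "false")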
spark git commit: [SPARK-5133] [ML] Added featureImportance to RandomForestClassifier and Regressor
Repository: spark Updated Branches: refs/heads/branch-1.5 6d46e9b7c - b3117d312 [SPARK-5133] [ML] Added featureImportance to RandomForestClassifier and Regressor Added featureImportance to RandomForestClassifier and Regressor. This follows the scikit-learn implementation here: [https://github.com/scikit-learn/scikit-learn/blob/a95203b249c1cf392f86d001ad999e29b2392739/sklearn/tree/_tree.pyx#L3341] CC: yanboliang Would you mind taking a look? Thanks! Author: Joseph K. Bradley jos...@databricks.com Author: Feynman Liang fli...@databricks.com Closes #7838 from jkbradley/dt-feature-importance and squashes the following commits: 72a167a [Joseph K. Bradley] fixed unit test 86cea5f [Joseph K. Bradley] Modified RF featuresImportances to return Vector instead of Map 5aa74f0 [Joseph K. Bradley] finally fixed unit test for real 33df5db [Joseph K. Bradley] fix unit test 42a2d3b [Joseph K. Bradley] fix unit test fe94e72 [Joseph K. Bradley] modified feature importance unit tests cc693ee [Feynman Liang] Add classifier tests 79a6f87 [Feynman Liang] Compare dense vectors in test 21d01fc [Feynman Liang] Added failing SKLearn test ac0b254 [Joseph K. Bradley] Added featureImportance to RandomForestClassifier/Regressor. Need to add unit tests (cherry picked from commit ff9169a002f1b75231fd25b7d04157a912503038) Signed-off-by: Joseph K. Bradley jos...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b3117d31 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b3117d31 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b3117d31 Branch: refs/heads/branch-1.5 Commit: b3117d312332af3b4bd416857f632cacb5230feb Parents: 6d46e9b Author: Joseph K. Bradley jos...@databricks.com Authored: Mon Aug 3 12:17:46 2015 -0700 Committer: Joseph K. 
Bradley jos...@databricks.com Committed: Mon Aug 3 12:17:56 2015 -0700 -- .../classification/RandomForestClassifier.scala | 30 +- .../ml/regression/RandomForestRegressor.scala | 33 -- .../scala/org/apache/spark/ml/tree/Node.scala | 19 +++- .../spark/ml/tree/impl/RandomForest.scala | 92  .../org/apache/spark/ml/tree/treeModels.scala | 6 ++ .../JavaRandomForestClassifierSuite.java| 2 + .../JavaRandomForestRegressorSuite.java | 2 + .../RandomForestClassifierSuite.scala | 31 +- .../org/apache/spark/ml/impl/TreeTests.scala| 18  .../regression/RandomForestRegressorSuite.scala | 27 - .../spark/ml/tree/impl/RandomForestSuite.scala | 107 +++ 11 files changed, 351 insertions(+), 16 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b3117d31/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala index 56e80cc..b59826a 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala @@ -95,7 +95,8 @@ final class RandomForestClassifier(override val uid: String) val trees = RandomForest.run(oldDataset, strategy, getNumTrees, getFeatureSubsetStrategy, getSeed) .map(_.asInstanceOf[DecisionTreeClassificationModel]) -new RandomForestClassificationModel(trees, numClasses) +val numFeatures = oldDataset.first().features.size +new RandomForestClassificationModel(trees, numFeatures, numClasses) } override def copy(extra: ParamMap): RandomForestClassifier = defaultCopy(extra) @@ -118,11 +119,13 @@ object RandomForestClassifier { * features. * @param _trees Decision trees in the ensemble. * Warning: These have null parents. + * @param numFeatures Number of features used by this model */ @Experimental final class RandomForestClassificationModel private[ml] ( override val uid: String, private val _trees: Array[DecisionTreeClassificationModel], +val numFeatures: Int, override val numClasses: Int) extends ProbabilisticClassificationModel[Vector, RandomForestClassificationModel] with TreeEnsembleModel with Serializable { @@ -133,8 +136,8 @@ final class RandomForestClassificationModel private[ml] ( * Construct a random forest classification model, with all trees weighted equally. * @param trees Component trees */ - def this(trees: Array[DecisionTreeClassificationModel], numClasses: Int) = -this(Identifiable.randomUID("rfc"), trees, numClasses) + def this(trees: Array[DecisionTreeClassificationModel], numFeatures: Int, numClasses: Int)
Git Push Summary
Repository: spark Updated Branches: refs/heads/branch-1.5 [created] b41a32718
spark git commit: [SPARK-9511] [SQL] Fixed Table Name Parsing
Repository: spark Updated Branches: refs/heads/master b41a32718 - dfe7bd168 [SPARK-9511] [SQL] Fixed Table Name Parsing The issue was that the tokenizer was parsing "1one" into the numeric 1 using the code on line 110. I added another case to accept strings that start with a number and then have a letter somewhere else in it as well. Author: Joseph Batchik joseph.batc...@cloudera.com Closes #7844 from JDrit/parse_error and squashes the following commits: b8ca12f [Joseph Batchik] fixed parsing issue by adding another case Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dfe7bd16 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dfe7bd16 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dfe7bd16 Branch: refs/heads/master Commit: dfe7bd168d9bcf8c53f993f459ab473d893457b0 Parents: b41a327 Author: Joseph Batchik joseph.batc...@cloudera.com Authored: Mon Aug 3 11:17:38 2015 -0700 Committer: Michael Armbrust mich...@databricks.com Committed: Mon Aug 3 11:17:38 2015 -0700 -- .../spark/sql/catalyst/AbstractSparkSQLParser.scala | 2 ++ .../test/scala/org/apache/spark/sql/SQLQuerySuite.scala | 10 ++ 2 files changed, 12 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dfe7bd16/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/AbstractSparkSQLParser.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/AbstractSparkSQLParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/AbstractSparkSQLParser.scala index d494ae7..5898a5f 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/AbstractSparkSQLParser.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/AbstractSparkSQLParser.scala @@ -104,6 +104,8 @@ class SqlLexical extends StdLexical { override lazy val token: Parser[Token] = ( identChar ~ (identChar | digit).* ^^ { case first ~ rest => processIdent((first :: rest).mkString) } +| digit.* ~ identChar ~ (identChar | digit).* ^^ + { case first ~ middle ~ rest => processIdent((first ++ (middle :: rest)).mkString) } | rep1(digit) ~ ('.' ~ digit.*).? ^^ { case i ~ None => NumericLit(i.mkString) case i ~ Some(d) => FloatLit(i.mkString + "." + d.mkString) http://git-wip-us.apache.org/repos/asf/spark/blob/dfe7bd16/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala index bbadc20..f1abae0 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala @@ -1604,4 +1604,14 @@ class SQLQuerySuite extends QueryTest with BeforeAndAfterAll with SQLTestUtils { checkAnswer(df.select(-df("i")), Row(new CalendarInterval(-(12 * 3 - 3), -(7L * MICROS_PER_WEEK + 123 } + + test("SPARK-9511: error with table starting with number") { +val df = sqlContext.sparkContext.parallelize(1 to 10).map(i => (i, i.toString)) + .toDF("num", "str") +df.registerTempTable("1one") + +checkAnswer(sqlContext.sql("select count(num) from 1one"), Row(10)) + +sqlContext.dropTempTable("1one") + } }
spark git commit: [SPARK-9528] [ML] Changed RandomForestClassifier to extend ProbabilisticClassifier
Repository: spark Updated Branches: refs/heads/master 8be198c86 - 69f5a7c93 [SPARK-9528] [ML] Changed RandomForestClassifier to extend ProbabilisticClassifier RandomForestClassifier now outputs rawPrediction based on tree probabilities, plus probability column computed from normalized rawPrediction. CC: holdenk Author: Joseph K. Bradley jos...@databricks.com Closes #7859 from jkbradley/rf-prob and squashes the following commits: 6c28f51 [Joseph K. Bradley] Changed RandomForestClassifier to extend ProbabilisticClassifier Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/69f5a7c9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/69f5a7c9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/69f5a7c9 Branch: refs/heads/master Commit: 69f5a7c934ac553ed52c00679b800bcffe83c1d6 Parents: 8be198c Author: Joseph K. Bradley jos...@databricks.com Authored: Mon Aug 3 10:46:34 2015 -0700 Committer: Joseph K. Bradley jos...@databricks.com Committed: Mon Aug 3 10:46:34 2015 -0700 -- .../classification/DecisionTreeClassifier.scala | 8 + .../ProbabilisticClassifier.scala | 27 +- .../classification/RandomForestClassifier.scala | 37 ++-- .../RandomForestClassifierSuite.scala | 36 ++- 4 files changed, 81 insertions(+), 27 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/69f5a7c9/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala index f27cfd0..f2b992f 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala @@ -131,13 +131,7 @@ final class DecisionTreeClassificationModel private[ml] ( override protected def raw2probabilityInPlace(rawPrediction: Vector): Vector = { rawPrediction match { case dv: DenseVector => -var i = 0 -val size = dv.size -val sum = dv.values.sum -while (i < size) { - dv.values(i) = if (sum != 0) dv.values(i) / sum else 0.0 - i += 1 -} +ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace(dv) dv case sv: SparseVector => throw new RuntimeException("Unexpected error in DecisionTreeClassificationModel:" + http://git-wip-us.apache.org/repos/asf/spark/blob/69f5a7c9/mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala index dad4511..f9c9c23 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala @@ -20,7 +20,7 @@ package org.apache.spark.ml.classification import org.apache.spark.annotation.DeveloperApi import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util.SchemaUtils -import org.apache.spark.mllib.linalg.{Vector, VectorUDT} +import org.apache.spark.mllib.linalg.{SparseVector, DenseVector, Vector, VectorUDT} import org.apache.spark.sql.DataFrame import org.apache.spark.sql.functions._ import org.apache.spark.sql.types.{DoubleType, DataType, StructType} @@ -175,3 +175,28 @@ private[spark] abstract class ProbabilisticClassificationModel[ */ protected def
probability2prediction(probability: Vector): Double = probability.argmax } + +private[ml] object ProbabilisticClassificationModel { + + /** + * Normalize a vector of raw predictions to be a multinomial probability vector, in place. + * + * The input raw predictions should be >= 0. + * The output vector sums to 1, unless the input vector is all-0 (in which case the output is + * all-0 too). + * + * NOTE: This is NOT applicable to all models, only ones which effectively use class + * instance counts for raw predictions. + */ + def normalizeToProbabilitiesInPlace(v: DenseVector): Unit = { +val sum = v.values.sum +if (sum != 0) { + var i = 0 + val size = v.size + while (i < size) { +v.values(i) /= sum +i += 1 + } +} + } +} http://git-wip-us.apache.org/repos/asf/spark/blob/69f5a7c9/mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala
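For readers skimming the diff, a minimal sketch of what the shared normalization helper does, using a made-up rawPrediction vector of per-class instance counts (the values and variable names below are illustrative, not from the patch):

import org.apache.spark.mllib.linalg.DenseVector

// e.g. votes for 3 classes collected from 10 trees
val counts = new DenseVector(Array(2.0, 6.0, 2.0))
val sum = counts.values.sum          // 10.0
if (sum != 0) {
  var i = 0
  while (i < counts.size) {
    counts.values(i) /= sum          // normalize in place
    i += 1
  }
}
// counts is now [0.2, 0.6, 0.2]; an all-0 input is left all-0, as documented above.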
spark git commit: [SPARK-1855] Local checkpointing
Repository: spark Updated Branches: refs/heads/master 69f5a7c93 - b41a32718 [SPARK-1855] Local checkpointing Certain use cases of Spark involve RDDs with long lineages that must be truncated periodically (e.g. GraphX). The existing way of doing it is through `rdd.checkpoint()`, which is expensive because it writes to HDFS. This patch provides an alternative to truncate lineages cheaply *without providing the same level of fault tolerance*. **Local checkpointing** writes checkpointed data to the local file system through the block manager. It is much faster than replicating to a reliable storage and provides the same semantics as long as executors do not fail. It is accessible through a new operator `rdd.localCheckpoint()` and leaves the old one unchanged. Users may even decide to combine the two and call the reliable one less frequently. The bulk of this patch involves refactoring the checkpointing interface to accept custom implementations of checkpointing. [Design doc](https://issues.apache.org/jira/secure/attachment/12741708/SPARK-7292-design.pdf). Author: Andrew Or and...@databricks.com Closes #7279 from andrewor14/local-checkpoint and squashes the following commits: 729600f [Andrew Or] Oops, fix tests 34bc059 [Andrew Or] Avoid computing all partitions in local checkpoint e43bbb6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint 3be5aea [Andrew Or] Address comments bf846a6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint ab003a3 [Andrew Or] Fix compile c2e111b [Andrew Or] Address comments 33f167a [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint e908a42 [Andrew Or] Fix tests f5be0f3 [Andrew Or] Use MEMORY_AND_DISK as the default local checkpoint level a92657d [Andrew Or] Update a few comments e58e3e3 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint 4eb6eb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint 1bbe154 [Andrew Or] Simplify LocalCheckpointRDD 48a9996 [Andrew Or] Avoid traversing dependency tree + rewrite tests 62aba3f [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint db70dc2 [Andrew Or] Express local checkpointing through caching the original RDD 87d43c6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint c449b38 [Andrew Or] Fix style 4a182f3 [Andrew Or] Add fine-grained tests for local checkpointing 53b363b [Andrew Or] Rename a few more awkwardly named methods (minor) e4cf071 [Andrew Or] Simplify LocalCheckpointRDD + docs + clean ups 4880deb [Andrew Or] Fix style d096c67 [Andrew Or] Fix mima 172cb66 [Andrew Or] Fix mima? 
e53d964 [Andrew Or] Fix style 56831c5 [Andrew Or] Add a few warnings and clear exception messages 2e59646 [Andrew Or] Add local checkpoint clean up tests 4dbbab1 [Andrew Or] Refactor CheckpointSuite to test local checkpointing 4514dc9 [Andrew Or] Clean local checkpoint files through RDD cleanups 0477eec [Andrew Or] Rename a few methods with awkward names (minor) 2e902e5 [Andrew Or] First implementation of local checkpointing 8447454 [Andrew Or] Fix tests 4ac1896 [Andrew Or] Refactor checkpoint interface for modularity Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b41a3271 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b41a3271 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b41a3271 Branch: refs/heads/master Commit: b41a32718d615b304efba146bf97be0229779b01 Parents: 69f5a7c Author: Andrew Or and...@databricks.com Authored: Mon Aug 3 10:58:37 2015 -0700 Committer: Tathagata Das tathagata.das1...@gmail.com Committed: Mon Aug 3 10:58:37 2015 -0700 -- .../scala/org/apache/spark/ContextCleaner.scala | 9 +- .../scala/org/apache/spark/SparkContext.scala | 2 +- .../scala/org/apache/spark/TaskContext.scala| 8 + .../org/apache/spark/rdd/CheckpointRDD.scala| 153 + .../apache/spark/rdd/LocalCheckpointRDD.scala | 67 .../spark/rdd/LocalRDDCheckpointData.scala | 83 + .../main/scala/org/apache/spark/rdd/RDD.scala | 128 +-- .../apache/spark/rdd/RDDCheckpointData.scala| 106 ++ .../spark/rdd/ReliableCheckpointRDD.scala | 172 ++ .../spark/rdd/ReliableRDDCheckpointData.scala | 108 ++ .../org/apache/spark/CheckpointSuite.scala | 164 + .../org/apache/spark/ContextCleanerSuite.scala | 61 +++- .../apache/spark/rdd/LocalCheckpointSuite.scala | 330 +++ project/MimaExcludes.scala | 9 +- 14 files changed, 1085 insertions(+), 315 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b41a3271/core/src/main/scala/org/apache/spark/ContextCleaner.scala
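A minimal usage sketch of the new operator, following the description above (the SparkContext `sc`, the data, and the iteration count are illustrative):

// Truncate a long lineage cheaply via the block manager instead of HDFS.
// This trades fault tolerance for speed, as the commit message notes.
var rdd = sc.parallelize(1 to 1000)
for (_ <- 1 to 100) {        // e.g. an iterative algorithm growing the lineage
  rdd = rdd.map(_ + 1)
}
rdd.localCheckpoint()        // cheap alternative to rdd.checkpoint()
rdd.count()                  // the checkpoint materializes on the first action

Combining the two, as the commit message suggests, would mean calling localCheckpoint() every few iterations and the reliable checkpoint() much less frequently.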
spark git commit: [SPARK-9551][SQL] add a cheap version of copy for UnsafeRow to reuse a copy buffer
Repository: spark Updated Branches: refs/heads/master 95dccc633 - 137f47865 [SPARK-9551][SQL] add a cheap version of copy for UnsafeRow to reuse a copy buffer Author: Wenchen Fan cloud0...@outlook.com Closes #7885 from cloud-fan/cheap-copy and squashes the following commits: 0900ca1 [Wenchen Fan] replace == with === 73f4ada [Wenchen Fan] add tests 07b865a [Wenchen Fan] add a cheap version of copy Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/137f4786 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/137f4786 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/137f4786 Branch: refs/heads/master Commit: 137f47865df6e98ab70ae5ba30dc4d441fb41166 Parents: 95dccc6 Author: Wenchen Fan cloud0...@outlook.com Authored: Mon Aug 3 04:21:15 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 04:21:15 2015 -0700 -- .../sql/catalyst/expressions/UnsafeRow.java | 32 + .../org/apache/spark/sql/UnsafeRowSuite.scala | 38 2 files changed, 70 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/137f4786/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java -- diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java index c5d42d7..f4230cf 100644 --- a/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java +++ b/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java @@ -464,6 +464,38 @@ public final class UnsafeRow extends MutableRow { } /** + * Creates an empty UnsafeRow from a byte array with the specified numBytes and numFields. + * The returned row is invalid until we call copyFrom on it. + */ + public static UnsafeRow createFromByteArray(int numBytes, int numFields) { +final UnsafeRow row = new UnsafeRow(); +row.pointTo(new byte[numBytes], numFields, numBytes); +return row; + } + + /** + * Copies the input UnsafeRow to this UnsafeRow, and resizes the underlying byte[] when the + * input row is larger than this row. + */ + public void copyFrom(UnsafeRow row) { +// copyFrom is only available for UnsafeRows created from a byte array. +assert (baseObject instanceof byte[]) && baseOffset == PlatformDependent.BYTE_ARRAY_OFFSET; +if (row.sizeInBytes > this.sizeInBytes) { + // resize the underlying byte[] if it's not large enough. + this.baseObject = new byte[row.sizeInBytes]; +} +PlatformDependent.copyMemory( + row.baseObject, + row.baseOffset, + this.baseObject, + this.baseOffset, + row.sizeInBytes +); +// update the sizeInBytes. +this.sizeInBytes = row.sizeInBytes; + } + + /** * Write this UnsafeRow's underlying bytes to the given OutputStream. * * @param out the stream to write to.
http://git-wip-us.apache.org/repos/asf/spark/blob/137f4786/sql/core/src/test/scala/org/apache/spark/sql/UnsafeRowSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/UnsafeRowSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/UnsafeRowSuite.scala index e72a1bc..c5faaa6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/UnsafeRowSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/UnsafeRowSuite.scala @@ -82,4 +82,42 @@ class UnsafeRowSuite extends SparkFunSuite { assert(unsafeRow.get(0, dataType) === null) } } + + test("createFromByteArray and copyFrom") { +val row = InternalRow(1, UTF8String.fromString("abc")) +val converter = UnsafeProjection.create(Array[DataType](IntegerType, StringType)) +val unsafeRow = converter.apply(row) + +val emptyRow = UnsafeRow.createFromByteArray(64, 2) +val buffer = emptyRow.getBaseObject + +emptyRow.copyFrom(unsafeRow) +assert(emptyRow.getSizeInBytes() === unsafeRow.getSizeInBytes) +assert(emptyRow.getInt(0) === unsafeRow.getInt(0)) +assert(emptyRow.getUTF8String(1) === unsafeRow.getUTF8String(1)) +// make sure we reuse the buffer. +assert(emptyRow.getBaseObject === buffer) + +// make sure we really copied the input row. +unsafeRow.setInt(0, 2) +assert(emptyRow.getInt(0) === 1) + +val longString = UTF8String.fromString((1 to 100).map(_ => "abc").reduce(_ + _)) +val row2 = InternalRow(3, longString) +val unsafeRow2 = converter.apply(row2) + +// make sure we can resize. +emptyRow.copyFrom(unsafeRow2) +assert(emptyRow.getSizeInBytes() === unsafeRow2.getSizeInBytes) +assert(emptyRow.getInt(0) === 3) +
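The intended reuse pattern is roughly the following sketch, assuming an iterator of UnsafeRows whose backing buffer is recycled between calls (`rows` and `process` are hypothetical names, not from the patch):

// Amortize allocation: copy every incoming row into one reusable buffer-backed row.
val reused = UnsafeRow.createFromByteArray(64, 2)  // initial capacity; copyFrom grows it on demand
while (rows.hasNext) {
  reused.copyFrom(rows.next())   // reuses (or, if needed, resizes) the backing byte[]
  process(reused)                // safe: reused no longer aliases the iterator's buffer
}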
spark git commit: Two minor comments from code review on 191bf2689.
Repository: spark Updated Branches: refs/heads/master 191bf2689 - 8be198c86 Two minor comments from code review on 191bf2689. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8be198c8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8be198c8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8be198c8 Branch: refs/heads/master Commit: 8be198c86935001907727fd16577231ff776125b Parents: 191bf26 Author: Reynold Xin r...@databricks.com Authored: Mon Aug 3 04:26:18 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 04:26:18 2015 -0700 -- .../sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala | 2 +- .../expressions/codegen/GenerateUnsafeRowJoinerBitsetSuite.scala | 2 ++ 2 files changed, 3 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8be198c8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala index 5f8a6f8..30b51dd 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala @@ -76,7 +76,7 @@ object GenerateUnsafeRowJoiner extends CodeGenerator[(StructType, StructType), U } else if (i - bitset1Words < bitset2Words - 1) { // combine next two words of bitset2 s"($getLong(obj2, offset2 + ${(i - bitset1Words) * 8}) >>> (64 - $bitset1Remainder))" + -s"| ($getLong(obj2, offset2 + ${(i - bitset1Words + 1) * 8}) << $bitset1Remainder)" +s" | ($getLong(obj2, offset2 + ${(i - bitset1Words + 1) * 8}) << $bitset1Remainder)" } else { // last word of bitset2 s"$getLong(obj2, offset2 + ${(i - bitset1Words) * 8}) >>> (64 - $bitset1Remainder)" http://git-wip-us.apache.org/repos/asf/spark/blob/8be198c8/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoinerBitsetSuite.scala -- diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoinerBitsetSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoinerBitsetSuite.scala index 718a2ac..aff1bee 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoinerBitsetSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoinerBitsetSuite.scala @@ -92,6 +92,8 @@ class GenerateUnsafeRowJoinerBitsetSuite extends SparkFunSuite { private def createUnsafeRow(numFields: Int): UnsafeRow = { val row = new UnsafeRow val sizeInBytes = numFields * 8 + ((numFields + 63) / 64) * 8 +// Allocate a larger buffer than needed and point the UnsafeRow to somewhere in the middle. +// This way we can test the joiner when the input UnsafeRows are not the entire arrays. val offset = numFields * 8 val buf = new Array[Byte](sizeInBytes + offset) row.pointTo(buf, PlatformDependent.BYTE_ARRAY_OFFSET + offset, numFields, sizeInBytes)
spark git commit: [SPARK-7563] (backport for 1.3) OutputCommitCoordinator.stop() should only run on the driver
Repository: spark Updated Branches: refs/heads/branch-1.3 cc5f711c0 - 265ec35bc [SPARK-7563] (backport for 1.3) OutputCommitCoordinator.stop() should only run on the driver Backport of [SPARK-7563] OutputCommitCoordinator.stop() should only run on the driver for 1.3 Author: Sean Owen so...@cloudera.com Closes #7865 from srowen/SPARK-7563-1.3 and squashes the following commits: f4479bc [Sean Owen] Backport of [SPARK-7563] OutputCommitCoordinator.stop() should only run on the driver for 1.3 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/265ec35b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/265ec35b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/265ec35b Branch: refs/heads/branch-1.3 Commit: 265ec35bc8938939ac55d90b09e6a1a3773155eb Parents: cc5f711 Author: Sean Owen so...@cloudera.com Authored: Mon Aug 3 13:59:00 2015 +0100 Committer: Sean Owen so...@cloudera.com Committed: Mon Aug 3 13:59:00 2015 +0100 -- core/src/main/scala/org/apache/spark/SparkEnv.scala | 2 +- .../apache/spark/scheduler/OutputCommitCoordinator.scala | 10 ++ .../spark/scheduler/OutputCommitCoordinatorSuite.scala| 2 +- 3 files changed, 8 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/265ec35b/core/src/main/scala/org/apache/spark/SparkEnv.scala -- diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala b/core/src/main/scala/org/apache/spark/SparkEnv.scala index d95a176..9c24afb 100644 --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala @@ -357,7 +357,7 @@ object SparkEnv extends Logging { } val outputCommitCoordinator = mockOutputCommitCoordinator.getOrElse { - new OutputCommitCoordinator(conf) + new OutputCommitCoordinator(conf, isDriver) } val outputCommitCoordinatorActor = registerOrLookup("OutputCommitCoordinator", new OutputCommitCoordinatorActor(outputCommitCoordinator)) http://git-wip-us.apache.org/repos/asf/spark/blob/265ec35b/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala -- diff --git a/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala b/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala index d89721c..4c70958 100644 --- a/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala +++ b/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala @@ -40,7 +40,7 @@ private case class AskPermissionToCommitOutput(stage: Int, task: Long, taskAttem * This class was introduced in SPARK-4879; see that JIRA issue (and the associated pull requests) * for an extensive design discussion. */ -private[spark] class OutputCommitCoordinator(conf: SparkConf) extends Logging { +private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean) extends Logging { // Initialized by SparkEnv var coordinatorActor: Option[ActorRef] = None @@ -134,9 +134,11 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf) extends Logging { } def stop(): Unit = synchronized { -coordinatorActor.foreach(_ ! StopCoordinator) -coordinatorActor = None -authorizedCommittersByStage.clear() +if (isDriver) { + coordinatorActor.foreach(_ !
StopCoordinator) + coordinatorActor = None + authorizedCommittersByStage.clear() +} } // Marked private[scheduler] instead of private so this can be mocked in tests http://git-wip-us.apache.org/repos/asf/spark/blob/265ec35b/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala -- diff --git a/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala index 52f126f..14ecf72 100644 --- a/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala +++ b/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala @@ -81,7 +81,7 @@ class OutputCommitCoordinatorSuite extends SparkFunSuite with BeforeAndAfter { conf: SparkConf, isLocal: Boolean, listenerBus: LiveListenerBus): SparkEnv = { -outputCommitCoordinator = spy(new OutputCommitCoordinator(conf)) +outputCommitCoordinator = spy(new OutputCommitCoordinator(conf, isDriver = true)) // Use Mockito.spy() to maintain the default infrastructure everywhere else. // This mocking allows us to control the coordinator responses in
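The shape of the fix is a standard ownership guard: lifecycle teardown that touches driver-owned state becomes a no-op everywhere else. A generic sketch of the pattern (not the Spark code itself; names are illustrative):

// Only the process that owns a shared resource should tear it down.
class Coordinator(isDriver: Boolean) {
  private var handle: Option[AnyRef] = Some(new Object)
  def stop(): Unit = synchronized {
    if (isDriver) {   // executors share the driver's coordinator and must not stop it
      handle = None
    }
  }
}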
spark git commit: [SPARK-9518] [SQL] cleanup generated UnsafeRowJoiner and fix bug
Repository: spark Updated Branches: refs/heads/master 137f47865 - 191bf2689 [SPARK-9518] [SQL] cleanup generated UnsafeRowJoiner and fix bug Currently, when copying the bitsets, we didn't consider that row1 may not sit at the beginning of the byte array. cc rxin Author: Davies Liu dav...@databricks.com Closes #7892 from davies/clean_join and squashes the following commits: 14cce9e [Davies Liu] cleanup generated UnsafeRowJoiner and fix bug Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/191bf268 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/191bf268 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/191bf268 Branch: refs/heads/master Commit: 191bf2689d127a9dd328b9cc517362fd51eaed3d Parents: 137f478 Author: Davies Liu dav...@databricks.com Authored: Mon Aug 3 04:23:26 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 04:23:26 2015 -0700 -- .../codegen/GenerateUnsafeRowJoiner.scala | 102 ++- .../GenerateUnsafeRowJoinerBitsetSuite.scala| 7 +- 2 files changed, 37 insertions(+), 72 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/191bf268/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala index 645eb48..5f8a6f8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala @@ -40,10 +40,6 @@ abstract class UnsafeRowJoiner { */ object GenerateUnsafeRowJoiner extends CodeGenerator[(StructType, StructType), UnsafeRowJoiner] { - def dump(word: Long): String = { -Seq.tabulate(64) { i => if ((word >> i) % 2 == 0) "0" else "1" }.reverse.mkString - } - override protected def create(in: (StructType, StructType)): UnsafeRowJoiner = { create(in._1, in._2) } @@ -56,76 +52,45 @@ object GenerateUnsafeRowJoiner extends CodeGenerator[(StructType, StructType), U } def create(schema1: StructType, schema2: StructType): UnsafeRowJoiner = { -val ctx = newCodeGenContext() val offset = PlatformDependent.BYTE_ARRAY_OFFSET +val getLong = "PlatformDependent.UNSAFE.getLong" +val putLong = "PlatformDependent.UNSAFE.putLong" val bitset1Words = (schema1.size + 63) / 64 val bitset2Words = (schema2.size + 63) / 64 val outputBitsetWords = (schema1.size + schema2.size + 63) / 64 val bitset1Remainder = schema1.size % 64 -val bitset2Remainder = schema2.size % 64 // The number of words we can reduce when we concat two rows together. // The only reduction comes from merging the bitset portion of the two rows, saving 1 word.
val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords -// - copy bitset from row 1 --- // -val copyBitset1 = Seq.tabulate(bitset1Words) { i => - s""" - |PlatformDependent.UNSAFE.putLong(buf, ${offset + i * 8}, - | PlatformDependent.UNSAFE.getLong(obj1, ${offset + i * 8})); - """.stripMargin -}.mkString - - -// - copy bitset from row 2 --- // -var copyBitset2 = "" -if (bitset1Remainder == 0) { - copyBitset2 += Seq.tabulate(bitset2Words) { i => -s""" - |PlatformDependent.UNSAFE.putLong(buf, ${offset + (bitset1Words + i) * 8}, - | PlatformDependent.UNSAFE.getLong(obj2, ${offset + i * 8})); - """.stripMargin - }.mkString -} else { - copyBitset2 = Seq.tabulate(bitset2Words) { i => -s""" - |long bs2w$i = PlatformDependent.UNSAFE.getLong(obj2, ${offset + i * 8}); - |long bs2w${i}p1 = (bs2w$i << $bitset1Remainder) & ~((1L << $bitset1Remainder) - 1); - |long bs2w${i}p2 = (bs2w$i >>> ${64 - bitset1Remainder}); - """.stripMargin - }.mkString - - copyBitset2 += Seq.tabulate(bitset2Words) { i => -val currentOffset = offset + (bitset1Words + i - 1) * 8 -if (i == 0) { - if (bitset1Words > 0) { -s""" - |PlatformDependent.UNSAFE.putLong(buf, $currentOffset, - | bs2w${i}p1 | PlatformDependent.UNSAFE.getLong(obj1, $currentOffset)); -""".stripMargin - } else { -s""" - |PlatformDependent.UNSAFE.putLong(buf, $currentOffset + 8, bs2w${i}p1); -""".stripMargin - } +// -
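The word-level merge that this generated code performs can be modeled in plain Scala. The sketch below (a made-up helper, ignoring the Unsafe base offsets whose mishandling motivated the fix) concatenates two Long-array bitsets so that bit j of bitset2 lands at output bit n1 + j:

def mergeBitsets(bs1: Array[Long], n1: Int, bs2: Array[Long], n2: Int): Array[Long] = {
  val out = new Array[Long]((n1 + n2 + 63) / 64)
  val words1 = (n1 + 63) / 64
  val r = n1 % 64                          // valid low bits in bitset1's last word
  System.arraycopy(bs1, 0, out, 0, words1)
  for (i <- bs2.indices) {
    if (r == 0) {
      out(words1 + i) = bs2(i)             // word-aligned: plain copy
    } else {
      out(words1 - 1 + i) |= bs2(i) << r   // low bits fill the current word's high part
      val next = words1 + i
      if (next < out.length) {
        out(next) |= bs2(i) >>> (64 - r)   // high bits spill into the next word
      }
    }
  }
  out
}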
spark git commit: [SPARK-9191] [ML] [Doc] Add ml.PCA user guide and code examples
Repository: spark Updated Branches: refs/heads/master ba1c4e138 - 8ca287ebb [SPARK-9191] [ML] [Doc] Add ml.PCA user guide and code examples Add ml.PCA user guide document and code examples for Scala/Java/Python. Author: Yanbo Liang yblia...@gmail.com Closes #7522 from yanboliang/ml-pca-md and squashes the following commits: 60dec05 [Yanbo Liang] address comments f992abe [Yanbo Liang] Add ml.PCA doc and examples Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8ca287eb Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8ca287eb Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8ca287eb Branch: refs/heads/master Commit: 8ca287ebbd58985a568341b08040d0efa9d3641a Parents: ba1c4e1 Author: Yanbo Liang yblia...@gmail.com Authored: Mon Aug 3 13:58:00 2015 -0700 Committer: Joseph K. Bradley jos...@databricks.com Committed: Mon Aug 3 13:58:00 2015 -0700 -- docs/ml-features.md | 86 1 file changed, 86 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8ca287eb/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 54068de..fa0ad1f 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -461,6 +461,92 @@ for binarized_feature, in binarizedFeatures.collect(): </div> </div> +## PCA + +[PCA](http://en.wikipedia.org/wiki/Principal_component_analysis) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A [PCA](api/scala/index.html#org.apache.spark.ml.feature.PCA) class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components. + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> +See the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.feature.PCA) for API details. +{% highlight scala %} +import org.apache.spark.ml.feature.PCA +import org.apache.spark.mllib.linalg.Vectors + +val data = Array( + Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))), + Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0), + Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0) +) +val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features") +val pca = new PCA() + .setInputCol("features") + .setOutputCol("pcaFeatures") + .setK(3) + .fit(df) +val pcaDF = pca.transform(df) +val result = pcaDF.select("pcaFeatures") +result.show() +{% endhighlight %} +</div> + +<div data-lang="java" markdown="1"> +See the [Java API documentation](api/java/org/apache/spark/ml/feature/PCA.html) for API details. +{% highlight java %} +import com.google.common.collect.Lists; + +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.ml.feature.PCA; +import org.apache.spark.ml.feature.PCAModel; +import org.apache.spark.mllib.linalg.VectorUDT; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +JavaSparkContext jsc = ... +SQLContext jsql = ... +JavaRDD<Row> data = jsc.parallelize(Lists.newArrayList( + RowFactory.create(Vectors.sparse(5, new int[]{1, 3}, new double[]{1.0, 7.0})), + RowFactory.create(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)), + RowFactory.create(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)) +)); +StructType schema = new StructType(new StructField[] { + new StructField("features", new VectorUDT(), false, Metadata.empty()), +}); +DataFrame df = jsql.createDataFrame(data, schema); +PCAModel pca = new PCA() + .setInputCol("features") + .setOutputCol("pcaFeatures") + .setK(3) + .fit(df); +DataFrame result = pca.transform(df).select("pcaFeatures"); +result.show(); +{% endhighlight %} +</div> + +<div data-lang="python" markdown="1"> +See the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.feature.PCA) for API details. +{% highlight python %} +from pyspark.ml.feature import PCA +from pyspark.mllib.linalg import Vectors + +data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),), + (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), + (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)] +df = sqlContext.createDataFrame(data, ["features"]) +pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures") +model = pca.fit(df) +result = model.transform(df).select("pcaFeatures") +result.show(truncate=False) +{% endhighlight %} +</div> +</div> + ## PolynomialExpansion
spark git commit: [SPARK-9544] [MLLIB] add Python API for RFormula
Repository: spark Updated Branches: refs/heads/branch-1.5 444058d91 - dc0c8c982 [SPARK-9544] [MLLIB] add Python API for RFormula Add Python API for RFormula. Similar to other feature transformers in Python. This is just a thin wrapper over the Scala implementation. ericl MechCoder Author: Xiangrui Meng m...@databricks.com Closes #7879 from mengxr/SPARK-9544 and squashes the following commits: 3d5ff03 [Xiangrui Meng] add an doctest for . and - 5e969a5 [Xiangrui Meng] fix pydoc 1cd41f8 [Xiangrui Meng] organize imports 3c18b10 [Xiangrui Meng] add Python API for RFormula (cherry picked from commit e4765a46833baff1dd7465c4cf50e947de7e8f21) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dc0c8c98 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dc0c8c98 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dc0c8c98 Branch: refs/heads/branch-1.5 Commit: dc0c8c982825c3c58b7c6c4570c03ba97dba608b Parents: 444058d Author: Xiangrui Meng m...@databricks.com Authored: Mon Aug 3 13:59:35 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 3 13:59:45 2015 -0700 -- .../org/apache/spark/ml/feature/RFormula.scala | 21 ++--- python/pyspark/ml/feature.py| 85 +++- 2 files changed, 91 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dc0c8c98/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index d172691..d5360c9 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -19,16 +19,14 @@ package org.apache.spark.ml.feature import scala.collection.mutable import scala.collection.mutable.ArrayBuffer -import scala.util.parsing.combinator.RegexParsers import org.apache.spark.annotation.Experimental -import org.apache.spark.ml.{Estimator, Model, Transformer, Pipeline, PipelineModel, PipelineStage} +import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer} import org.apache.spark.ml.param.{Param, ParamMap} import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} import org.apache.spark.ml.util.Identifiable import org.apache.spark.mllib.linalg.VectorUDT import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ /** @@ -63,31 +61,26 @@ class RFormula(override val uid: String) extends Estimator[RFormulaModel] with R */ val formula: Param[String] = new Param(this, "formula", "R model formula") - private var parsedFormula: Option[ParsedRFormula] = None - /** * Sets the formula to use for this transformer. Must be called before use. * @group setParam * @param value an R formula in string form (e.g. "y ~ x + z") */ - def setFormula(value: String): this.type = { -parsedFormula = Some(RFormulaParser.parse(value)) -set(formula, value) -this - } + def setFormula(value: String): this.type = set(formula, value) /** @group getParam */ def getFormula: String = $(formula) /** Whether the formula specifies fitting an intercept. */ private[ml] def hasIntercept: Boolean = { -require(parsedFormula.isDefined, "Must call setFormula() first.") -parsedFormula.get.hasIntercept +require(isDefined(formula), "Formula must be defined first.")
+RFormulaParser.parse($(formula)).hasIntercept } override def fit(dataset: DataFrame): RFormulaModel = { -require(parsedFormula.isDefined, "Must call setFormula() first.") -val resolvedFormula = parsedFormula.get.resolve(dataset.schema) +require(isDefined(formula), "Formula must be defined first.") +val parsedFormula = RFormulaParser.parse($(formula)) +val resolvedFormula = parsedFormula.resolve(dataset.schema) // StringType terms and terms representing interactions need to be encoded before assembly. // TODO(ekl) add support for feature interactions val encoderStages = ArrayBuffer[PipelineStage]() http://git-wip-us.apache.org/repos/asf/spark/blob/dc0c8c98/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 015e7a9..3f04c41 100644 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -24,7 +24,7 @@ from pyspark.mllib.common import inherit_doc __all__ = ['Binarizer', 'HashingTF', 'IDF', 'IDFModel', 'NGram', 'Normalizer', 'OneHotEncoder',
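For reference, the Scala API this wraps can be exercised as below (the DataFrame contents and column names are illustrative); the new Python wrapper mirrors these setters as constructor keyword arguments:

import org.apache.spark.ml.feature.RFormula

val df = sqlContext.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0)
)).toDF("id", "country", "hour", "clicked")

val formula = new RFormula()
  .setFormula("clicked ~ country + hour")  // per the diff, this now only sets the param;
  .setFeaturesCol("features")              // parsing is deferred to fit()
  .setLabelCol("label")

formula.fit(df).transform(df).select("features", "label").show()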
spark git commit: [SPARK-9544] [MLLIB] add Python API for RFormula
Repository: spark Updated Branches: refs/heads/master 8ca287ebb - e4765a468 [SPARK-9544] [MLLIB] add Python API for RFormula Add Python API for RFormula. Similar to other feature transformers in Python. This is just a thin wrapper over the Scala implementation. ericl MechCoder Author: Xiangrui Meng m...@databricks.com Closes #7879 from mengxr/SPARK-9544 and squashes the following commits: 3d5ff03 [Xiangrui Meng] add an doctest for . and - 5e969a5 [Xiangrui Meng] fix pydoc 1cd41f8 [Xiangrui Meng] organize imports 3c18b10 [Xiangrui Meng] add Python API for RFormula Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e4765a46 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e4765a46 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e4765a46 Branch: refs/heads/master Commit: e4765a46833baff1dd7465c4cf50e947de7e8f21 Parents: 8ca287e Author: Xiangrui Meng m...@databricks.com Authored: Mon Aug 3 13:59:35 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 3 13:59:35 2015 -0700 -- .../org/apache/spark/ml/feature/RFormula.scala | 21 ++--- python/pyspark/ml/feature.py| 85 +++- 2 files changed, 91 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e4765a46/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index d172691..d5360c9 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -19,16 +19,14 @@ package org.apache.spark.ml.feature import scala.collection.mutable import scala.collection.mutable.ArrayBuffer -import scala.util.parsing.combinator.RegexParsers import org.apache.spark.annotation.Experimental -import org.apache.spark.ml.{Estimator, Model, Transformer, Pipeline, PipelineModel, PipelineStage} +import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer} import org.apache.spark.ml.param.{Param, ParamMap} import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} import org.apache.spark.ml.util.Identifiable import org.apache.spark.mllib.linalg.VectorUDT import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ /** @@ -63,31 +61,26 @@ class RFormula(override val uid: String) extends Estimator[RFormulaModel] with R */ val formula: Param[String] = new Param(this, "formula", "R model formula") - private var parsedFormula: Option[ParsedRFormula] = None - /** * Sets the formula to use for this transformer. Must be called before use. * @group setParam * @param value an R formula in string form (e.g. "y ~ x + z") */ - def setFormula(value: String): this.type = { -parsedFormula = Some(RFormulaParser.parse(value)) -set(formula, value) -this - } + def setFormula(value: String): this.type = set(formula, value) /** @group getParam */ def getFormula: String = $(formula) /** Whether the formula specifies fitting an intercept. */ private[ml] def hasIntercept: Boolean = { -require(parsedFormula.isDefined, "Must call setFormula() first.") -parsedFormula.get.hasIntercept +require(isDefined(formula), "Formula must be defined first.") +RFormulaParser.parse($(formula)).hasIntercept } override def fit(dataset: DataFrame): RFormulaModel = { -require(parsedFormula.isDefined, "Must call setFormula() first.")
-val resolvedFormula = parsedFormula.get.resolve(dataset.schema) +require(isDefined(formula), "Formula must be defined first.") +val parsedFormula = RFormulaParser.parse($(formula)) +val resolvedFormula = parsedFormula.resolve(dataset.schema) // StringType terms and terms representing interactions need to be encoded before assembly. // TODO(ekl) add support for feature interactions val encoderStages = ArrayBuffer[PipelineStage]() http://git-wip-us.apache.org/repos/asf/spark/blob/e4765a46/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 015e7a9..3f04c41 100644 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -24,7 +24,7 @@ from pyspark.mllib.common import inherit_doc __all__ = ['Binarizer', 'HashingTF', 'IDF', 'IDFModel', 'NGram', 'Normalizer', 'OneHotEncoder', 'PolynomialExpansion', 'RegexTokenizer', 'StandardScaler', 'StandardScalerModel', 'StringIndexer', 'StringIndexerModel',
spark git commit: [SPARK-9191] [ML] [Doc] Add ml.PCA user guide and code examples
Repository: spark Updated Branches: refs/heads/branch-1.5 dc0c8c982 - e7329ab31 [SPARK-9191] [ML] [Doc] Add ml.PCA user guide and code examples Add ml.PCA user guide document and code examples for Scala/Java/Python. Author: Yanbo Liang yblia...@gmail.com Closes #7522 from yanboliang/ml-pca-md and squashes the following commits: 60dec05 [Yanbo Liang] address comments f992abe [Yanbo Liang] Add ml.PCA doc and examples (cherry picked from commit 8ca287ebbd58985a568341b08040d0efa9d3641a) Signed-off-by: Joseph K. Bradley jos...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e7329ab3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e7329ab3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e7329ab3 Branch: refs/heads/branch-1.5 Commit: e7329ab31323a89d1e07c808927e5543876e3ce3 Parents: dc0c8c9 Author: Yanbo Liang yblia...@gmail.com Authored: Mon Aug 3 13:58:00 2015 -0700 Committer: Joseph K. Bradley jos...@databricks.com Committed: Mon Aug 3 14:01:18 2015 -0700 -- docs/ml-features.md | 86 1 file changed, 86 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e7329ab3/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index 54068de..fa0ad1f 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -461,6 +461,92 @@ for binarized_feature, in binarizedFeatures.collect(): </div> </div> +## PCA + +[PCA](http://en.wikipedia.org/wiki/Principal_component_analysis) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A [PCA](api/scala/index.html#org.apache.spark.ml.feature.PCA) class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components. + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> +See the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.feature.PCA) for API details. +{% highlight scala %} +import org.apache.spark.ml.feature.PCA +import org.apache.spark.mllib.linalg.Vectors + +val data = Array( + Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))), + Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0), + Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0) +) +val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features") +val pca = new PCA() + .setInputCol("features") + .setOutputCol("pcaFeatures") + .setK(3) + .fit(df) +val pcaDF = pca.transform(df) +val result = pcaDF.select("pcaFeatures") +result.show() +{% endhighlight %} +</div> + +<div data-lang="java" markdown="1"> +See the [Java API documentation](api/java/org/apache/spark/ml/feature/PCA.html) for API details. +{% highlight java %} +import com.google.common.collect.Lists; + +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.ml.feature.PCA; +import org.apache.spark.ml.feature.PCAModel; +import org.apache.spark.mllib.linalg.VectorUDT; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; + +JavaSparkContext jsc = ... +SQLContext jsql = ...
+JavaRDD<Row> data = jsc.parallelize(Lists.newArrayList( + RowFactory.create(Vectors.sparse(5, new int[]{1, 3}, new double[]{1.0, 7.0})), + RowFactory.create(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)), + RowFactory.create(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)) +)); +StructType schema = new StructType(new StructField[] { + new StructField("features", new VectorUDT(), false, Metadata.empty()), +}); +DataFrame df = jsql.createDataFrame(data, schema); +PCAModel pca = new PCA() + .setInputCol("features") + .setOutputCol("pcaFeatures") + .setK(3) + .fit(df); +DataFrame result = pca.transform(df).select("pcaFeatures"); +result.show(); +{% endhighlight %} +</div> + +<div data-lang="python" markdown="1"> +See the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.feature.PCA) for API details. +{% highlight python %} +from pyspark.ml.feature import PCA +from pyspark.mllib.linalg import Vectors + +data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),), + (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), + (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)] +df = sqlContext.createDataFrame(data, ["features"]) +pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures") +model = pca.fit(df) +result =
spark git commit: [SPARK-9558][DOCS]Update docs to follow the increase of memory defaults.
Repository: spark Updated Branches: refs/heads/master ff9169a00 - ba1c4e138 [SPARK-9558][DOCS] Update docs to follow the increase of memory defaults. Now the memory defaults of the master and slave in Standalone mode and the History Server are 1g, not 512m. So let's update the docs. Author: Kousuke Saruta saru...@oss.nttdata.co.jp Closes #7896 from sarutak/update-doc-for-daemon-memory and squashes the following commits: a77626c [Kousuke Saruta] Fix docs to follow the update of increase of memory defaults Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ba1c4e13 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ba1c4e13 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ba1c4e13 Branch: refs/heads/master Commit: ba1c4e138de2ea84b55def4eed2bd363e60aea4d Parents: ff9169a Author: Kousuke Saruta saru...@oss.nttdata.co.jp Authored: Mon Aug 3 12:53:44 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 12:53:44 2015 -0700 -- conf/spark-env.sh.template | 1 + docs/monitoring.md | 2 +- docs/spark-standalone.md | 2 +- 3 files changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ba1c4e13/conf/spark-env.sh.template -- diff --git a/conf/spark-env.sh.template b/conf/spark-env.sh.template index 192d3ae..c05fe38 100755 --- a/conf/spark-env.sh.template +++ b/conf/spark-env.sh.template @@ -38,6 +38,7 @@ # - SPARK_WORKER_INSTANCES, to set the number of worker processes per node # - SPARK_WORKER_DIR, to set the working directory of worker processes # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y) +# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g). # - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. -Dx=y) # - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. -Dx=y) # - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. -Dx=y) http://git-wip-us.apache.org/repos/asf/spark/blob/ba1c4e13/docs/monitoring.md -- diff --git a/docs/monitoring.md b/docs/monitoring.md index bcf885f..cedceb2 100644 --- a/docs/monitoring.md +++ b/docs/monitoring.md @@ -48,7 +48,7 @@ follows: <tr><th style="width:21%">Environment Variable</th><th>Meaning</th></tr> <tr> <td><code>SPARK_DAEMON_MEMORY</code></td> -<td>Memory to allocate to the history server (default: 512m).</td> +<td>Memory to allocate to the history server (default: 1g).</td> </tr> <tr> <td><code>SPARK_DAEMON_JAVA_OPTS</code></td> http://git-wip-us.apache.org/repos/asf/spark/blob/ba1c4e13/docs/spark-standalone.md -- diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 4f71fbc..2fe9ec3 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -152,7 +152,7 @@ You can optionally configure the cluster further by setting environment variable </tr> <tr> <td><code>SPARK_DAEMON_MEMORY</code></td> -<td>Memory to allocate to the Spark master and worker daemons themselves (default: 512m).</td> +<td>Memory to allocate to the Spark master and worker daemons themselves (default: 1g).</td> </tr> <tr> <td><code>SPARK_DAEMON_JAVA_OPTS</code></td>
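For anyone who needs a value other than the new 1g default, the variable goes in conf/spark-env.sh as usual (the 2g figure below is just an example):

# conf/spark-env.sh
SPARK_DAEMON_MEMORY=2g   # memory for the master, worker, and history server daemons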
spark git commit: [SPARK-9558][DOCS]Update docs to follow the increase of memory defaults.
Repository: spark Updated Branches: refs/heads/branch-1.5 b3117d312 - 444058d91 [SPARK-9558][DOCS] Update docs to follow the increase of memory defaults. Now the memory defaults of the master and slave in Standalone mode and the History Server are 1g, not 512m. So let's update the docs. Author: Kousuke Saruta saru...@oss.nttdata.co.jp Closes #7896 from sarutak/update-doc-for-daemon-memory and squashes the following commits: a77626c [Kousuke Saruta] Fix docs to follow the update of increase of memory defaults (cherry picked from commit ba1c4e138de2ea84b55def4eed2bd363e60aea4d) Signed-off-by: Reynold Xin r...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/444058d9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/444058d9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/444058d9 Branch: refs/heads/branch-1.5 Commit: 444058d9158d426ae455208f07bf9c202e8f9925 Parents: b3117d3 Author: Kousuke Saruta saru...@oss.nttdata.co.jp Authored: Mon Aug 3 12:53:44 2015 -0700 Committer: Reynold Xin r...@databricks.com Committed: Mon Aug 3 12:53:58 2015 -0700 -- conf/spark-env.sh.template | 1 + docs/monitoring.md | 2 +- docs/spark-standalone.md | 2 +- 3 files changed, 3 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/444058d9/conf/spark-env.sh.template -- diff --git a/conf/spark-env.sh.template b/conf/spark-env.sh.template index 192d3ae..c05fe38 100755 --- a/conf/spark-env.sh.template +++ b/conf/spark-env.sh.template @@ -38,6 +38,7 @@ # - SPARK_WORKER_INSTANCES, to set the number of worker processes per node # - SPARK_WORKER_DIR, to set the working directory of worker processes # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. -Dx=y) +# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g). # - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. -Dx=y) # - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. -Dx=y) # - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. -Dx=y) http://git-wip-us.apache.org/repos/asf/spark/blob/444058d9/docs/monitoring.md -- diff --git a/docs/monitoring.md b/docs/monitoring.md index bcf885f..cedceb2 100644 --- a/docs/monitoring.md +++ b/docs/monitoring.md @@ -48,7 +48,7 @@ follows: <tr><th style="width:21%">Environment Variable</th><th>Meaning</th></tr> <tr> <td><code>SPARK_DAEMON_MEMORY</code></td> -<td>Memory to allocate to the history server (default: 512m).</td> +<td>Memory to allocate to the history server (default: 1g).</td> </tr> <tr> <td><code>SPARK_DAEMON_JAVA_OPTS</code></td> http://git-wip-us.apache.org/repos/asf/spark/blob/444058d9/docs/spark-standalone.md -- diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md index 4f71fbc..2fe9ec3 100644 --- a/docs/spark-standalone.md +++ b/docs/spark-standalone.md @@ -152,7 +152,7 @@ You can optionally configure the cluster further by setting environment variable </tr> <tr> <td><code>SPARK_DAEMON_MEMORY</code></td> -<td>Memory to allocate to the Spark master and worker daemons themselves (default: 512m).</td> +<td>Memory to allocate to the Spark master and worker daemons themselves (default: 1g).</td> </tr> <tr> <td><code>SPARK_DAEMON_JAVA_OPTS</code></td>