[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96566171 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { } } +class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists) +if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) => + InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists) + +case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => + // Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde + // tables yet. + if (mode == SaveMode.Append) { +throw new AnalysisException( + "CTAS for hive serde tables does not support append semantics.") + } + + val dbName = tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase) + CreateHiveTableAsSelectCommand( +tableDesc.copy(identifier = tableDesc.identifier.copy(database = Some(dbName))), +query, +mode == SaveMode.Ignore) + } + + /** + * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule + * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to + * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and --- End diff -- I think this version is good for now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96549456 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = None --- End diff -- ok. Let's throw an exception at here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
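[Editor's note] For readers following the thread, below is a minimal sketch of what "throw an exception here" could look like in the `inferSchema` method quoted above. The exception type and message are assumptions, not the PR's final code; `prepareWrite` is elided because it is unchanged from the diff.

```scala
class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat {

  // Fail fast instead of silently returning None: the read path for Hive serde
  // tables is not implemented by this FileFormat yet.
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = {
    throw new UnsupportedOperationException(
      "inferSchema is not supported for hive data source.")
  }

  // prepareWrite(...) stays as in the diff above.
}
```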
[GitHub] spark issue #16628: Update known_translations for contributor names
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16628 cc @lw-lin
[GitHub] spark pull request #16628: Update known_translations for contributor names
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/16628

Update known_translations for contributor names

## What changes were proposed in this pull request?

Update known_translations per https://github.com/apache/spark/pull/16423#issuecomment-269739634

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark known_translations

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16628.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16628

commit bf2f532a8b7ef60d1542415c0d6dacd8571e7bef
Author: Yin Huai <yh...@databricks.com>
Date: 2017-01-18T00:50:08Z

Update known_translations for contributor names
[GitHub] spark issue #16423: Update known_translations for contributor names and also...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16423 Sure.
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96316695 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { } } +class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists) +if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) => + InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists) + +case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => + // Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde + // tables yet. + if (mode == SaveMode.Append) { +throw new AnalysisException( + "CTAS for hive serde tables does not support append semantics.") + } + + val dbName = tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase) + CreateHiveTableAsSelectCommand( +tableDesc.copy(identifier = tableDesc.identifier.copy(database = Some(dbName))), +query, +mode == SaveMode.Ignore) + } + + /** + * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule + * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to + * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and --- End diff -- This rule is in the same batch with PreprocessTableInsertion, right? If so, we cannot guarantee that PreprocessTableInsertion will always fire first for a command before InsertIntoTable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
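[Editor's note] To make the ordering concern concrete: the guard `hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query)` is what keeps `HiveAnalysis` from firing before `PreprocessTableInsertion` has normalized the plan. The sketch below is a rough illustration of what such a guard can verify, inferred from the call site in the diff; it is not the PR's exact implementation.

```scala
// Rough illustration only; the real helper in the PR may differ.
private def hasBeenPreprocessed(
    tableOutput: Seq[Attribute],
    partSchema: StructType,
    partSpec: Map[String, Option[String]],
    query: LogicalPlan): Boolean = {
  val partColNames = partSchema.map(_.name).toSet
  query.resolved &&
    // Every key in the partition spec must already be a normalized partition column name.
    partSpec.keys.forall(partColNames.contains) && {
      // The query must produce exactly the columns that are not statically partitioned.
      val staticPartCols = partSpec.filter(_._2.isDefined).keySet
      val expectedColumns = tableOutput.filterNot(a => staticPartCols.contains(a.name))
      expectedColumns.map(_.dataType) == query.output.map(_.dataType)
    }
}
```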
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96317272 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = None + + override def prepareWrite( + sparkSession: SparkSession, + job: Job, + options: Map[String, String], + dataSchema: StructType): OutputWriterFactory = { +val conf = job.getConfiguration +val tableDesc = fileSinkConf.getTableInfo +conf.set("mapred.output.format.class", tableDesc.getOutputFileFormatClassName) + +// Add table properties from storage handler to hadoopConf, so any custom storage +// handler settings can be set to hadoopConf +HiveTableUtil.configureJobPropertiesForStorageHandler(tableDesc, conf, false) +Utilities.copyTableJobPropertiesToConf(tableDesc, conf) + +// Avoid referencing the outer object. 
+val fileSinkConfSer = fileSinkConf +new OutputWriterFactory { + private val jobConf = new SerializableJobConf(new JobConf(conf)) + @transient private lazy val outputFormat = + jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]] + + override def getFileExtension(context: TaskAttemptContext): String = { +Utilities.getFileExtension(jobConf.value, fileSinkConfSer.getCompressed, outputFormat) + } + + override def newInstance( + path: String, + dataSchema: StructType, + context: TaskAttemptContext): OutputWriter = { +new HiveOutputWriter(path, fileSinkConfSer, jobConf.value, dataSchema) + } +} --- End diff -- Should we just create a class instead of using an anonymous class? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
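[Editor's note] A sketch of the named-class alternative being suggested, obtained by lifting the anonymous `OutputWriterFactory` from the diff into its own class. The name `HiveOutputWriterFactory` and its constructor are assumptions; `FileSinkDesc` is the `ShimFileSinkDesc` alias from the imports above.

```scala
// Hypothetical named factory replacing the anonymous class. prepareWrite would build it as
//   new HiveOutputWriterFactory(new SerializableJobConf(new JobConf(conf)), fileSinkConf)
// which also avoids capturing the enclosing HiveFileFormat instance.
class HiveOutputWriterFactory(
    jobConf: SerializableJobConf,
    fileSinkConf: FileSinkDesc) extends OutputWriterFactory {

  @transient private lazy val outputFormat =
    jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]

  override def getFileExtension(context: TaskAttemptContext): String = {
    Utilities.getFileExtension(jobConf.value, fileSinkConf.getCompressed, outputFormat)
  }

  override def newInstance(
      path: String,
      dataSchema: StructType,
      context: TaskAttemptContext): OutputWriter = {
    new HiveOutputWriter(path, fileSinkConf, jobConf.value, dataSchema)
  }
}
```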
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96317211 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = None + + override def prepareWrite( + sparkSession: SparkSession, + job: Job, + options: Map[String, String], + dataSchema: StructType): OutputWriterFactory = { --- End diff -- Want to comment the original source of code in this function? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96316863 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { } } +class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists) +if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) => + InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists) + +case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => + // Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde + // tables yet. + if (mode == SaveMode.Append) { +throw new AnalysisException( + "CTAS for hive serde tables does not support append semantics.") + } + + val dbName = tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase) + CreateHiveTableAsSelectCommand( +tableDesc.copy(identifier = tableDesc.identifier.copy(database = Some(dbName))), +query, +mode == SaveMode.Ignore) + } + + /** + * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule + * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to + * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and --- End diff -- Or, you mean that we use this function to determine if PreprocessTableInsertion has fired? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96316302 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala --- @@ -108,35 +108,30 @@ class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) { /** - * Returns the result as a hive compatible sequence of strings. For native commands, the - * execution is simply passed back to Hive. + * Returns the result as a hive compatible sequence of strings. This is for testing only. */ def hiveResultString(): Seq[String] = executedPlan match { case ExecutedCommandExec(desc: DescribeTableCommand) => - SQLExecution.withNewExecutionId(sparkSession, this) { --- End diff -- Explain the reason that `SQLExecution.withNewExecutionId(sparkSession, this)` is not needed? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96316982 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { } } +class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists) +if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) => + InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists) + +case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => + // Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde + // tables yet. + if (mode == SaveMode.Append) { +throw new AnalysisException( + "CTAS for hive serde tables does not support append semantics.") + } + + val dbName = tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase) + CreateHiveTableAsSelectCommand( +tableDesc.copy(identifier = tableDesc.identifier.copy(database = Some(dbName))), +query, +mode == SaveMode.Ignore) + } + + /** + * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule + * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to + * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and --- End diff -- Should this function actually be part of the resolved method of InsertIntoTable? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96317150 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = None --- End diff -- Is it safe to return None? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16528: [SPARK-19148][SQL] do not expose the external table conc...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16528 looks good to me. If possible, I'd like to get https://github.com/apache/spark/pull/16528/files#r96314156 reverted.
[GitHub] spark pull request #16528: [SPARK-19148][SQL] do not expose the external tab...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16528#discussion_r96314156 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -131,17 +131,15 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR return 0; """ }, - foldFunctions = { funCalls => -funCalls.zipWithIndex.map { case (funCall, i) => - val comp = ctx.freshName("comp") - s""" -int $comp = $funCall; -if ($comp != 0) { - return $comp; -} - """ -}.mkString - }) + foldFunctions = _.map { funCall => --- End diff -- How about we revert this change? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16561: [SPARK-18801][SQL][FOLLOWUP] Alias the view with ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16561#discussion_r96313495 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -28,22 +28,60 @@ import org.apache.spark.sql.catalyst.rules.Rule */ /** - * Make sure that a view's child plan produces the view's output attributes. We wrap the child - * with a Project and add an alias for each output attribute. The attributes are resolved by - * name. This should be only done after the batch of Resolution, because the view attributes are - * not completely resolved during the batch of Resolution. + * Make sure that a view's child plan produces the view's output attributes. We try to wrap the + * child by: + * 1. Generate the `queryOutput` by: + *1.1. If the query column names are defined, map the column names to attributes in the child + * output by name(This is mostly for handling view queries like SELECT * FROM ..., the + * schema of the referenced table/view may change after the view has been created, so we + * have to save the output of the query to `viewQueryColumnNames`, and restore them during + * view resolution, in this way, we are able to get the correct view column ordering and + * omit the extra columns that we don't require); + *1.2. Else set the child output attributes to `queryOutput`. + * 2. Map the `queryQutput` to view output by index, if the corresponding attributes don't match, + *try to up cast and alias the attribute in `queryOutput` to the attribute in the view output. + * 3. Add a Project over the child, with the new output generated by the previous steps. + * If the view output doesn't have the same number of columns neither with the child output, nor + * with the query column names, throw an AnalysisException. + * + * This should be only done after the batch of Resolution, because the view attributes are not + * completely resolved during the batch of Resolution. */ case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { -case v @ View(_, output, child) if child.resolved => +case v @ View(desc, output, child) if child.resolved && output != child.output => val resolver = conf.resolver - val newOutput = output.map { attr => -val originAttr = findAttributeByName(attr.name, child.output, resolver) -// The dataType of the output attributes may be not the same with that of the view output, -// so we should cast the attribute to the dataType of the view output attribute. If the -// cast can't perform, will throw an AnalysisException. -Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = attr.exprId, - qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata)) + val queryColumnNames = desc.viewQueryColumnNames + val queryOutput = if (queryColumnNames.nonEmpty) { +// If the view output doesn't have the same number of columns with the query column names, +// throw an AnalysisException. +if (output.length != queryColumnNames.length) { + throw new AnalysisException( +s"The view output ${output.mkString("[", ",", "]")} doesn't have the same number of " + + s"columns with the query column names ${queryColumnNames.mkString("[", ",", "]")}") +} +desc.viewQueryColumnNames.map { colName => + findAttributeByName(colName, child.output, resolver) +} + } else { +// For view created before Spark 2.2.0, the view text is already fully qualified, the plan +// output is the same with the view output. 
+child.output + } + // Map the attributes in the query output to the attributes in the view output by index. + val newOutput = output.zip(queryOutput).map { --- End diff -- Seems we need to check the size of `output` and `queryOutput`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
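[Editor's note] A small sketch of the size check requested here, mirroring the error already raised when the view output and `queryColumnNames` disagree in the diff. The exact message wording is an assumption.

```scala
// Guard the zip with an explicit size check, otherwise mismatched schemas would be
// silently truncated instead of failing analysis.
if (output.length != queryOutput.length) {
  throw new AnalysisException(
    s"The view output ${output.mkString("[", ",", "]")} doesn't have the same number of " +
      s"columns as the query output ${queryOutput.mkString("[", ",", "]")}")
}
val newOutput = output.zip(queryOutput).map { case (attr, originAttr) =>
  // Up-cast and alias each query attribute to the corresponding view output attribute,
  // as the surrounding rule already does.
  Alias(Cast(originAttr, attr.dataType), attr.name)(
    exprId = attr.exprId, qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata))
}
```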
[GitHub] spark issue #16597: [SPARK-19240][SQL] SET LOCATION is not allowed for table...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16597 I am not sure if it is worth breaking this behavior. If the table is a managed table, it is possible that existing behavior allows users to move a table from one managed place to another managed place (e.g. the location of a database is changed). It is not clear that breaking this behavior can give us any benefit.
[GitHub] spark issue #16568: [SPARK-18971][Core]Upgrade Netty to 4.0.43.Final
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16568 test this please
[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16233 @jiangxb1987 Once jira is back, let's create jiras to address follow-up issues (probably you have already done that before jira went down).
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95896683 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the batch of Resolution, because the view attributes are + * not completely resolved during the batch of Resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = output.map { attr => +val originAttr = findAttributeByName(attr.name, child.output, resolver) +// The dataType of the output attributes may be not the same with that of the view output, +// so we should cast the attribute to the dataType of the view output attribute. If the +// cast can't perform, will throw an AnalysisException. +Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = attr.exprId, + qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } --- End diff -- btw, I talked with @hvanhovell, the tricky case is when the view definition query is `select *`. So, we need first result the query and put a mapping from the view column name to the query output. Then, when we read the view back, even the column ordering of `select *` query is changed, we can still get the correct view column ordering. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
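[Editor's note] A hypothetical spark-shell session illustrating the `SELECT *` case described in this comment; table and view names are made up and not from the PR.

```scala
// Hypothetical session: why the view's query output must be recorded at creation time.
spark.sql("CREATE TABLE base (a INT, b INT, c INT)")
spark.sql("CREATE VIEW v (x, y, z) AS SELECT * FROM base")   // records query output a, b, c

// Later the underlying table is re-created with a different column order.
spark.sql("DROP TABLE base")
spark.sql("CREATE TABLE base (c INT, a INT, b INT)")

// When v is re-resolved from its stored text "SELECT * FROM base", only the saved
// view-column-name -> query-output mapping (x -> a, y -> b, z -> c) lets the analyzer
// keep the original meaning of x, y, z instead of following the new physical order.
spark.table("v").printSchema()
```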
[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16233 left https://github.com/apache/spark/pull/16233/files#r95669988 and https://github.com/apache/spark/pull/16233/files#r95662299. I think we need to address them before we switch the code path to this new approach. But, we can still merge this patch and address those in follow-up prs.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95669988 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the batch of Resolution, because the view attributes are + * not completely resolved during the batch of Resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = output.map { attr => +val originAttr = findAttributeByName(attr.name, child.output, resolver) +// The dataType of the output attributes may be not the same with that of the view output, +// so we should cast the attribute to the dataType of the view output attribute. If the +// cast can't perform, will throw an AnalysisException. +Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = attr.exprId, + qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata)) --- End diff -- If the data type does not match, I feel we should throw an exception instead of just adding the cast at here. When the data type is changed, the meaning of the column may be changed. So, the user who defines the view may need to make some actions like modifying the applications using the view. I feel throwing an exception is better because it lets users know that the column data type is changed and some actions are required. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
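[Editor's note] A sketch of the alternative this comment proposes: reject an incompatible column type during analysis instead of hiding it behind a cast. The error message and the use of `sameType` (type comparison ignoring nullability) are assumptions, not code from the PR.

```scala
val newOutput = output.map { attr =>
  val originAttr = findAttributeByName(attr.name, child.output, resolver)
  if (!originAttr.dataType.sameType(attr.dataType)) {
    // The underlying query no longer produces the type recorded for this view column,
    // so tell the user to re-create the view rather than casting implicitly.
    throw new AnalysisException(
      s"The data type of view column '${attr.name}' (${attr.dataType.simpleString}) is " +
        s"different from that of the underlying query column " +
        s"(${originAttr.dataType.simpleString}). Please re-create the view.")
  }
  Alias(originAttr, attr.name)(
    exprId = attr.exprId, qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata))
}
```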
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95662299 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the batch of Resolution, because the view attributes are + * not completely resolved during the batch of Resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = output.map { attr => +val originAttr = findAttributeByName(attr.name, child.output, resolver) +// The dataType of the output attributes may be not the same with that of the view output, +// so we should cast the attribute to the dataType of the view output attribute. If the +// cast can't perform, will throw an AnalysisException. +Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = attr.exprId, + qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } --- End diff -- Sorry, I may miss something obvious. But, I am still not sure it is the right thing to do. I tried the following in the postgres ``` yhuai=# create table testbase (a int, b int, c int); CREATE TABLE yhuai=# insert into testbase values (1, 2, 3) ; INSERT 0 1 yhuai=# insert into testbase values (4, 5, 6); INSERT 0 1 yhuai=# create view testview (b, c, a) as select a, b, c from testbase; CREATE VIEW yhuai=# select * from testview; b | c | a ---+---+--- 1 | 2 | 3 4 | 5 | 6 (2 rows) yhuai=# create view testview1 (b, c, a) as select * from testbase; CREATE VIEW yhuai=# select * from testview1; b | c | a ---+---+--- 1 | 2 | 3 4 | 5 | 6 (2 rows) ``` I am not sure why we are resolving those columns by name. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark issue #16521: [SPARK-19139][core] New auth mechanism for transport lib...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16521 @vanzin I have not reviewed this PR yet. Just have two high-level questions. Is there any change to existing behaviors and settings (compared with Spark 2.1)? Also, does our doc have enough content to explain how to set those confs and how they work? Thanks!
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95427571 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`ã`view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`ã`view2`ã`view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. +// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1 +val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase, + nestedViewLevel = nestedViewLevel) +// Resolve all the UnresolvedRelations and Views in the child. 
+val newChild = AnalysisContext.withAnalysisContext(context) { + execute(child) +} +view.copy(child = newChild) + case p @ SubqueryAlias(_, view: View, _) => +val newChild = resolveRelation(view, defaultDatabase) +p.copy(child = newChild) + case _ => plan +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case u: UnresolvedRelation => resolveRelation(u) +} + +// Look up the table with the given name from catalog. The database we look up the table from +// is decided follow the steps: +// 1. If the database part is defined in the table identifier, use that database name; +// 2. Else If the defaultDatabase is defined, use the default database name(In this case, no +//temporary objects can be used, and the default database is only used to look up a view); +// 3. Else use the currentDb of the SessionCatalog. +private def lookupTableFromCatalog( +u: UnresolvedRelation, +defaultDatabase: Option[String] = None): LogicalPlan = { try { -catalog.lookupRelation(u.tableIdentifier, u.alias) +val tableIdentWithDb = u.tableIdentifier.withDataba
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95427132 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`ã`view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`ã`view2`ã`view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. +// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1 +val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase, + nestedViewLevel = nestedViewLevel) +// Resolve all the UnresolvedRelations and Views in the child. 
+val newChild = AnalysisContext.withAnalysisContext(context) { + execute(child) +} +view.copy(child = newChild) + case p @ SubqueryAlias(_, view: View, _) => +val newChild = resolveRelation(view, defaultDatabase) +p.copy(child = newChild) + case _ => plan +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case u: UnresolvedRelation => resolveRelation(u) +} + +// Look up the table with the given name from catalog. The database we look up the table from +// is decided follow the steps: +// 1. If the database part is defined in the table identifier, use that database name; +// 2. Else If the defaultDatabase is defined, use the default database name(In this case, no +//temporary objects can be used, and the default database is only used to look up a view); +// 3. Else use the currentDb of the SessionCatalog. +private def lookupTableFromCatalog( +u: UnresolvedRelation, +defaultDatabase: Option[String] = None): LogicalPlan = { try { -catalog.lookupRelation(u.tableIdentifier, u.alias) +val tableIdentWithDb = u.tableIdentifier.withDataba
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95426835 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`ã`view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`ã`view2`ã`view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. +// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) --- End diff -- We are calling resolveRelation again because `relation` may be a view? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95421315 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -767,19 +857,19 @@ class Analyzer( } } - /** - * In many dialects of SQL it is valid to use ordinal positions in order/sort by and group by - * clauses. This rule is to convert ordinal positions to the corresponding expressions in the - * select list. This support is introduced in Spark 2.0. - * - * - When the sort references or group by expressions are not integer but foldable expressions, - * just ignore them. - * - When spark.sql.orderByOrdinal/spark.sql.groupByOrdinal is set to false, ignore the position - * numbers too. - * - * Before the release of Spark 2.0, the literals in order/sort by and group by clauses - * have no effect on the results. - */ + /** + * In many dialects of SQL it is valid to use ordinal positions in order/sort by and group by + * clauses. This rule is to convert ordinal positions to the corresponding expressions in the + * select list. This support is introduced in Spark 2.0. + * + * - When the sort references or group by expressions are not integer but foldable expressions, + * just ignore them. + * - When spark.sql.orderByOrdinal/spark.sql.groupByOrdinal is set to false, ignore the position + * numbers too. + * + * Before the release of Spark 2.0, the literals in order/sort by and group by clauses + * have no effect on the results. + */ --- End diff -- what's going on with these lines? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95425724 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -619,7 +642,11 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat var table = inputTable -if (table.tableType != VIEW) { +if (table.tableType == VIEW) { + // Read view default database from table properties. + val viewDefaultDatabase = table.properties.get(VIEW_DEFAULT_DATABASE) --- End diff -- We may not have VIEW_DEFAULT_DATABASE in the properties for views that are defined in older versions of spark, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
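On the backward-compatibility question above, the essential point is that the property lookup has to stay optional. A minimal sketch, assuming `table` is the CatalogTable being restored; the property key string is written out as an assumption rather than quoting the constant from the PR:

    // Views written by older Spark versions simply do not carry the property, so the
    // read must produce None instead of failing; the fallback is the fully qualified
    // relation names already stored in the view text.
    val VIEW_DEFAULT_DATABASE = "view.default.database"   // assumed key
    val viewDefaultDatabase: Option[String] = table.properties.get(VIEW_DEFAULT_DATABASE)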
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95422202 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) --- End diff -- When will newAttr and attr have different data types? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95424456 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " + +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { --- End diff -- Since this function is used once, I'd inline it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
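A sketch of what the suggested inlining could look like, following the code quoted in the diff above (the error message text is an assumption, since the original body of checkDataType is cut off):

    val newOutput = child.output.map { attr =>
      val newAttr = findAttributeByName(attr.name, output, resolver)
      // Inlined type check: fail analysis if the child's type cannot be cast to the
      // view's declared type for this column.
      if (!Cast.canCast(attr.dataType, newAttr.dataType)) {
        throw new AnalysisException(
          s"Cannot cast ${attr.name} from ${attr.dataType} to ${newAttr.dataType}")
      }
      Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier,
        explicitMetadata = Some(newAttr.metadata))
    }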
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95424381 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => --- End diff -- Is there any chance that output and child have different sizes? Also, it looks like we are trying to match columns by name; can you explain the reason? Why are we not matching columns by position? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95424897 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " + +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { --- End diff -- btw, I do not see where we actually do the cast. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
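To make the question above concrete: the rule only verifies Cast.canCast and then aliases the original attribute, so if the child ever produced a different (but castable) type, the alias would have to wrap the attribute in an explicit Cast to keep the view's declared schema. An illustrative fragment, not code from the PR:

    // Hypothetical variant that actually performs the cast the check allows for.
    Alias(Cast(attr, newAttr.dataType), attr.name)(
      exprId = newAttr.exprId,
      qualifier = newAttr.qualifier,
      explicitMetadata = Some(newAttr.metadata))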
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16487 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16296 Merged to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables
Repository: spark Updated Branches: refs/heads/master f5d18af6a -> cca945b6a [SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables ## What changes were proposed in this pull request? Today we have different syntax to create data source or hive serde tables, we should unify them to not confuse users and step forward to make hive a data source. Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details. TODO(for follow-up PRs): 1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later. 2. `SHOW CREATE TABLE` should be updated to use the new syntax. 3. we should decide if we wanna change the behavior of `SET LOCATION`. ## How was this patch tested? new tests Author: Wenchen FanCloses #16296 from cloud-fan/create-table. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cca945b6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cca945b6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cca945b6 Branch: refs/heads/master Commit: cca945b6aa679e61864c1cabae91e6ae7703362e Parents: f5d18af Author: Wenchen Fan Authored: Thu Jan 5 17:40:27 2017 -0800 Committer: Yin Huai Committed: Thu Jan 5 17:40:27 2017 -0800 -- docs/sql-programming-guide.md | 60 +-- .../examples/sql/hive/JavaSparkHiveExample.java | 2 +- examples/src/main/python/sql/hive.py| 2 +- examples/src/main/r/RSparkSQLExample.R | 2 +- .../examples/sql/hive/SparkHiveExample.scala| 2 +- .../apache/spark/sql/catalyst/parser/SqlBase.g4 | 10 +- .../sql/catalyst/util/CaseInsensitiveMap.scala | 2 + .../spark/sql/execution/SparkSqlParser.scala| 57 +- .../spark/sql/execution/SparkStrategies.scala | 6 +- .../spark/sql/execution/command/ddl.scala | 6 +- .../spark/sql/execution/datasources/rules.scala | 7 +- .../apache/spark/sql/internal/HiveSerDe.scala | 84 +-- .../sql/execution/SparkSqlParserSuite.scala | 3 +- .../sql/execution/command/DDLCommandSuite.scala | 79 +++--- .../spark/sql/hive/HiveExternalCatalog.scala| 4 +- .../spark/sql/hive/HiveSessionState.scala | 1 + .../apache/spark/sql/hive/HiveStrategies.scala | 73 +++-- .../spark/sql/hive/execution/HiveOptions.scala | 102 ++ .../spark/sql/hive/orc/OrcFileOperator.scala| 2 +- .../spark/sql/hive/HiveDDLCommandSuite.scala| 107 +-- .../sql/hive/HiveExternalCatalogSuite.scala | 2 +- .../sql/hive/MetastoreDataSourcesSuite.scala| 15 --- .../spark/sql/hive/execution/HiveDDLSuite.scala | 39 +++ .../sql/hive/orc/OrcHadoopFsRelationSuite.scala | 2 - 24 files changed, 526 insertions(+), 143 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cca945b6/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index 4cd21ae..0f6e344 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -522,14 +522,11 @@ Hive metastore. Persistent tables will still exist even after your Spark program long as you maintain your connection to the same metastore. A DataFrame for a persistent table can be created by calling the `table` method on a `SparkSession` with the name of the table. -By default `saveAsTable` will create a "managed table", meaning that the location of the data will -be controlled by the metastore. Managed tables will also have their data deleted automatically -when a table is dropped. - -Currently, `saveAsTable` does not expose an API supporting the creation of an "external table" from a `DataFrame`. 
-However, this functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key -and location of the external table as its value (a string) when saving the table with `saveAsTable`. When an External table -is dropped only its metadata is removed. +For file-based data source, e.g. text, parquet, json, etc. you can specify a custom table path via the +`path` option, e.g. `df.write.option("path", "/some/path").saveAsTable("t")`. When the table is dropped, +the custom table path will not be removed and the table data is still there. If no custom table path is +specifed, Spark will write data to a default table path under the warehouse directory. When the table is +dropped, the default table path will be removed too. Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits: @@ -954,6 +951,53 @@ adds
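The external-table behavior described in the updated guide comes down to whether a custom path is supplied. A short example of both cases, assuming an existing DataFrame `df` and illustrative table names:

    // Custom path: dropping the table later removes only the metadata, the files stay.
    df.write.option("path", "/some/path").saveAsTable("t_external")

    // No path: data goes to a default location under the warehouse directory and is
    // removed together with the table on DROP TABLE.
    df.write.saveAsTable("t_managed")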
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16296 LGTM. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94884267 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala --- @@ -77,4 +79,16 @@ object HiveSerDe { serdeMap.get(key) } + + def getDefaultStorage(conf: SQLConf): CatalogStorageFormat = { +val defaultStorageType = conf.getConfString("hive.default.fileformat", "textfile") +val defaultHiveSerde = sourceToSerDe(defaultStorageType) +CatalogStorageFormat.empty.copy( + inputFormat = defaultHiveSerde.flatMap(_.inputFormat) +.orElse(Some("org.apache.hadoop.mapred.TextInputFormat")), + outputFormat = defaultHiveSerde.flatMap(_.outputFormat) + .orElse(Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")), + serde = defaultHiveSerde.flatMap(_.serde) + .orElse(Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"))) --- End diff -- Does this version break any test? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
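For context on what getDefaultStorage reads: the default serde for new Hive serde tables follows hive.default.fileformat, falling back to the text format classes shown above when the setting is absent or unrecognized. A small usage sketch; whether a session-level SET is honored for this particular lookup is an assumption based on the conf.getConfString call in the diff, and the table name is illustrative.

    // With no setting (or "textfile") the text-format defaults above apply; changing
    // the conf changes the serde picked for subsequently created Hive serde tables.
    spark.sql("SET hive.default.fileformat=parquet")
    spark.sql("CREATE TABLE t_default_fmt(id INT) USING hive")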
[GitHub] spark issue #16479: [SPARK-19085][SQL] cleanup OutputWriterFactory and Outpu...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16479 What is the benefit of making these changes? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[1/3] spark-website git commit: Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted
Repository: spark-website Updated Branches: refs/heads/asf-site 426a68ba8 -> 46a7a8027 http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/screencasts/1-first-steps-with-spark.html -- diff --git a/site/screencasts/1-first-steps-with-spark.html b/site/screencasts/1-first-steps-with-spark.html index ac30748..8bcd8bb 100644 --- a/site/screencasts/1-first-steps-with-spark.html +++ b/site/screencasts/1-first-steps-with-spark.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/screencasts/2-spark-documentation-overview.html -- diff --git a/site/screencasts/2-spark-documentation-overview.html b/site/screencasts/2-spark-documentation-overview.html index b331b25..6d8f46e 100644 --- a/site/screencasts/2-spark-documentation-overview.html +++ b/site/screencasts/2-spark-documentation-overview.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/screencasts/3-transformations-and-caching.html -- diff --git a/site/screencasts/3-transformations-and-caching.html b/site/screencasts/3-transformations-and-caching.html index 7ab50f5..f7aa6b4 100644 --- a/site/screencasts/3-transformations-and-caching.html +++ b/site/screencasts/3-transformations-and-caching.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/screencasts/4-a-standalone-job-in-spark.html -- diff --git a/site/screencasts/4-a-standalone-job-in-spark.html b/site/screencasts/4-a-standalone-job-in-spark.html index 35cf6f0..d6cb311 100644 --- a/site/screencasts/4-a-standalone-job-in-spark.html +++ b/site/screencasts/4-a-standalone-job-in-spark.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/screencasts/index.html -- diff --git a/site/screencasts/index.html b/site/screencasts/index.html index bd9d33e..df951b8 100644 --- a/site/screencasts/index.html +++ b/site/screencasts/index.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/sitemap.xml -- diff --git a/site/sitemap.xml b/site/sitemap.xml index 47ed71f..1ed4c74 100644 --- a/site/sitemap.xml +++ b/site/sitemap.xml @@ -139,6 +139,10 @@ + http://spark.apache.org/news/spark-summit-east-2017-agenda-posted.html + weekly + + http://spark.apache.org/releases/spark-release-2-1-0.html weekly 
http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/sql/index.html -- diff --git a/site/sql/index.html
[3/3] spark-website git commit: Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted
Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/46a7a802 Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/46a7a802 Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/46a7a802 Branch: refs/heads/asf-site Commit: 46a7a802762fa2428265b422170821fc3fec3563 Parents: 426a68b Author: Yin HuaiAuthored: Wed Jan 4 18:27:47 2017 -0800 Committer: Yin Huai Committed: Wed Jan 4 18:27:47 2017 -0800 -- ...1-04-spark-summit-east-2017-agenda-posted.md | 15 ++ site/committers.html| 6 +- site/community.html | 6 +- site/contributing.html | 6 +- site/developer-tools.html | 6 +- site/documentation.html | 6 +- site/downloads.html | 6 +- site/examples.html | 6 +- site/faq.html | 6 +- site/graphx/index.html | 6 +- site/index.html | 6 +- site/mailing-lists.html | 6 +- site/mllib/index.html | 6 +- site/news/amp-camp-2013-registration-ope.html | 6 +- .../news/announcing-the-first-spark-summit.html | 6 +- .../news/fourth-spark-screencast-published.html | 6 +- site/news/index.html| 15 +- site/news/nsdi-paper.html | 6 +- site/news/one-month-to-spark-summit-2015.html | 6 +- .../proposals-open-for-spark-summit-east.html | 6 +- ...registration-open-for-spark-summit-east.html | 6 +- .../news/run-spark-and-shark-on-amazon-emr.html | 6 +- site/news/spark-0-6-1-and-0-5-2-released.html | 6 +- site/news/spark-0-6-2-released.html | 6 +- site/news/spark-0-7-0-released.html | 6 +- site/news/spark-0-7-2-released.html | 6 +- site/news/spark-0-7-3-released.html | 6 +- site/news/spark-0-8-0-released.html | 6 +- site/news/spark-0-8-1-released.html | 6 +- site/news/spark-0-9-0-released.html | 6 +- site/news/spark-0-9-1-released.html | 6 +- site/news/spark-0-9-2-released.html | 6 +- site/news/spark-1-0-0-released.html | 6 +- site/news/spark-1-0-1-released.html | 6 +- site/news/spark-1-0-2-released.html | 6 +- site/news/spark-1-1-0-released.html | 6 +- site/news/spark-1-1-1-released.html | 6 +- site/news/spark-1-2-0-released.html | 6 +- site/news/spark-1-2-1-released.html | 6 +- site/news/spark-1-2-2-released.html | 6 +- site/news/spark-1-3-0-released.html | 6 +- site/news/spark-1-4-0-released.html | 6 +- site/news/spark-1-4-1-released.html | 6 +- site/news/spark-1-5-0-released.html | 6 +- site/news/spark-1-5-1-released.html | 6 +- site/news/spark-1-5-2-released.html | 6 +- site/news/spark-1-6-0-released.html | 6 +- site/news/spark-1-6-1-released.html | 6 +- site/news/spark-1-6-2-released.html | 6 +- site/news/spark-1-6-3-released.html | 6 +- site/news/spark-2-0-0-released.html | 6 +- site/news/spark-2-0-1-released.html | 6 +- site/news/spark-2-0-2-released.html | 6 +- site/news/spark-2-1-0-released.html | 6 +- site/news/spark-2.0.0-preview.html | 6 +- .../spark-accepted-into-apache-incubator.html | 6 +- site/news/spark-and-shark-in-the-news.html | 6 +- site/news/spark-becomes-tlp.html| 6 +- site/news/spark-featured-in-wired.html | 6 +- .../spark-mailing-lists-moving-to-apache.html | 6 +- site/news/spark-meetups.html| 6 +- site/news/spark-screencasts-published.html | 6 +- site/news/spark-summit-2013-is-a-wrap.html | 6 +- site/news/spark-summit-2014-videos-posted.html | 6 +- site/news/spark-summit-2015-videos-posted.html | 6 +- site/news/spark-summit-agenda-posted.html | 6 +- .../spark-summit-east-2015-videos-posted.html | 6 +- .../spark-summit-east-2016-cfp-closing.html | 6 +- .../spark-summit-east-2017-agenda-posted.html | 220 +++ 
site/news/spark-summit-east-agenda-posted.html | 6 +- .../news/spark-summit-europe-agenda-posted.html | 6 +- site/news/spark-summit-europe.html | 6 +- .../spark-summit-june-2016-agenda-posted.html | 6 +- site/news/spark-tips-from-quantifind.html | 6 +-
[2/3] spark-website git commit: Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted
http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/news/spark-summit-2014-videos-posted.html -- diff --git a/site/news/spark-summit-2014-videos-posted.html b/site/news/spark-summit-2014-videos-posted.html index 3efd1db..03cbcd3 100644 --- a/site/news/spark-summit-2014-videos-posted.html +++ b/site/news/spark-summit-2014-videos-posted.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/news/spark-summit-2015-videos-posted.html -- diff --git a/site/news/spark-summit-2015-videos-posted.html b/site/news/spark-summit-2015-videos-posted.html index 8aed6ba..1a93256 100644 --- a/site/news/spark-summit-2015-videos-posted.html +++ b/site/news/spark-summit-2015-videos-posted.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/news/spark-summit-agenda-posted.html -- diff --git a/site/news/spark-summit-agenda-posted.html b/site/news/spark-summit-agenda-posted.html index 2697ece..354035f 100644 --- a/site/news/spark-summit-agenda-posted.html +++ b/site/news/spark-summit-agenda-posted.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/news/spark-summit-east-2015-videos-posted.html -- diff --git a/site/news/spark-summit-east-2015-videos-posted.html b/site/news/spark-summit-east-2015-videos-posted.html index 84771e8..962aa1e 100644 --- a/site/news/spark-summit-east-2015-videos-posted.html +++ b/site/news/spark-summit-east-2015-videos-posted.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/news/spark-summit-east-2016-cfp-closing.html -- diff --git a/site/news/spark-summit-east-2016-cfp-closing.html b/site/news/spark-summit-east-2016-cfp-closing.html index 45e6385..cc43c32 100644 --- a/site/news/spark-summit-east-2016-cfp-closing.html +++ b/site/news/spark-summit-east-2016-cfp-closing.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/news/spark-summit-east-2017-agenda-posted.html -- diff --git a/site/news/spark-summit-east-2017-agenda-posted.html b/site/news/spark-summit-east-2017-agenda-posted.html new file mode 100644 index 000..58af016 --- /dev/null +++ b/site/news/spark-summit-east-2017-agenda-posted.html @@ -0,0 +1,220 @@ + + + + + + + + + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted | Apache Spark + 
+ + + + + + + + + + + + + + + + var _gaq = _gaq || []; + _gaq.push(['_setAccount', 'UA-32518208-2']); +
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94699866 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -18,14 +18,79 @@ package org.apache.spark.sql.hive import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.planning._ import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule import org.apache.spark.sql.execution._ -import org.apache.spark.sql.execution.command.ExecutedCommandExec +import org.apache.spark.sql.execution.command.{DDLUtils, ExecutedCommandExec} import org.apache.spark.sql.execution.datasources.CreateTable import org.apache.spark.sql.hive.execution._ +import org.apache.spark.sql.internal.{HiveSerDe, SQLConf} + + +/** + * Determine the serde/format of the Hive serde table, according to the storage properties. + */ +class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case c @ CreateTable(t, _, _) if DDLUtils.isHiveTable(t) && t.storage.inputFormat.isEmpty => + if (t.bucketSpec.nonEmpty) { +throw new AnalysisException("Cannot create bucketed Hive serde table.") + } + + val defaultStorage: CatalogStorageFormat = { +val defaultStorageType = conf.getConfString("hive.default.fileformat", "textfile") +val defaultHiveSerde = HiveSerDe.sourceToSerDe(defaultStorageType) +CatalogStorageFormat( + locationUri = None, + inputFormat = defaultHiveSerde.flatMap(_.inputFormat) +.orElse(Some("org.apache.hadoop.mapred.TextInputFormat")), + outputFormat = defaultHiveSerde.flatMap(_.outputFormat) + .orElse(Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")), + serde = defaultHiveSerde.flatMap(_.serde) + .orElse(Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")), + compressed = false, + properties = Map()) + } + + val options = new HiveOptions(t.storage.properties) + + val fileStorage = if (options.format.isDefined) { +HiveSerDe.sourceToSerDe(options.format.get) match { + case Some(s) => +CatalogStorageFormat.empty.copy( + inputFormat = s.inputFormat, + outputFormat = s.outputFormat, + serde = s.serde) + case None => +throw new IllegalArgumentException(s"invalid format: '${options.format.get}'") +} + } else if (options.inputFormat.isDefined) { --- End diff -- Maybe we should use a helper function to know if inputFormat and outputFormat are set? The current version assumes that the reader knows the internals of HiveOptions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
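One way to address the readability comment above is to let HiveOptions expose the check itself, so callers need not know that inputFormat and outputFormat always come as a pair. A simplified sketch of the suggested helper, not the merged code; case-insensitive key handling and the other options are omitted:

    class HiveOptions(parameters: Map[String, String]) extends Serializable {
      val inputFormat: Option[String] = parameters.get("inputFormat")
      val outputFormat: Option[String] = parameters.get("outputFormat")

      require(inputFormat.isDefined == outputFormat.isDefined,
        "Cannot specify only inputFormat or outputFormat, you have to specify both of them.")

      // Callers such as DetermineHiveSerde can branch on this instead of poking at
      // the individual fields.
      def hasInputOutputFormat: Boolean = inputFormat.isDefined
    }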
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94700062 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala --- @@ -592,4 +597,77 @@ class HiveDDLCommandSuite extends PlanTest with SQLTestUtils with TestHiveSingle val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client assert(hiveClient.getConf("hive.in.test", "") == "true") } + + test("create hive serde table with new syntax - basic") { +val sql = + """ +|CREATE TABLE t +|(id int, name string COMMENT 'blabla') +|USING hive +|OPTIONS (format 'parquet', my_prop 1) +|LOCATION '/tmp/file' +|COMMENT 'BLABLA' + """.stripMargin + +val table = analyzeCreateTable(sql) +assert(table.schema == new StructType() + .add("id", "int") + .add("name", "string", nullable = true, comment = "blabla")) +assert(table.provider == Some(DDLUtils.HIVE_PROVIDER)) +assert(table.storage.locationUri == Some("/tmp/file")) +assert(table.storage.properties == Map("my_prop" -> "1")) +assert(table.comment == Some("BLABLA")) + +assert(table.storage.inputFormat == + Some("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat")) +assert(table.storage.outputFormat == + Some("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat")) +assert(table.storage.serde == + Some("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")) + } + + test("create hive serde table with new syntax - with partition and bucketing") { +val v1 = "CREATE TABLE t (c1 int, c2 int) USING hive PARTITIONED BY (c2)" +val table = analyzeCreateTable(v1) +assert(table.schema == new StructType().add("c1", "int").add("c2", "int")) +assert(table.partitionColumnNames == Seq("c2")) +// check the default formats +assert(table.storage.serde == Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")) +assert(table.storage.inputFormat == Some("org.apache.hadoop.mapred.TextInputFormat")) +assert(table.storage.outputFormat == + Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")) + +val v2 = "CREATE TABLE t (c1 int, c2 int) USING hive CLUSTERED BY (c2) INTO 4 BUCKETS" +val e = intercept[AnalysisException](analyzeCreateTable(v2)) +assert(e.message.contains("Cannot create bucketed Hive serde table")) --- End diff -- Let's also have a test using both partitioning and bucketing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
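A sketch of the extra case the comment asks for, following the style of the surrounding test code quoted above (the expected message is assumed to be the same one checked for the bucketing-only case):

    // Bucketing should be rejected even when partitioning is also present.
    val v3 = "CREATE TABLE t (c1 int, c2 int, c3 int) USING hive " +
      "PARTITIONED BY (c3) CLUSTERED BY (c2) INTO 4 BUCKETS"
    val e2 = intercept[AnalysisException](analyzeCreateTable(v3))
    assert(e2.message.contains("Cannot create bucketed Hive serde table"))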
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94698030 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -342,42 +342,46 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder { } /** - * Create a data source table, returning a [[CreateTable]] logical plan. + * Create a table, returning a [[CreateTable]] logical plan. * * Expected format: * {{{ - * CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name + * CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name * USING table_provider * [OPTIONS table_property_list] * [PARTITIONED BY (col_name, col_name, ...)] * [CLUSTERED BY (col_name, col_name, ...) *[SORTED BY (col_name [ASC|DESC], ...)] *INTO num_buckets BUCKETS * ] + * [LOCATION path] + * [COMMENT table_comment] * [AS select_statement]; * }}} */ - override def visitCreateTableUsing(ctx: CreateTableUsingContext): LogicalPlan = withOrigin(ctx) { + override def visitCreateTable(ctx: CreateTableContext): LogicalPlan = withOrigin(ctx) { val (table, temp, ifNotExists, external) = visitCreateTableHeader(ctx.createTableHeader) if (external) { operationNotAllowed("CREATE EXTERNAL TABLE ... USING", ctx) } -val options = Option(ctx.tablePropertyList).map(visitPropertyKeyValues).getOrElse(Map.empty) +val options = Option(ctx.options).map(visitPropertyKeyValues).getOrElse(Map.empty) val provider = ctx.tableProvider.qualifiedName.getText -if (provider.toLowerCase == DDLUtils.HIVE_PROVIDER) { - throw new AnalysisException("Cannot create hive serde table with CREATE TABLE USING") -} val schema = Option(ctx.colTypeList()).map(createSchema) val partitionColumnNames = Option(ctx.partitionColumnNames) .map(visitIdentifierList(_).toArray) .getOrElse(Array.empty[String]) val bucketSpec = Option(ctx.bucketSpec()).map(visitBucketSpec) -// TODO: this may be wrong for non file-based data source like JDBC, which should be external -// even there is no `path` in options. We should consider allow the EXTERNAL keyword. +val location = Option(ctx.locationSpec).map(visitLocationSpec) val storage = DataSource.buildStorageFormatFromOptions(options) -val tableType = if (storage.locationUri.isDefined) { + +if (location.isDefined && storage.locationUri.isDefined) { + throw new ParseException("Cannot specify LOCATION when there is 'path' in OPTIONS.", ctx) --- End diff -- Let's be more specific here. These two approaches are the same and we only want users to use one, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
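To spell out the redundancy that check guards against: the two statements below are equivalent ways to pin the table location, which is why giving both forms at once is rejected. Table names and paths are illustrative, and an active SparkSession `spark` is assumed.

    spark.sql("CREATE TABLE t1(id INT) USING parquet OPTIONS (path '/data/t')")
    spark.sql("CREATE TABLE t2(id INT) USING parquet LOCATION '/data/t'")

    // Combining them is ambiguous at best, so it fails with a ParseException:
    // CREATE TABLE t3(id INT) USING parquet OPTIONS (path '/data/t') LOCATION '/other/path'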
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94699288 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap + +/** + * Options for the Hive data source. + */ +class HiveOptions(@transient private val parameters: CaseInsensitiveMap) extends Serializable { + import HiveOptions._ + + def this(parameters: Map[String, String]) = this(new CaseInsensitiveMap(parameters)) + + val format = parameters.get(FORMAT).map(_.toLowerCase) --- End diff -- file format? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94699572 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap + +/** + * Options for the Hive data source. + */ +class HiveOptions(@transient private val parameters: CaseInsensitiveMap) extends Serializable { + import HiveOptions._ + + def this(parameters: Map[String, String]) = this(new CaseInsensitiveMap(parameters)) + + val format = parameters.get(FORMAT).map(_.toLowerCase) + val inputFormat = parameters.get(INPUT_FORMAT) + val outputFormat = parameters.get(OUTPUT_FORMAT) + + if (inputFormat.isDefined != outputFormat.isDefined) { +throw new IllegalArgumentException("Cannot specify only inputFormat or outputFormat, you " + + "have to specify both of them.") + } + + if (format.isDefined && inputFormat.isDefined) { +throw new IllegalArgumentException("Cannot specify format and inputFormat/outputFormat " + + "together for Hive data source.") + } + + val serde = parameters.get(SERDE) + + for (f <- format if serde.isDefined) { --- End diff -- maybe using if is easier to read? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
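The plain `if` the comment prefers would read roughly as follows; the loop body is cut off in the diff above, so only its shape is sketched here:

    if (format.isDefined && serde.isDefined) {
      val f = format.get
      // ... same validation the original `for` comprehension performs on `f` ...
    }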
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94699072 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap + +/** + * Options for the Hive data source. + */ +class HiveOptions(@transient private val parameters: CaseInsensitiveMap) extends Serializable { --- End diff -- Let's also mention that DetermineHiveSerde will fill in default values based on the file format. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94700334 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1198,4 +1198,23 @@ class HiveDDLSuite assert(e.message.contains("unknown is not a valid partition column")) } } + + test("create hive serde table with new syntax") { +withTable("t", "t2") { + sql("CREATE TABLE t(id int) USING hive OPTIONS(format 'orc')") + val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t")) + assert(DDLUtils.isHiveTable(table)) + assert(table.storage.serde == Some("org.apache.hadoop.hive.ql.io.orc.OrcSerde")) + assert(spark.table("t").collect().isEmpty) + + sql("INSERT INTO t SELECT 1") + checkAnswer(spark.table("t"), Row(1)) + + sql("CREATE TABLE t2 USING HIVE AS SELECT 1 AS c1, 'a' AS c2") + val table2 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t2")) + assert(DDLUtils.isHiveTable(table2)) + assert(table2.storage.serde == Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")) + checkAnswer(spark.table("t2"), Row(1, "a")) +} + } --- End diff -- Let's also exercise partitioning. Let's also test an ORC option. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
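A sketch of the additional coverage requested above, in the style of the existing block; the ORC option name, whether it is passed through as a storage property, and the exact assertions are assumptions rather than the merged test:

    withTable("t3") {
      sql(
        """
          |CREATE TABLE t3(id int, p int)
          |USING hive
          |OPTIONS(format 'orc', orc.compress 'ZLIB')
          |PARTITIONED BY (p)
        """.stripMargin)
      val table3 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t3"))
      // The table should still be a Hive serde table, carry the partition column,
      // and pick up the ORC serde from the format option.
      assert(DDLUtils.isHiveTable(table3))
      assert(table3.partitionColumnNames == Seq("p"))
      assert(table3.storage.serde == Some("org.apache.hadoop.hive.ql.io.orc.OrcSerde"))
    }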
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94697902 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -342,42 +342,46 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder { } /** - * Create a data source table, returning a [[CreateTable]] logical plan. + * Create a table, returning a [[CreateTable]] logical plan. * * Expected format: * {{{ - * CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name + * CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name --- End diff -- oh it is a typo --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94697485 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -342,42 +342,46 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder { } /** - * Create a data source table, returning a [[CreateTable]] logical plan. + * Create a table, returning a [[CreateTable]] logical plan. * * Expected format: * {{{ - * CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name + * CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name --- End diff -- Was `external` a typo here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-19072][SQL] codegen of Literal should not output boxed value
Repository: spark
Updated Branches: refs/heads/master b67b35f76 -> cbd11d235

[SPARK-19072][SQL] codegen of Literal should not output boxed value

## What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/16402 we made a mistake that, when double/float is infinity, the `Literal` codegen will output boxed value and cause wrong result. This PR fixes this by special handling infinity to not output boxed value.

## How was this patch tested?

new regression test

Author: Wenchen Fan

Closes #16469 from cloud-fan/literal.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cbd11d23
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cbd11d23
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cbd11d23
Branch: refs/heads/master
Commit: cbd11d235752d0ab30cfdbf2351cb3e68a123606
Parents: b67b35f
Author: Wenchen Fan
Authored: Tue Jan 3 22:40:14 2017 -0800
Committer: Yin Huai
Committed: Tue Jan 3 22:40:14 2017 -0800

--
 .../sql/catalyst/expressions/literals.scala   | 30 +---
 .../catalyst/expressions/PredicateSuite.scala |  5
 2 files changed, 24 insertions(+), 11 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/cbd11d23/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
index ab45c41..cb0c4d3 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
@@ -266,33 +266,41 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression {
   override def eval(input: InternalRow): Any = value

   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val javaType = ctx.javaType(dataType)
     // change the isNull and primitive to consts, to inline them
     if (value == null) {
       ev.isNull = "true"
-      ev.copy(s"final ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};")
+      ev.copy(s"final $javaType ${ev.value} = ${ctx.defaultValue(dataType)};")
     } else {
       ev.isNull = "false"
-      ev.value = dataType match {
-        case BooleanType | IntegerType | DateType => value.toString
+      dataType match {
+        case BooleanType | IntegerType | DateType =>
+          ev.copy(code = "", value = value.toString)
         case FloatType =>
           val v = value.asInstanceOf[Float]
           if (v.isNaN || v.isInfinite) {
-            ctx.addReferenceMinorObj(v)
+            val boxedValue = ctx.addReferenceMinorObj(v)
+            val code = s"final $javaType ${ev.value} = ($javaType) $boxedValue;"
+            ev.copy(code = code)
           } else {
-            s"${value}f"
+            ev.copy(code = "", value = s"${value}f")
           }
         case DoubleType =>
           val v = value.asInstanceOf[Double]
           if (v.isNaN || v.isInfinite) {
-            ctx.addReferenceMinorObj(v)
+            val boxedValue = ctx.addReferenceMinorObj(v)
+            val code = s"final $javaType ${ev.value} = ($javaType) $boxedValue;"
+            ev.copy(code = code)
           } else {
-            s"${value}D"
+            ev.copy(code = "", value = s"${value}D")
           }
-        case ByteType | ShortType => s"(${ctx.javaType(dataType)})$value"
-        case TimestampType | LongType => s"${value}L"
-        case other => ctx.addReferenceMinorObj(value, ctx.javaType(dataType))
+        case ByteType | ShortType =>
+          ev.copy(code = "", value = s"($javaType)$value")
+        case TimestampType | LongType =>
+          ev.copy(code = "", value = s"${value}L")
+        case other =>
+          ev.copy(code = "", value = ctx.addReferenceMinorObj(value, ctx.javaType(dataType)))
       }
-      ev.copy("")
     }
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/cbd11d23/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala

diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala
index 6fc3de1..6fe295c 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala
+++
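A minimal end-to-end sketch of the behavior this commit protects. This is illustrative only, not the regression test added to PredicateSuite; the local SparkSession setup and the column name are assumptions.

    import org.apache.spark.sql.SparkSession

    object InfinityLiteralCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[2]").appName("inf-literal").getOrCreate()
        import spark.implicits._

        // The infinite double below ends up as a Literal in the generated predicate.
        val df = Seq(Double.PositiveInfinity, 1.0).toDF("v")

        // Before the fix, codegen for an infinite literal emitted the boxed reference
        // object where a primitive was expected, which could give a wrong result;
        // with the fix this filter keeps exactly the infinite row.
        df.filter($"v" === Double.PositiveInfinity).show()

        spark.stop()
      }
    }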
[GitHub] spark issue #16469: [SPARK-19072][SQL] codegen of Literal should not output ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16469 LGTM. Merging to master.
[GitHub] spark issue #15829: [SPARK-18379][SQL] Make the parallelism of parallelParti...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15829 Sure. Thanks!
spark git commit: Update known_translations for contributor names and also fix a small issue in translate-contributors.py
Repository: spark Updated Branches: refs/heads/master dba81e1dc -> 63036aee2 Update known_translations for contributor names and also fix a small issue in translate-contributors.py ## What changes were proposed in this pull request? This PR updates dev/create-release/known_translations to add more contributor name mapping. It also fixes a small issue in translate-contributors.py ## How was this patch tested? manually tested Author: Yin Huai <yh...@databricks.com> Closes #16423 from yhuai/contributors. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/63036aee Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/63036aee Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/63036aee Branch: refs/heads/master Commit: 63036aee2271cdbb7032b51b2ac67edbcb82389e Parents: dba81e1 Author: Yin Huai <yh...@databricks.com> Authored: Thu Dec 29 14:20:56 2016 -0800 Committer: Yin Huai <yh...@databricks.com> Committed: Thu Dec 29 14:20:56 2016 -0800 -- dev/create-release/known_translations| 37 +++ dev/create-release/translate-contributors.py | 4 ++- 2 files changed, 40 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/63036aee/dev/create-release/known_translations -- diff --git a/dev/create-release/known_translations b/dev/create-release/known_translations index 3563fe3..0f30990 100644 --- a/dev/create-release/known_translations +++ b/dev/create-release/known_translations @@ -165,3 +165,40 @@ stanzhai - Stan Zhai tien-dungle - Tien-Dung Le xuchenCN - Xu Chen zhangjiajin - Zhang JiaJin +ClassNotFoundExp - Fu Xing +KevinGrealish - Kevin Grealish +MasterDDT - Mitesh Patel +VinceShieh - Vincent Xie +WeichenXu123 - Weichen Xu +Yunni - Yun Ni +actuaryzhang - Wayne Zhang +alicegugu - Gu Huiqin Alice +anabranch - Bill Chambers +ashangit - Nicolas Fraison +avulanov - Alexander Ulanov +biglobster - Liang Ke +cenyuhai - Cen Yu Hai +codlife - Jianfei Wang +david-weiluo-ren - Weiluo (David) Ren +dding3 - Ding Ding +fidato13 - Tarun Kumar +frreiss - Fred Reiss +gatorsmile - Xiao Li +hayashidac - Chie Hayashida +invkrh - Hao Ren +jagadeesanas2 - Jagadeesan A S +jiangxb1987 - Jiang Xingbo +jisookim0513 - Jisoo Kim +junyangq - Junyang Qian +krishnakalyan3 - Krishna Kalyan +linbojin - Linbo Jin +mpjlu - Peng Meng +neggert - Nic Eggert +petermaxlee - Peter Lee +phalodi - Sandeep Purohit +pkch - pkch +priyankagargnitk - Priyanka Garg +sharkdtu - Sharkd Tu +shenh062326 - Shen Hong +aokolnychyi - Anton Okolnychyi +linbojin - Linbo Jin http://git-wip-us.apache.org/repos/asf/spark/blob/63036aee/dev/create-release/translate-contributors.py -- diff --git a/dev/create-release/translate-contributors.py b/dev/create-release/translate-contributors.py index 86fa02d..2cc64e4 100755 --- a/dev/create-release/translate-contributors.py +++ b/dev/create-release/translate-contributors.py @@ -147,7 +147,9 @@ print "\n== Translating contributor list === lines = contributors_file.readlines() contributions = [] for i, line in enumerate(lines): -temp_author = line.strip(" * ").split(" -- ")[0] +# It is possible that a line in the contributor file only has the github name, e.g. yhuai. +# So, we need a strip() to remove the newline. 
+    temp_author = line.strip(" * ").split(" -- ")[0].strip()
     print "Processing author %s (%d/%d)" % (temp_author, i + 1, len(lines))
     if not temp_author:
         error_msg = "ERROR: Expected the following format \" * -- \"\n"
[GitHub] spark issue #16423: Update known_translations for contributor names and also...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16423 Thanks. Merging to master.
spark-website git commit: Fix the list of previous spark summits.
Repository: spark-website Updated Branches: refs/heads/asf-site e10180e67 -> 426a68ba8 Fix the list of previous spark summits. Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/426a68ba Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/426a68ba Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/426a68ba Branch: refs/heads/asf-site Commit: 426a68ba8cd6efeaffdacaa0d2b645c5c5ac6a5e Parents: e10180e Author: Yin HuaiAuthored: Thu Dec 29 13:07:47 2016 -0800 Committer: Yin Huai Committed: Thu Dec 29 13:07:47 2016 -0800 -- community.md| 19 ++- site/community.html | 19 ++- 2 files changed, 28 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/426a68ba/community.md -- diff --git a/community.md b/community.md index d887d31..d8ae250 100644 --- a/community.md +++ b/community.md @@ -88,19 +88,28 @@ Chat rooms are great for quick questions or discussions on specialized topics. T Conferences -https://spark-summit.org/;>Spark Summit Europe 2015. Oct 27 - Oct 29 in Amsterdam. +https://spark-summit.org/eu-2016/;>Spark Summit Europe 2016. Oct 25 - 27 in Brussels. -http://spark-summit.org/2015;>Spark Summit 2015. June 15 - 17 in San Francisco. +https://spark-summit.org/2016/;>Spark Summit 2016. June 6 - 8 in San Francisco. -http://spark-summit.org/east;>Spark Summit East 2015. March 18 - 19 in New York City. +https://spark-summit.org/east-2016/;>Spark Summit East 2016. Feb 16 - 18 in New York City. -http://spark-summit.org/2014;>Spark Summit 2014. June 30 - July 1 2014 in San Francisco. +https://spark-summit.org/eu-2015/;>Spark Summit Europe 2015. Oct 27 - 29 in Amsterdam. -http://spark-summit.org/2013;>Spark Summit 2013. December 2013 in San Francisco. +https://spark-summit.org/2015;>Spark Summit 2015. June 15 - 17 in San Francisco. + + +https://spark-summit.org/east-2015/;>Spark Summit East 2015. March 18 - 19 in New York City. + + +https://spark-summit.org/2014;>Spark Summit 2014. June 30 - July 1 2014 in San Francisco. + + +https://spark-summit.org/2013;>Spark Summit 2013. December 2013 in San Francisco. http://git-wip-us.apache.org/repos/asf/spark-website/blob/426a68ba/site/community.html -- diff --git a/site/community.html b/site/community.html index 7a38701..79604fb 100644 --- a/site/community.html +++ b/site/community.html @@ -286,19 +286,28 @@ and include only a few lines of the pertinent code / log within the email. Conferences -https://spark-summit.org/;>Spark Summit Europe 2015. Oct 27 - Oct 29 in Amsterdam. +https://spark-summit.org/eu-2016/;>Spark Summit Europe 2016. Oct 25 - 27 in Brussels. -http://spark-summit.org/2015;>Spark Summit 2015. June 15 - 17 in San Francisco. +https://spark-summit.org/2016/;>Spark Summit 2016. June 6 - 8 in San Francisco. -http://spark-summit.org/east;>Spark Summit East 2015. March 18 - 19 in New York City. +https://spark-summit.org/east-2016/;>Spark Summit East 2016. Feb 16 - 18 in New York City. -http://spark-summit.org/2014;>Spark Summit 2014. June 30 - July 1 2014 in San Francisco. +https://spark-summit.org/eu-2015/;>Spark Summit Europe 2015. Oct 27 - 29 in Amsterdam. -http://spark-summit.org/2013;>Spark Summit 2013. December 2013 in San Francisco. +https://spark-summit.org/2015;>Spark Summit 2015. June 15 - 17 in San Francisco. + + +https://spark-summit.org/east-2015/;>Spark Summit East 2015. March 18 - 19 in New York City. + + +https://spark-summit.org/2014;>Spark Summit 2014. 
June 30 - July 1 2014 in San Francisco. + + +https://spark-summit.org/2013;>Spark Summit 2013. December 2013 in San Francisco. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[2/5] spark-website git commit: Update Spark website for the release of Apache Spark 2.1.0
http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/releases/spark-release-0-9-1.html -- diff --git a/site/releases/spark-release-0-9-1.html b/site/releases/spark-release-0-9-1.html index 80401c4..5b08a0b 100644 --- a/site/releases/spark-release-0-9-1.html +++ b/site/releases/spark-release-0-9-1.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -168,9 +171,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive @@ -210,9 +210,9 @@ Fixed hash collision bug in external spilling [https://issues.apache.org/jira/browse/SPARK-1113;>SPARK-1113] Fixed conflict with Sparkâs log4j for users relying on other logging backends [https://issues.apache.org/jira/browse/SPARK-1190;>SPARK-1190] Fixed Graphx missing from Spark assembly jar in maven builds - Fixed silent failures due to map output status exceeding Akka frame size [https://issues.apache.org/jira/browse/SPARK-1244;>SPARK-1244] - Removed Sparkâs unnecessary direct dependency on ASM [https://issues.apache.org/jira/browse/SPARK-782;>SPARK-782] - Removed metrics-ganglia from default build due to LGPL license conflict [https://issues.apache.org/jira/browse/SPARK-1167;>SPARK-1167] + Fixed silent failures due to map output status exceeding Akka frame size [https://issues.apache.org/jira/browse/SPARK-1244;>SPARK-1244] + Removed Sparkâs unnecessary direct dependency on ASM [https://issues.apache.org/jira/browse/SPARK-782;>SPARK-782] + Removed metrics-ganglia from default build due to LGPL license conflict [https://issues.apache.org/jira/browse/SPARK-1167;>SPARK-1167] Fixed bug in distribution tarball not containing spark assembly jar [https://issues.apache.org/jira/browse/SPARK-1184;>SPARK-1184] Fixed bug causing infinite NullPointerException failures due to a null in map output locations [https://issues.apache.org/jira/browse/SPARK-1124;>SPARK-1124] Fixed bugs in post-job cleanup of schedulerâs data structures @@ -228,7 +228,7 @@ Fixed bug making Spark application stall when YARN registration fails [https://issues.apache.org/jira/browse/SPARK-1032;>SPARK-1032] Race condition in getting HDFS delegation tokens in yarn-client mode [https://issues.apache.org/jira/browse/SPARK-1203;>SPARK-1203] Fixed bug in yarn-client mode not exiting properly [https://issues.apache.org/jira/browse/SPARK-1049;>SPARK-1049] - Fixed regression bug in ADD_JAR environment variable not correctly adding custom jars [https://issues.apache.org/jira/browse/SPARK-1089;>SPARK-1089] + Fixed regression bug in ADD_JAR environment variable not correctly adding custom jars [https://issues.apache.org/jira/browse/SPARK-1089;>SPARK-1089] Improvements to other deployment scenarios @@ -239,19 +239,19 @@ Optimizations to MLLib - Optimized memory usage of ALS [https://issues.apache.org/jira/browse/MLLIB-25;>MLLIB-25] + Optimized memory usage of ALS [https://issues.apache.org/jira/browse/MLLIB-25;>MLLIB-25] Optimized computation of YtY for implicit ALS [https://issues.apache.org/jira/browse/SPARK-1237;>SPARK-1237] Support for negative implicit input in ALS [https://issues.apache.org/jira/browse/MLLIB-22;>MLLIB-22] Setting of a random seed in ALS [https://issues.apache.org/jira/browse/SPARK-1238;>SPARK-1238] - Faster construction of features with intercept 
[https://issues.apache.org/jira/browse/SPARK-1260;>SPARK-1260] + Faster construction of features with intercept [https://issues.apache.org/jira/browse/SPARK-1260;>SPARK-1260] Check for intercept and weight in GLMâs addIntercept [https://issues.apache.org/jira/browse/SPARK-1327;>SPARK-1327] Bug fixes and better API parity for PySpark Fixed bug in Python de-pickling [https://issues.apache.org/jira/browse/SPARK-1135;>SPARK-1135] - Fixed bug in serialization of strings longer than 64K [https://issues.apache.org/jira/browse/SPARK-1043;>SPARK-1043] - Fixed bug that made jobs hang when base file is not available [https://issues.apache.org/jira/browse/SPARK-1025;>SPARK-1025] + Fixed bug in serialization of strings longer than 64K [https://issues.apache.org/jira/browse/SPARK-1043;>SPARK-1043] + Fixed bug that made jobs hang when base file is not available [https://issues.apache.org/jira/browse/SPARK-1025;>SPARK-1025] Added Missing RDD operations to PySpark - top, zip, foldByKey, repartition, coalesce, getStorageLevel, setName and
[4/5] spark-website git commit: Update Spark website for the release of Apache Spark 2.1.0
http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/mailing-lists.html -- diff --git a/site/mailing-lists.html b/site/mailing-lists.html index c113cdd..3e2334f 100644 --- a/site/mailing-lists.html +++ b/site/mailing-lists.html @@ -109,7 +109,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -162,6 +162,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -171,9 +174,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/mllib/index.html -- diff --git a/site/mllib/index.html b/site/mllib/index.html index e29228b..08f5dc4 100644 --- a/site/mllib/index.html +++ b/site/mllib/index.html @@ -109,7 +109,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -162,6 +162,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -171,9 +174,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/amp-camp-2013-registration-ope.html -- diff --git a/site/news/amp-camp-2013-registration-ope.html b/site/news/amp-camp-2013-registration-ope.html index b9d1aba..88d6d7d 100644 --- a/site/news/amp-camp-2013-registration-ope.html +++ b/site/news/amp-camp-2013-registration-ope.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -168,9 +171,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/announcing-the-first-spark-summit.html -- diff --git a/site/news/announcing-the-first-spark-summit.html b/site/news/announcing-the-first-spark-summit.html index 2215895..0c013dc 100644 --- a/site/news/announcing-the-first-spark-summit.html +++ b/site/news/announcing-the-first-spark-summit.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -168,9 +171,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/fourth-spark-screencast-published.html -- diff --git a/site/news/fourth-spark-screencast-published.html b/site/news/fourth-spark-screencast-published.html index fe28ecf..efa74d0 100644 --- a/site/news/fourth-spark-screencast-published.html +++ b/site/news/fourth-spark-screencast-published.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the
[1/5] spark-website git commit: Update Spark website for the release of Apache Spark 2.1.0
Repository: spark-website Updated Branches: refs/heads/asf-site d2bcf1854 -> e10180e67 http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/releases/spark-release-2-1-0.html -- diff --git a/site/releases/spark-release-2-1-0.html b/site/releases/spark-release-2-1-0.html new file mode 100644 index 000..53017ff --- /dev/null +++ b/site/releases/spark-release-2-1-0.html @@ -0,0 +1,370 @@ + + + + + + + + + Spark Release 2.1.0 | Apache Spark + + + + + + + + + + + + + + + + + var _gaq = _gaq || []; + _gaq.push(['_setAccount', 'UA-32518208-2']); + _gaq.push(['_trackPageview']); + (function() { +var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; +ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; +var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); + })(); + + + function trackOutboundLink(link, category, action) { +try { + _gaq.push(['_trackEvent', category , action]); +} catch(err){} + +setTimeout(function() { + document.location.href = link.href; +}, 100); + } + + + + + + + + +https://code.jquery.com/jquery.js"> +https://netdna.bootstrapcdn.com/bootstrap/3.0.3/js/bootstrap.min.js"> + + + + + + + + + + + Lightning-fast cluster computing + + + + + + + + + + Toggle navigation + + + + + + + + + + Download + + + Libraries + + + SQL and DataFrames + Spark Streaming + MLlib (machine learning) + GraphX (graph) + + Third-Party Projects + + + + + Documentation + + + Latest Release (Spark 2.1.0) + Older Versions and Other Resources + Frequently Asked Questions + + + Examples + + + Community + + + Mailing Lists Resources + Contributing to Spark + https://issues.apache.org/jira/browse/SPARK;>Issue Tracker + Powered By + Project Committers + + + + + Developers + + + Useful Developer Tools + Versioning Policy + Release Process + + + + + +http://www.apache.org/; class="dropdown-toggle" data-toggle="dropdown"> + Apache Software Foundation + + http://www.apache.org/;>Apache Homepage + http://www.apache.org/licenses/;>License + http://www.apache.org/foundation/sponsorship.html;>Sponsorship + http://www.apache.org/foundation/thanks.html;>Thanks + http://www.apache.org/security/;>Security + + + + + + + + + + + + Latest News + + + Spark 2.1.0 released + (Dec 28, 2016) + + Spark wins CloudSort Benchmark as the most efficient engine + (Nov 15, 2016) + + Spark 2.0.2 released + (Nov 14, 2016) + + Spark 1.6.3 released + (Nov 07, 2016) + + + Archive + + + +Download Spark + + +Built-in Libraries: + + +SQL and DataFrames +Spark Streaming +MLlib (machine learning) +GraphX (graph) + + Third-Party Projects + + + + +Spark Release 2.1.0 + + +Apache Spark 2.1.0 is the second release on the 2.x line. This release makes significant strides in the production readiness of Structured Streaming, with added support for event time watermarks and Kafka 0.10 support. In addition, this release focuses more on usability, stability, and polish, resolving over 1200 tickets. + +To download Apache Spark 2.1.0, visit the downloads page. You can consult JIRA for the https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420version=12335644;>detailed changes. We have curated a list of high level changes here, grouped by major modules. + + + Core and Spark SQL + Structured Streaming + MLlib + SparkR + GraphX + Deprecations + Changes of behavior + Known Issues + Credits + + +Core and Spark SQL + + + API updates + + SPARK-17864: Data type APIs are stable APIs. 
+ SPARK-18351: from_json and to_json for parsing JSON for string columns + SPARK-16700: When creating a DataFrame in PySpark, Python dictionaries can be used as values of a StructType. + + + Performance and stability + +
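Among the 2.1.0 changes listed in the release notes above, SPARK-18351 adds from_json and to_json for string columns. A small usage sketch follows; the schema, sample JSON, and object name are invented for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{from_json, to_json}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

    object JsonColumnFunctions {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[2]").appName("json-cols").getOrCreate()
        import spark.implicits._

        val schema = new StructType().add("a", IntegerType).add("b", StringType)
        val df = Seq("""{"a": 1, "b": "x"}""").toDF("json")

        // Parse the JSON string column into a struct column with the given schema...
        val parsed = df.select(from_json($"json", schema).as("parsed"))
        parsed.show(false)

        // ...and serialize the struct column back to a JSON string.
        parsed.select(to_json($"parsed")).show(false)

        spark.stop()
      }
    }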
[5/5] spark-website git commit: Update Spark website for the release of Apache Spark 2.1.0
Update Spark website for the release of Apache Spark 2.1.0 Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/e10180e6 Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/e10180e6 Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/e10180e6 Branch: refs/heads/asf-site Commit: e10180e6784e1d9d8771ef42481687aec0a423a2 Parents: d2bcf18 Author: Yin HuaiAuthored: Wed Dec 28 18:00:05 2016 -0800 Committer: Yin Huai Committed: Thu Dec 29 07:46:11 2016 -0800 -- _layouts/global.html| 2 +- documentation.md| 1 + downloads.md| 6 +- js/downloads.js | 1 + news/_posts/2016-12-28-spark-2-1-0-released.md | 14 + .../_posts/2016-12-28-spark-release-2-1-0.md| 120 ++ site/committers.html| 48 ++- site/community.html | 16 +- site/contributing.html | 28 +- site/developer-tools.html | 22 +- site/docs/latest| 2 +- site/documentation.html | 14 +- site/downloads.html | 14 +- site/examples.html | 100 ++--- site/faq.html | 8 +- site/graphx/index.html | 8 +- site/index.html | 8 +- site/js/downloads.js| 1 + site/mailing-lists.html | 8 +- site/mllib/index.html | 8 +- site/news/amp-camp-2013-registration-ope.html | 8 +- .../news/announcing-the-first-spark-summit.html | 8 +- .../news/fourth-spark-screencast-published.html | 8 +- site/news/index.html| 27 +- site/news/nsdi-paper.html | 8 +- site/news/one-month-to-spark-summit-2015.html | 8 +- .../proposals-open-for-spark-summit-east.html | 8 +- ...registration-open-for-spark-summit-east.html | 8 +- .../news/run-spark-and-shark-on-amazon-emr.html | 8 +- site/news/spark-0-6-1-and-0-5-2-released.html | 8 +- site/news/spark-0-6-2-released.html | 8 +- site/news/spark-0-7-0-released.html | 8 +- site/news/spark-0-7-2-released.html | 8 +- site/news/spark-0-7-3-released.html | 8 +- site/news/spark-0-8-0-released.html | 8 +- site/news/spark-0-8-1-released.html | 8 +- site/news/spark-0-9-0-released.html | 8 +- site/news/spark-0-9-1-released.html | 10 +- site/news/spark-0-9-2-released.html | 10 +- site/news/spark-1-0-0-released.html | 8 +- site/news/spark-1-0-1-released.html | 8 +- site/news/spark-1-0-2-released.html | 8 +- site/news/spark-1-1-0-released.html | 10 +- site/news/spark-1-1-1-released.html | 8 +- site/news/spark-1-2-0-released.html | 8 +- site/news/spark-1-2-1-released.html | 8 +- site/news/spark-1-2-2-released.html | 10 +- site/news/spark-1-3-0-released.html | 8 +- site/news/spark-1-4-0-released.html | 8 +- site/news/spark-1-4-1-released.html | 8 +- site/news/spark-1-5-0-released.html | 8 +- site/news/spark-1-5-1-released.html | 8 +- site/news/spark-1-5-2-released.html | 8 +- site/news/spark-1-6-0-released.html | 8 +- site/news/spark-1-6-1-released.html | 8 +- site/news/spark-1-6-2-released.html | 8 +- site/news/spark-1-6-3-released.html | 8 +- site/news/spark-2-0-0-released.html | 8 +- site/news/spark-2-0-1-released.html | 8 +- site/news/spark-2-0-2-released.html | 8 +- site/news/spark-2-1-0-released.html | 220 +++ site/news/spark-2.0.0-preview.html | 8 +- .../spark-accepted-into-apache-incubator.html | 8 +- site/news/spark-and-shark-in-the-news.html | 10 +- site/news/spark-becomes-tlp.html| 8 +- site/news/spark-featured-in-wired.html | 8 +- .../spark-mailing-lists-moving-to-apache.html | 8 +- site/news/spark-meetups.html| 8 +- site/news/spark-screencasts-published.html | 8 +- site/news/spark-summit-2013-is-a-wrap.html | 8 +- site/news/spark-summit-2014-videos-posted.html | 8 +- site/news/spark-summit-2015-videos-posted.html | 8 +- 
site/news/spark-summit-agenda-posted.html | 8 +- .../spark-summit-east-2015-videos-posted.html | 10 +-
[3/5] spark-website git commit: Update Spark website for the release of Apache Spark 2.1.0
http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/spark-2.0.0-preview.html -- diff --git a/site/news/spark-2.0.0-preview.html b/site/news/spark-2.0.0-preview.html index 64acf16..f135bf2 100644 --- a/site/news/spark-2.0.0-preview.html +++ b/site/news/spark-2.0.0-preview.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -168,9 +171,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/spark-accepted-into-apache-incubator.html -- diff --git a/site/news/spark-accepted-into-apache-incubator.html b/site/news/spark-accepted-into-apache-incubator.html index 57e4881..257e17e 100644 --- a/site/news/spark-accepted-into-apache-incubator.html +++ b/site/news/spark-accepted-into-apache-incubator.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -168,9 +171,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/spark-and-shark-in-the-news.html -- diff --git a/site/news/spark-and-shark-in-the-news.html b/site/news/spark-and-shark-in-the-news.html index 3994fe0..730a667 100644 --- a/site/news/spark-and-shark-in-the-news.html +++ b/site/news/spark-and-shark-in-the-news.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -168,9 +171,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive @@ -205,7 +205,7 @@ http://data-informed.com/spark-an-open-source-engine-for-iterative-data-mining/;>DataInformed interviewed two Spark users and wrote about their applications in anomaly detection, predictive analytics and data mining. -In other news, there will be a full day of tutorials on Spark and Shark at the http://strataconf.com/strata2013;>OReilly Strata conference in February. They include a three-hour http://strataconf.com/strata2013/public/schedule/detail/27438;>introduction to Spark, Shark and BDAS Tuesday morning, and a three-hour http://strataconf.com/strata2013/public/schedule/detail/27440;>hands-on exercise session. +In other news, there will be a full day of tutorials on Spark and Shark at the http://strataconf.com/strata2013;>OReilly Strata conference in February. They include a three-hour http://strataconf.com/strata2013/public/schedule/detail/27438;>introduction to Spark, Shark and BDAS Tuesday morning, and a three-hour http://strataconf.com/strata2013/public/schedule/detail/27440;>hands-on exercise session. 
http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/spark-becomes-tlp.html -- diff --git a/site/news/spark-becomes-tlp.html b/site/news/spark-becomes-tlp.html index 803c919..7f6d730 100644 --- a/site/news/spark-becomes-tlp.html +++ b/site/news/spark-becomes-tlp.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released +
[GitHub] spark issue #16233: [SPARK-18801][SQL] Add `View` operator to help resolve a...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16233 Seems it is good to know what issues need to be addressed before we can switch to this new approach.
[GitHub] spark issue #16233: [SPARK-18801][SQL] Add `View` operator to help resolve a...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16233 Changes look good to me. @gatorsmile @hvanhovell what do you think?
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16233#discussion_r94104622

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -510,32 +510,93 @@ class Analyzer(
        * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog.
        */
       object ResolveRelations extends Rule[LogicalPlan] {
    -    private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = {
    +
    +    // If the unresolved relation is running directly on files, we just return the original
    +    // UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog
    +    // and change the default database name if it is a view.
    +    // We usually look up a table from the default database if the table identifier has an empty
    +    // database part, for a view the default database should be the currentDb when the view was
    +    // created. When the case comes to resolving a nested view, the view may have different default
    +    // database with that the referenced view has, so we need to use the variable `defaultDatabase`
    +    // to track the current default database.
    +    // When the relation we resolve is a view, we fetch the view.desc (which is a CatalogTable), and
    +    // then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`,
    +    // we look up the relations that the view references using the default database.
    +    // For example:
    +    // |- view1 (defaultDatabase = db1)
    +    //   |- operator
    +    //     |- table2 (defaultDatabase = db1)
    +    //     |- view2 (defaultDatabase = db2)
    +    //        |- view3 (defaultDatabase = db3)
    +    //   |- view4 (defaultDatabase = db4)
    +    // In this case, the view `view1` is a nested view, it directly references `table2`, `view2`
    +    // and `view4`, the view `view2` references `view3`. On resolving the table, we look up the
    +    // relations `table2`, `view2`, `view4` using the default database `db1`, and look up the
    +    // relation `view3` using the default database `db2`.
    +    //
    +    // Note this is compatible with the views defined by older versions of Spark (before 2.2), which
    +    // have empty defaultDatabase and all the relations in viewText have database part defined.
    +    def resolveRelation(
    +        plan: LogicalPlan,
    +        defaultDatabase: Option[String] = None): LogicalPlan = plan match {
    +      case u @ UnresolvedRelation(table: TableIdentifier, _) if isRunningDirectlyOnFiles(table) =>
    +        u
    +      case u: UnresolvedRelation =>
    +        val relation = lookupTableFromCatalog(u, defaultDatabase)
    +        resolveRelation(relation, defaultDatabase)
    +      // Hive support is required to resolve a persistent view, the logical plan returned by
    +      // catalog.lookupRelation() should be:
    --- End diff --

    oh, we are not parsing view text when we do not have hive support.
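To make the review comment above concrete, here is a small SQL-level sketch of why the analyzer has to track a per-view default database. The database, table, and view names are invented, and persistent views are assumed to require Hive support, as the diff notes.

    import org.apache.spark.sql.SparkSession

    object NestedViewDefaultDatabase {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[2]")
          .appName("nested-views")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql("CREATE DATABASE IF NOT EXISTS db1")
        spark.sql("CREATE DATABASE IF NOT EXISTS db2")
        spark.sql("CREATE TABLE IF NOT EXISTS db2.t (id INT)")

        // v2 is created while db2 is current, so the unqualified `t` in its text means db2.t.
        spark.sql("USE db2")
        spark.sql("CREATE VIEW v2 AS SELECT id FROM t")

        // v1 is created in db1 and references db2.v2. When the analyzer later expands v2's
        // definition while resolving v1, the unqualified `t` inside v2 must still be looked
        // up against db2 (v2's default database), not against the caller's current database.
        spark.sql("USE db1")
        spark.sql("CREATE VIEW v1 AS SELECT id FROM db2.v2")
        spark.sql("SELECT * FROM v1").show()

        spark.stop()
      }
    }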
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r94103444 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +510,93 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default --- End diff -- oh, nvm, even if we have `database.viewname`, dbname and viewname will be inside the table identifier. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r94103404 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +510,93 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default --- End diff -- btw, will we allow cases like `CREATE VIEW database.viewname`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelectCommand
Repository: spark Updated Branches: refs/heads/master 93f35569f -> 7d19b6ab7 [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelectCommand ## What changes were proposed in this pull request? The `CreateDataSourceTableAsSelectCommand` is quite complex now, as it has a lot of work to do if the table already exists: 1. throw exception if we don't want to ignore it. 2. do some check and adjust the schema if we want to append data. 3. drop the table and create it again if we want to overwrite. The work 2 and 3 should be done by analyzer, so that we can also apply it to hive tables. ## How was this patch tested? existing tests. Author: Wenchen FanCloses #15996 from cloud-fan/append. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7d19b6ab Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7d19b6ab Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7d19b6ab Branch: refs/heads/master Commit: 7d19b6ab7d75b95d9eb1c7e1f228d23fd482306e Parents: 93f3556 Author: Wenchen Fan Authored: Wed Dec 28 21:50:21 2016 -0800 Committer: Yin Huai Committed: Wed Dec 28 21:50:21 2016 -0800 -- .../org/apache/spark/sql/DataFrameWriter.scala | 78 + .../command/createDataSourceTables.scala| 167 +-- .../spark/sql/execution/datasources/rules.scala | 164 +- .../sql/hive/MetastoreDataSourcesSuite.scala| 2 +- 4 files changed, 213 insertions(+), 198 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7d19b6ab/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala index 9c5660a..405f38a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala @@ -23,11 +23,12 @@ import scala.collection.JavaConverters._ import org.apache.spark.annotation.InterfaceStability import org.apache.spark.sql.catalyst.TableIdentifier -import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation +import org.apache.spark.sql.catalyst.analysis.{EliminateSubqueryAliases, UnresolvedRelation} import org.apache.spark.sql.catalyst.catalog.{BucketSpec, CatalogTable, CatalogTableType} import org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable import org.apache.spark.sql.execution.command.DDLUtils -import org.apache.spark.sql.execution.datasources.{CreateTable, DataSource} +import org.apache.spark.sql.execution.datasources.{CreateTable, DataSource, LogicalRelation} +import org.apache.spark.sql.sources.BaseRelation import org.apache.spark.sql.types.StructType /** @@ -364,7 +365,11 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { throw new AnalysisException("Cannot create hive serde table with saveAsTable API") } -val tableExists = df.sparkSession.sessionState.catalog.tableExists(tableIdent) +val catalog = df.sparkSession.sessionState.catalog +val tableExists = catalog.tableExists(tableIdent) +val db = tableIdent.database.getOrElse(catalog.getCurrentDatabase) +val tableIdentWithDB = tableIdent.copy(database = Some(db)) +val tableName = tableIdentWithDB.unquotedString (tableExists, mode) match { case (true, SaveMode.Ignore) => @@ -373,39 +378,48 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { case (true, SaveMode.ErrorIfExists) => throw new AnalysisException(s"Table $tableIdent already exists.") - case _ => -val existingTable = if (tableExists) { - 
Some(df.sparkSession.sessionState.catalog.getTableMetadata(tableIdent)) -} else { - None + case (true, SaveMode.Overwrite) => +// Get all input data source relations of the query. +val srcRelations = df.logicalPlan.collect { + case LogicalRelation(src: BaseRelation, _, _) => src } -val storage = if (tableExists) { - existingTable.get.storage -} else { - DataSource.buildStorageFormatFromOptions(extraOptions.toMap) -} -val tableType = if (tableExists) { - existingTable.get.tableType -} else if (storage.locationUri.isDefined) { - CatalogTableType.EXTERNAL -} else { - CatalogTableType.MANAGED +EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match { + // Only do the check if the table is a data source table (the relation is a BaseRelation). + case LogicalRelation(dest: BaseRelation, _, _) if
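As a usage-level sketch of the saveAsTable behaviors the commit message above describes (the table name and data are invented; this is not code from the patch):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object SaveAsTableModes {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[2]").appName("ctas-modes").getOrCreate()
        import spark.implicits._

        val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

        // Default mode is ErrorIfExists: creating the same table twice throws.
        df.write.saveAsTable("t")

        // Append must be compatible with the existing table's schema.
        df.write.mode(SaveMode.Append).saveAsTable("t")

        // Overwrite drops the existing table and recreates it from the query result.
        df.write.mode(SaveMode.Overwrite).saveAsTable("t")

        // Overwriting a table that the query also reads from is rejected with an
        // AnalysisException, e.g.:
        // spark.table("t").write.mode(SaveMode.Overwrite).saveAsTable("t")

        spark.stop()
      }
    }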
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15996 LGTM. Merging to master.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r94102704 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +510,93 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`ã`view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`ã`view2`ã`view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. +// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u @ UnresolvedRelation(table: TableIdentifier, _) if isRunningDirectlyOnFiles(table) => +u + case u: UnresolvedRelation => +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // Hive support is required to resolve a persistent view, the logical plan returned by + // catalog.lookupRelation() should be: --- End diff -- Sorry. I probably missed something. A persistent view is stored in the external catalog. So, we can always have persistent views, right? (we have a InMemoryCatalog, which is another external catalog. It is not very useful though. But it is still an external catalog) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[06/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/sql-programming-guide.html -- diff --git a/site/docs/2.1.0/sql-programming-guide.html b/site/docs/2.1.0/sql-programming-guide.html index 17f5981..4534a98 100644 --- a/site/docs/2.1.0/sql-programming-guide.html +++ b/site/docs/2.1.0/sql-programming-guide.html @@ -127,95 +127,95 @@ - Overview - SQL - Datasets and DataFrames + Overview + SQL + Datasets and DataFrames - Getting Started - Starting Point: SparkSession - Creating DataFrames - Untyped Dataset Operations (aka DataFrame Operations) - Running SQL Queries Programmatically - Global Temporary View - Creating Datasets - Interoperating with RDDs - Inferring the Schema Using Reflection - Programmatically Specifying the Schema + Getting Started + Starting Point: SparkSession + Creating DataFrames + Untyped Dataset Operations (aka DataFrame Operations) + Running SQL Queries Programmatically + Global Temporary View + Creating Datasets + Interoperating with RDDs + Inferring the Schema Using Reflection + Programmatically Specifying the Schema - Data Sources - Generic Load/Save Functions - Manually Specifying Options - Run SQL on files directly - Save Modes - Saving to Persistent Tables + Data Sources + Generic Load/Save Functions + Manually Specifying Options + Run SQL on files directly + Save Modes + Saving to Persistent Tables - Parquet Files - Loading Data Programmatically - Partition Discovery - Schema Merging - Hive metastore Parquet table conversion - Hive/Parquet Schema Reconciliation - Metadata Refreshing + Parquet Files + Loading Data Programmatically + Partition Discovery + Schema Merging + Hive metastore Parquet table conversion + Hive/Parquet Schema Reconciliation + Metadata Refreshing - Configuration + Configuration - JSON Datasets - Hive Tables - Interacting with Different Versions of Hive Metastore + JSON Datasets + Hive Tables + Interacting with Different Versions of Hive Metastore - JDBC To Other Databases - Troubleshooting + JDBC To Other Databases + Troubleshooting - Performance Tuning - Caching Data In Memory - Other Configuration Options + Performance Tuning + Caching Data In Memory + Other Configuration Options - Distributed SQL Engine - Running the Thrift JDBC/ODBC server - Running the Spark SQL CLI + Distributed SQL Engine + Running the Thrift JDBC/ODBC server + Running the Spark SQL CLI - Migration Guide - Upgrading From Spark SQL 2.0 to 2.1 - Upgrading From Spark SQL 1.6 to 2.0 - Upgrading From Spark SQL 1.5 to 1.6 - Upgrading From Spark SQL 1.4 to 1.5 - Upgrading from Spark SQL 1.3 to 1.4 - DataFrame data reader/writer interface - DataFrame.groupBy retains grouping columns - Behavior change on DataFrame.withColumn + Migration Guide + Upgrading From Spark SQL 2.0 to 2.1 + Upgrading From Spark SQL 1.6 to 2.0 + Upgrading From Spark SQL 1.5 to 1.6 + Upgrading From Spark SQL 1.4 to 1.5 + Upgrading from Spark SQL 1.3 to 1.4 + DataFrame data reader/writer interface + DataFrame.groupBy retains grouping columns + Behavior change on DataFrame.withColumn - Upgrading from Spark SQL 1.0-1.2 to 1.3 - Rename of SchemaRDD to DataFrame - Unification of the Java and Scala APIs - Isolation of Implicit Conversions and Removal of dsl Package (Scala-only) - Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only) - UDF Registration Moved to sqlContext.udf (Java Scala) - Python DataTypes No Longer Singletons + Upgrading from Spark SQL 1.0-1.2 to 1.3 + Rename of SchemaRDD to DataFrame + Unification of the Java and Scala APIs + Isolation of 
Implicit Conversions and Removal of dsl Package (Scala-only) + Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only) + UDF Registration Moved to sqlContext.udf (Java Scala) + Python DataTypes No Longer
[19/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-migration-guides.html -- diff --git a/site/docs/2.1.0/ml-migration-guides.html b/site/docs/2.1.0/ml-migration-guides.html index 5e8a913..24dfc31 100644 --- a/site/docs/2.1.0/ml-migration-guides.html +++ b/site/docs/2.1.0/ml-migration-guides.html @@ -344,21 +344,21 @@ for converting to mllib.linalg types. -import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.mllib.util.MLUtils // convert DataFrame columns val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF) val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF) // convert a single vector or matrix val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML -val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML +val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML Refer to the MLUtils Scala docs for further detail. -import org.apache.spark.mllib.util.MLUtils; +import org.apache.spark.mllib.util.MLUtils; import org.apache.spark.sql.Dataset; // convert DataFrame columns @@ -366,21 +366,21 @@ for converting to mllib.linalg types. DatasetRow convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF); // convert a single vector or matrix org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML(); -org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML(); +org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML(); Refer to the MLUtils Java docs for further detail. -from pyspark.mllib.util import MLUtils +from pyspark.mllib.util import MLUtils -# convert DataFrame columns +# convert DataFrame columns convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF) convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF) -# convert a single vector or matrix +# convert a single vector or matrix mlVec = mllibVec.asML() -mlMat = mllibMat.asML() +mlMat = mllibMat.asML() Refer to the MLUtils Python docs for further detail. http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-pipeline.html -- diff --git a/site/docs/2.1.0/ml-pipeline.html b/site/docs/2.1.0/ml-pipeline.html index fe17564..b57afde 100644 --- a/site/docs/2.1.0/ml-pipeline.html +++ b/site/docs/2.1.0/ml-pipeline.html @@ -331,27 +331,27 @@ machine learning pipelines. Table of Contents - Main concepts in Pipelines - DataFrame - Pipeline components - Transformers - Estimators - Properties of pipeline components + Main concepts in Pipelines + DataFrame + Pipeline components + Transformers + Estimators + Properties of pipeline components - Pipeline - How it works - Details + Pipeline + How it works + Details - Parameters - Saving and Loading Pipelines + Parameters + Saving and Loading Pipelines - Code examples - Example: Estimator, Transformer, and Param - Example: Pipeline - Model selection (hyperparameter tuning) + Code examples + Example: Estimator, Transformer, and Param + Example: Pipeline + Model selection (hyperparameter tuning) @@ -541,7 +541,7 @@ Refer to the [`Estimator` Scala docs](api/scala/index.html#org.apache.spark.ml.E the [`Transformer` Scala docs](api/scala/index.html#org.apache.spark.ml.Transformer) and the [`Params` Scala docs](api/scala/index.html#org.apache.spark.ml.param.Params) for details on the API. 
-import org.apache.spark.ml.classification.LogisticRegression +import org.apache.spark.ml.classification.LogisticRegression import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.param.ParamMap import org.apache.spark.sql.Row @@ -601,7 +601,7 @@ the [`Params` Scala docs](api/scala/index.html#org.apache.spark.ml.param.Params) .select(features, label, myProbability, prediction) .collect() .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) = -println(s($features, $label) - prob=$prob, prediction=$prediction) +println(s($features, $label) - prob=$prob, prediction=$prediction) } Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala" in the Spark repo. @@ -612,7 +612,7 @@ Refer to the [`Estimator` Java docs](api/java/org/apache/spark/ml/Estimator.html the [`Transformer` Java docs](api/java/org/apache/spark/ml/Transformer.html) and the [`Params` Java docs](api/java/org/apache/spark/ml/param/Params.html) for details on the API. -import java.util.Arrays; +import java.util.Arrays; import java.util.List; import
[01/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
Repository: spark-website Updated Branches: refs/heads/asf-site ecf94f284 -> d2bcf1854 http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/submitting-applications.html -- diff --git a/site/docs/2.1.0/submitting-applications.html b/site/docs/2.1.0/submitting-applications.html index fc18fa9..0c91739 100644 --- a/site/docs/2.1.0/submitting-applications.html +++ b/site/docs/2.1.0/submitting-applications.html @@ -151,14 +151,14 @@ packaging them into a .zip or .egg. This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports: -./bin/spark-submit \ +./bin/spark-submit \ --class main-class \ --master master-url \ --deploy-mode deploy-mode \ --conf key=value \ - ... # other options + ... # other options application-jar \ - [application-arguments] + [application-arguments] Some of the commonly used options are: @@ -194,23 +194,23 @@ you can also specify --supervise to make sure that the driver is au fails with non-zero exit code. To enumerate all such options available to spark-submit, run it with --help. Here are a few examples of common options: -# Run application locally on 8 cores +# Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ - --master local[8] \ + --master local[8] \ /path/to/examples.jar \ - 100 + 100 -# Run on a Spark standalone cluster in client deploy mode +# Run on a Spark standalone cluster in client deploy mode ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master spark://207.184.161.138:7077 \ --executor-memory 20G \ --total-executor-cores 100 \ /path/to/examples.jar \ - 1000 + 1000 -# Run on a Spark standalone cluster in cluster deploy mode with supervise +# Run on a Spark standalone cluster in cluster deploy mode with supervise ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master spark://207.184.161.138:7077 \ @@ -219,26 +219,26 @@ run it with --help. Here are a few examples of common options: --executor-memory 20G \ --total-executor-cores 100 \ /path/to/examples.jar \ - 1000 + 1000 -# Run on a YARN cluster -export HADOOP_CONF_DIR=XXX +# Run on a YARN cluster +export HADOOP_CONF_DIR=XXX ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master yarn \ - --deploy-mode cluster \ # can be client for client mode + --deploy-mode cluster \ # can be client for client mode --executor-memory 20G \ --num-executors 50 \ /path/to/examples.jar \ - 1000 + 1000 -# Run a Python application on a Spark standalone cluster +# Run a Python application on a Spark standalone cluster ./bin/spark-submit \ --master spark://207.184.161.138:7077 \ examples/src/main/python/pi.py \ - 1000 + 1000 -# Run on a Mesos cluster in cluster deploy mode with supervise +# Run on a Mesos cluster in cluster deploy mode with supervise ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master mesos://207.184.161.138:7077 \ @@ -247,7 +247,7 @@ run it with --help. 
Here are a few examples of common options: --executor-memory 20G \ --total-executor-cores 100 \ http://path/to/examples.jar \ - 1000 + 1000 Master URLs http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/tuning.html -- diff --git a/site/docs/2.1.0/tuning.html b/site/docs/2.1.0/tuning.html index ca4ad9f..33a6316 100644 --- a/site/docs/2.1.0/tuning.html +++ b/site/docs/2.1.0/tuning.html @@ -129,23 +129,23 @@ - Data Serialization - Memory Tuning - Memory Management Overview - Determining Memory Consumption - Tuning Data Structures - Serialized RDD Storage - Garbage Collection Tuning + Data Serialization + Memory Tuning + Memory Management Overview + Determining Memory Consumption + Tuning Data Structures + Serialized RDD Storage + Garbage Collection Tuning - Other Considerations - Level of Parallelism - Memory Usage of Reduce Tasks - Broadcasting Large Variables - Data Locality + Other Considerations + Level of Parallelism + Memory Usage of Reduce Tasks + Broadcasting Large Variables + Data Locality - Summary + Summary Because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked @@ -194,9 +194,9 @@ in the AllScalaRegistrar from the https://github.com/twitter/chill;>Twi To register your own custom classes with Kryo, use the registerKryoClasses method. -val conf = new SparkConf().setMaster(...).setAppName(...) +val conf = new SparkConf().setMaster(...).setAppName(...)
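[Editor's note] The Kryo snippet in the tuning.html hunk above is missing its quotes and class references. A cleaned-up sketch of registering custom classes with Kryo is below; MyClass1 and MyClass2 are placeholder types standing in for your own application classes.

import org.apache.spark.{SparkConf, SparkContext}

case class MyClass1(x: Int)
case class MyClass2(s: String)

val conf = new SparkConf().setMaster("local[*]").setAppName("KryoExample")
// Register application classes so Kryo does not have to store full class names with each object.
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)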
[25/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294 This version is built from the docs source code generated by applying https://github.com/apache/spark/pull/16294 to v2.1.0 (so, other changes in branch 2.1 will not affect the doc). Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/d2bcf185 Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/d2bcf185 Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/d2bcf185 Branch: refs/heads/asf-site Commit: d2bcf1854b0e0409495e2f1d3c6beaad923f6e6b Parents: ecf94f2 Author: Yin HuaiAuthored: Wed Dec 28 14:32:43 2016 -0800 Committer: Yin Huai Committed: Wed Dec 28 14:32:43 2016 -0800 -- site/docs/2.1.0/building-spark.html | 46 +- site/docs/2.1.0/building-with-maven.html| 14 +- site/docs/2.1.0/configuration.html | 52 +- site/docs/2.1.0/ec2-scripts.html| 174 site/docs/2.1.0/graphx-programming-guide.html | 198 ++--- site/docs/2.1.0/hadoop-provided.html| 14 +- .../img/structured-streaming-watermark.png | Bin 0 -> 252000 bytes site/docs/2.1.0/img/structured-streaming.pptx | Bin 1105413 -> 1113902 bytes site/docs/2.1.0/job-scheduling.html | 40 +- site/docs/2.1.0/ml-advanced.html| 10 +- .../2.1.0/ml-classification-regression.html | 838 +- site/docs/2.1.0/ml-clustering.html | 124 +-- site/docs/2.1.0/ml-collaborative-filtering.html | 56 +- site/docs/2.1.0/ml-features.html| 764 site/docs/2.1.0/ml-migration-guides.html| 16 +- site/docs/2.1.0/ml-pipeline.html| 178 ++-- site/docs/2.1.0/ml-tuning.html | 172 ++-- site/docs/2.1.0/mllib-clustering.html | 186 ++-- .../2.1.0/mllib-collaborative-filtering.html| 48 +- site/docs/2.1.0/mllib-data-types.html | 208 ++--- site/docs/2.1.0/mllib-decision-tree.html| 94 +- .../2.1.0/mllib-dimensionality-reduction.html | 28 +- site/docs/2.1.0/mllib-ensembles.html| 182 ++-- site/docs/2.1.0/mllib-evaluation-metrics.html | 302 +++ site/docs/2.1.0/mllib-feature-extraction.html | 122 +-- .../2.1.0/mllib-frequent-pattern-mining.html| 28 +- site/docs/2.1.0/mllib-isotonic-regression.html | 38 +- site/docs/2.1.0/mllib-linear-methods.html | 174 ++-- site/docs/2.1.0/mllib-naive-bayes.html | 24 +- site/docs/2.1.0/mllib-optimization.html | 50 +- site/docs/2.1.0/mllib-pmml-model-export.html| 35 +- site/docs/2.1.0/mllib-statistics.html | 180 ++-- site/docs/2.1.0/programming-guide.html | 302 +++ site/docs/2.1.0/quick-start.html| 166 ++-- site/docs/2.1.0/running-on-mesos.html | 52 +- site/docs/2.1.0/running-on-yarn.html| 27 +- site/docs/2.1.0/spark-standalone.html | 30 +- site/docs/2.1.0/sparkr.html | 145 ++-- site/docs/2.1.0/sql-programming-guide.html | 819 +- site/docs/2.1.0/storage-openstack-swift.html| 12 +- site/docs/2.1.0/streaming-custom-receivers.html | 26 +- .../2.1.0/streaming-kafka-0-10-integration.html | 52 +- .../docs/2.1.0/streaming-programming-guide.html | 416 - .../structured-streaming-kafka-integration.html | 44 +- .../structured-streaming-programming-guide.html | 864 --- site/docs/2.1.0/submitting-applications.html| 36 +- site/docs/2.1.0/tuning.html | 30 +- 47 files changed, 3926 insertions(+), 3490 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/building-spark.html -- diff --git a/site/docs/2.1.0/building-spark.html b/site/docs/2.1.0/building-spark.html index b3a720c..5c20245 100644 --- a/site/docs/2.1.0/building-spark.html +++ b/site/docs/2.1.0/building-spark.html @@ -127,33 +127,33 @@ - Building Apache Spark - Apache Maven - Setting up Mavens Memory 
Usage - build/mvn + Building Apache Spark + Apache Maven + Setting up Maven's Memory Usage + build/mvn - Building a Runnable Distribution - Specifying the Hadoop Version - Building With Hive and JDBC Support - Packaging without Hadoop Dependencies for YARN - Building with Mesos support - Building for Scala 2.10 - Building submodules individually - Continuous Compilation - Speeding up Compilation with Zinc - Building with SBT - Encrypted
[11/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-linear-methods.html -- diff --git a/site/docs/2.1.0/mllib-linear-methods.html b/site/docs/2.1.0/mllib-linear-methods.html index 46a1a25..428d778 100644 --- a/site/docs/2.1.0/mllib-linear-methods.html +++ b/site/docs/2.1.0/mllib-linear-methods.html @@ -307,23 +307,23 @@ - Mathematical formulation - Loss functions - Regularizers - Optimization + Mathematical formulation + Loss functions + Regularizers + Optimization - Classification - Linear Support Vector Machines (SVMs) - Logistic regression + Classification + Linear Support Vector Machines (SVMs) + Logistic regression - Regression - Linear least squares, Lasso, and ridge regression - Streaming linear regression + Regression + Linear least squares, Lasso, and ridge regression + Streaming linear regression - Implementation (developer) + Implementation (developer) \[ @@ -489,7 +489,7 @@ error. Refer to the SVMWithSGD Scala docs and SVMModel Scala docs for details on the API. -import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD} +import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD} import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics import org.apache.spark.mllib.util.MLUtils @@ -534,14 +534,14 @@ this way as well. For example, the following code produces an L1 regularized variant of SVMs with regularization parameter set to 0.1, and runs the training algorithm for 200 iterations. -import org.apache.spark.mllib.optimization.L1Updater +import org.apache.spark.mllib.optimization.L1Updater val svmAlg = new SVMWithSGD() svmAlg.optimizer .setNumIterations(200) .setRegParam(0.1) .setUpdater(new L1Updater) -val modelL1 = svmAlg.run(training) +val modelL1 = svmAlg.run(training) @@ -554,7 +554,7 @@ that is equivalent to the provided example in Scala is given below: Refer to the SVMWithSGD Java docs and SVMModel Java docs for details on the API. -import scala.Tuple2; +import scala.Tuple2; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.function.Function; @@ -591,7 +591,7 @@ that is equivalent to the provided example in Scala is given below: // Get evaluation metrics. BinaryClassificationMetrics metrics = - new BinaryClassificationMetrics(JavaRDD.toRDD(scoreAndLabels)); + new BinaryClassificationMetrics(JavaRDD.toRDD(scoreAndLabels)); double auROC = metrics.areaUnderROC(); System.out.println(Area under ROC = + auROC); @@ -610,14 +610,14 @@ this way as well. For example, the following code produces an L1 regularized variant of SVMs with regularization parameter set to 0.1, and runs the training algorithm for 200 iterations. -import org.apache.spark.mllib.optimization.L1Updater; +import org.apache.spark.mllib.optimization.L1Updater; -SVMWithSGD svmAlg = new SVMWithSGD(); +SVMWithSGD svmAlg = new SVMWithSGD(); svmAlg.optimizer() .setNumIterations(200) .setRegParam(0.1) - .setUpdater(new L1Updater()); -final SVMModel modelL1 = svmAlg.run(training.rdd()); + .setUpdater(new L1Updater()); +final SVMModel modelL1 = svmAlg.run(training.rdd()); In order to run the above application, follow the instructions provided in the Self-Contained @@ -632,28 +632,28 @@ and make predictions with the resulting model to compute the training error. Refer to the SVMWithSGD Python docs and SVMModel Python docs for more details on the API. 
-from pyspark.mllib.classification import SVMWithSGD, SVMModel +from pyspark.mllib.classification import SVMWithSGD, SVMModel from pyspark.mllib.regression import LabeledPoint -# Load and parse the data +# Load and parse the data def parsePoint(line): -values = [float(x) for x in line.split( )] +values = [float(x) for x in line.split( )] return LabeledPoint(values[0], values[1:]) -data = sc.textFile(data/mllib/sample_svm_data.txt) +data = sc.textFile(data/mllib/sample_svm_data.txt) parsedData = data.map(parsePoint) -# Build the model +# Build the model model = SVMWithSGD.train(parsedData, iterations=100) -# Evaluating the model on training data +# Evaluating the model on training data labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features))) trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count()) -print(Training Error = + str(trainErr)) +print(Training Error = + str(trainErr)) -# Save and load model -model.save(sc, target/tmp/pythonSVMWithSGDModel) -sameModel = SVMModel.load(sc, target/tmp/pythonSVMWithSGDModel) +# Save and load model +model.save(sc, target/tmp/pythonSVMWithSGDModel) +sameModel =
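[Editor's note] The SVM examples in this hunk (Scala, Java and Python) have their string literals stripped. A cleaned Scala sketch of training a linear SVM with SGD, then swapping in L1 regularization via the optimizer, is below; it assumes the LIBSVM sample data path used throughout the docs.

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.optimization.L1Updater
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
training.cache()

// Default L2-regularized hinge-loss SVM trained with SGD.
val model = SVMWithSGD.train(training, 100)
// Clear the default threshold so predict() returns raw scores.
model.clearThreshold()

// L1-regularized variant: regParam 0.1, 200 iterations.
val svmAlg = new SVMWithSGD()
svmAlg.optimizer
  .setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new L1Updater)
val modelL1 = svmAlg.run(training)

// Evaluate the default model on the held-out split.
val scoreAndLabels = test.map(point => (model.predict(point.features), point.label))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"Area under ROC = ${metrics.areaUnderROC()}")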
[23/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/hadoop-provided.html -- diff --git a/site/docs/2.1.0/hadoop-provided.html b/site/docs/2.1.0/hadoop-provided.html index ff7afb7..9d77cf0 100644 --- a/site/docs/2.1.0/hadoop-provided.html +++ b/site/docs/2.1.0/hadoop-provided.html @@ -133,16 +133,16 @@ Apache Hadoop For Apache distributions, you can use Hadoops classpath command. For instance: -### in conf/spark-env.sh ### +### in conf/spark-env.sh ### -# If hadoop binary is on your PATH -export SPARK_DIST_CLASSPATH=$(hadoop classpath) +# If hadoop binary is on your PATH +export SPARK_DIST_CLASSPATH=$(hadoop classpath) -# With explicit path to hadoop binary -export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath) +# With explicit path to hadoop binary +export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath) -# Passing a Hadoop configuration directory -export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath) +# Passing a Hadoop configuration directory +export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath) http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/img/structured-streaming-watermark.png -- diff --git a/site/docs/2.1.0/img/structured-streaming-watermark.png b/site/docs/2.1.0/img/structured-streaming-watermark.png new file mode 100644 index 000..f21fbda Binary files /dev/null and b/site/docs/2.1.0/img/structured-streaming-watermark.png differ http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/img/structured-streaming.pptx -- diff --git a/site/docs/2.1.0/img/structured-streaming.pptx b/site/docs/2.1.0/img/structured-streaming.pptx index 6aad2ed..f5bdfc0 100644 Binary files a/site/docs/2.1.0/img/structured-streaming.pptx and b/site/docs/2.1.0/img/structured-streaming.pptx differ http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/job-scheduling.html -- diff --git a/site/docs/2.1.0/job-scheduling.html b/site/docs/2.1.0/job-scheduling.html index 53161c2..9651607 100644 --- a/site/docs/2.1.0/job-scheduling.html +++ b/site/docs/2.1.0/job-scheduling.html @@ -127,24 +127,24 @@ - Overview - Scheduling Across Applications - Dynamic Resource Allocation - Configuration and Setup - Resource Allocation Policy - Request Policy - Remove Policy + Overview + Scheduling Across Applications + Dynamic Resource Allocation + Configuration and Setup + Resource Allocation Policy + Request Policy + Remove Policy - Graceful Decommission of Executors + Graceful Decommission of Executors - Scheduling Within an Application - Fair Scheduler Pools - Default Behavior of Pools - Configuring Pool Properties + Scheduling Within an Application + Fair Scheduler Pools + Default Behavior of Pools + Configuring Pool Properties @@ -321,9 +321,9 @@ mode is best for multi-user settings. To enable the fair scheduler, simply set the spark.scheduler.mode property to FAIR when configuring a SparkContext: -val conf = new SparkConf().setMaster(...).setAppName(...) +val conf = new SparkConf().setMaster(...).setAppName(...) conf.set(spark.scheduler.mode, FAIR) -val sc = new SparkContext(conf) +val sc = new SparkContext(conf) Fair Scheduler Pools @@ -337,15 +337,15 @@ many concurrent jobs they have instead of giving jobs equal shares. Thi adding the spark.scheduler.pool local property to the SparkContext in the thread thats submitting them. 
This is done as follows: -// Assuming sc is your SparkContext variable -sc.setLocalProperty("spark.scheduler.pool", "pool1") +// Assuming sc is your SparkContext variable +sc.setLocalProperty("spark.scheduler.pool", "pool1") After setting this local property, all jobs submitted within this thread (by calls in this thread to RDD.save, count, collect, etc) will use this pool name. The setting is per-thread to make it easy to have a thread run multiple jobs on behalf of the same user. If you'd like to clear the pool that a thread is associated with, simply call: -sc.setLocalProperty("spark.scheduler.pool", null) +sc.setLocalProperty("spark.scheduler.pool", null) Default Behavior of Pools @@ -379,12 +379,12 @@ of the cluster. By default, each pool's minShare is 0. and setting a spark.scheduler.allocation.file property in your SparkConf. -conf.set("spark.scheduler.allocation.file",
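[Editor's note] Putting the job-scheduling.html snippets above back together, a minimal Scala sketch of enabling FAIR scheduling and assigning jobs to a pool is below; the pool name "pool1" and the allocation-file path are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("FairSchedulingExample")
conf.set("spark.scheduler.mode", "FAIR")
// Optional: point at an XML file describing pool weights and minimum shares.
conf.set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// Jobs submitted from this thread go to pool1; pass null to clear the association.
sc.setLocalProperty("spark.scheduler.pool", "pool1")
// ... run jobs on behalf of this user ...
sc.setLocalProperty("spark.scheduler.pool", null)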
[12/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-feature-extraction.html -- diff --git a/site/docs/2.1.0/mllib-feature-extraction.html b/site/docs/2.1.0/mllib-feature-extraction.html index 4726b37..f8cd98e 100644 --- a/site/docs/2.1.0/mllib-feature-extraction.html +++ b/site/docs/2.1.0/mllib-feature-extraction.html @@ -307,32 +307,32 @@ - TF-IDF - Word2Vec - Model - Example + TF-IDF + Word2Vec + Model + Example - StandardScaler - Model Fitting - Example + StandardScaler + Model Fitting + Example - Normalizer - Example + Normalizer + Example - ChiSqSelector - Model Fitting - Example + ChiSqSelector + Model Fitting + Example - ElementwiseProduct - Example + ElementwiseProduct + Example - PCA - Example + PCA + Example @@ -390,7 +390,7 @@ Each record could be an iterable of strings or other types. Refer to the HashingTF Scala docs for details on the API. -import org.apache.spark.mllib.feature.{HashingTF, IDF} +import org.apache.spark.mllib.feature.{HashingTF, IDF} import org.apache.spark.mllib.linalg.Vector import org.apache.spark.rdd.RDD @@ -424,24 +424,24 @@ Each record could be an iterable of strings or other types. Refer to the HashingTF Python docs for details on the API. -from pyspark.mllib.feature import HashingTF, IDF +from pyspark.mllib.feature import HashingTF, IDF -# Load documents (one per line). -documents = sc.textFile(data/mllib/kmeans_data.txt).map(lambda line: line.split( )) +# Load documents (one per line). +documents = sc.textFile(data/mllib/kmeans_data.txt).map(lambda line: line.split( )) hashingTF = HashingTF() tf = hashingTF.transform(documents) -# While applying HashingTF only needs a single pass to the data, applying IDF needs two passes: -# First to compute the IDF vector and second to scale the term frequencies by IDF. +# While applying HashingTF only needs a single pass to the data, applying IDF needs two passes: +# First to compute the IDF vector and second to scale the term frequencies by IDF. tf.cache() idf = IDF().fit(tf) tfidf = idf.transform(tf) -# spark.mllibs IDF implementation provides an option for ignoring terms -# which occur in less than a minimum number of documents. -# In such cases, the IDF for these terms is set to 0. -# This feature can be used by passing the minDocFreq value to the IDF constructor. +# spark.mllibs IDF implementation provides an option for ignoring terms +# which occur in less than a minimum number of documents. +# In such cases, the IDF for these terms is set to 0. +# This feature can be used by passing the minDocFreq value to the IDF constructor. idfIgnore = IDF(minDocFreq=2).fit(tf) tfidfIgnore = idfIgnore.transform(tf) @@ -467,7 +467,7 @@ skip-gram model is to maximize the average log-likelihood \[ \frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t) \] -where $k$ is the size of the training window. +where $k$ is the size of the training window. In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$ which are vector representations of $w$ as word and context respectively. The probability of correctly @@ -475,7 +475,7 @@ predicting word $w_i$ given word $w_j$ is determined by the softmax model, which \[ p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})} \] -where $V$ is the vocabulary size. +where $V$ is the vocabulary size. The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$ is proportional to $V$, which can be easily in order of millions. 
To speed up training of Word2Vec, @@ -488,13 +488,13 @@ $O(\log(V))$ construct a Word2Vec instance and then fit a Word2VecModel with the input data. Finally, we display the top 40 synonyms of the specified word. To run the example, first download the http://mattmahoney.net/dc/text8.zip;>text8 data and extract it to your preferred directory. -Here we assume the extracted file is text8 and in same directory as you run the spark shell. +Here we assume the extracted file is text8 and in same directory as you run the spark shell. Refer to the Word2Vec Scala docs for details on the API. -import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel} +import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel} val input = sc.textFile(data/mllib/sample_lda_data.txt).map(line = line.split( ).toSeq) @@ -505,7 +505,7 @@ Here we assume the extracted file is text8 and in same directory as val synonyms = model.findSynonyms(1, 5) for((synonym, cosineSimilarity) - synonyms) { -
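[Editor's note] The Word2Vec example in this hunk lost its quoting. A cleaned Scala sketch is below; the text8 path, the query word "china", and the save path are the illustrative values used by the docs.

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("text8").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()
val model = word2vec.fit(input)

// Print the most similar words to a query word with their cosine similarities.
val synonyms = model.findSynonyms("china", 40)
for ((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}

// Models can be saved and reloaded.
model.save(sc, "myModelPath")
val sameModel = Word2VecModel.load(sc, "myModelPath")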
[17/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-clustering.html -- diff --git a/site/docs/2.1.0/mllib-clustering.html b/site/docs/2.1.0/mllib-clustering.html index 9667606..1b50dab 100644 --- a/site/docs/2.1.0/mllib-clustering.html +++ b/site/docs/2.1.0/mllib-clustering.html @@ -366,12 +366,12 @@ models are trained for each cluster). The spark.mllib package supports the following models: - K-means - Gaussian mixture - Power iteration clustering (PIC) - Latent Dirichlet allocation (LDA) - Bisecting k-means - Streaming k-means + K-means + Gaussian mixture + Power iteration clustering (PIC) + Latent Dirichlet allocation (LDA) + Bisecting k-means + Streaming k-means K-means @@ -408,7 +408,7 @@ optimal k is usually one where there is an elbow in the W Refer to the KMeans Scala docs and KMeansModel Scala docs for details on the API. -import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} +import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} import org.apache.spark.mllib.linalg.Vectors // Load and parse the data @@ -440,7 +440,7 @@ that is equivalent to the provided example in Scala is given below: Refer to the KMeans Java docs and KMeansModel Java docs for details on the API. -import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.function.Function; import org.apache.spark.mllib.clustering.KMeans; import org.apache.spark.mllib.clustering.KMeansModel; @@ -470,7 +470,7 @@ that is equivalent to the provided example in Scala is given below: KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations); System.out.println(Cluster centers:); -for (Vector center: clusters.clusterCenters()) { +for (Vector center: clusters.clusterCenters()) { System.out.println( + center); } double cost = clusters.computeCost(parsedData.rdd()); @@ -498,29 +498,29 @@ fact the optimal k is usually one where there is an elbow Refer to the KMeans Python docs and KMeansModel Python docs for more details on the API. 
-from numpy import array +from numpy import array from math import sqrt from pyspark.mllib.clustering import KMeans, KMeansModel -# Load and parse the data -data = sc.textFile(data/mllib/kmeans_data.txt) -parsedData = data.map(lambda line: array([float(x) for x in line.split( )])) +# Load and parse the data +data = sc.textFile(data/mllib/kmeans_data.txt) +parsedData = data.map(lambda line: array([float(x) for x in line.split( )])) -# Build the model (cluster the data) -clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode=random) +# Build the model (cluster the data) +clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode=random) -# Evaluate clustering by computing Within Set Sum of Squared Errors +# Evaluate clustering by computing Within Set Sum of Squared Errors def error(point): center = clusters.centers[clusters.predict(point)] return sqrt(sum([x**2 for x in (point - center)])) WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y) -print(Within Set Sum of Squared Error = + str(WSSSE)) +print(Within Set Sum of Squared Error = + str(WSSSE)) -# Save and load model -clusters.save(sc, target/org/apache/spark/PythonKMeansExample/KMeansModel) -sameModel = KMeansModel.load(sc, target/org/apache/spark/PythonKMeansExample/KMeansModel) +# Save and load model +clusters.save(sc, target/org/apache/spark/PythonKMeansExample/KMeansModel) +sameModel = KMeansModel.load(sc, target/org/apache/spark/PythonKMeansExample/KMeansModel) Find full example code at "examples/src/main/python/mllib/k_means_example.py" in the Spark repo. @@ -554,7 +554,7 @@ to the algorithm. We then output the parameters of the mixture model. Refer to the GaussianMixture Scala docs and GaussianMixtureModel Scala docs for details on the API. -import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel} +import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel} import org.apache.spark.mllib.linalg.Vectors // Load and parse the data @@ -587,7 +587,7 @@ that is equivalent to the provided example in Scala is given below: Refer to the GaussianMixture Java docs and GaussianMixtureModel Java docs for details on the API. -import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.function.Function; import org.apache.spark.mllib.clustering.GaussianMixture; import org.apache.spark.mllib.clustering.GaussianMixtureModel; @@ -612,7 +612,7 @@ that is equivalent to the provided example in Scala is given below: parsedData.cache(); // Cluster the data into two classes using GaussianMixture -GaussianMixtureModel
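[Editor's note] The k-means snippets in the mllib-clustering.html hunks above are missing their quotes. A minimal Scala sketch of clustering with spark.mllib's KMeans and computing the Within Set Sum of Squared Errors is below, using the sample data path from the docs.

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse whitespace-separated numeric data.
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes.
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = clusters.computeCost(parsedData)
println(s"Within Set Sum of Squared Errors = $WSSSE")

// Save and reload the model (target path is illustrative).
clusters.save(sc, "target/KMeansModel")
val sameModel = KMeansModel.load(sc, "target/KMeansModel")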
[13/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-evaluation-metrics.html -- diff --git a/site/docs/2.1.0/mllib-evaluation-metrics.html b/site/docs/2.1.0/mllib-evaluation-metrics.html index 4bc636d..0d5bb3b 100644 --- a/site/docs/2.1.0/mllib-evaluation-metrics.html +++ b/site/docs/2.1.0/mllib-evaluation-metrics.html @@ -307,20 +307,20 @@ - Classification model evaluation - Binary classification - Threshold tuning + Classification model evaluation + Binary classification + Threshold tuning - Multiclass classification - Label based metrics + Multiclass classification + Label based metrics - Multilabel classification - Ranking systems + Multilabel classification + Ranking systems - Regression model evaluation + Regression model evaluation spark.mllib comes with a number of machine learning algorithms that can be used to learn from and make predictions @@ -421,7 +421,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation Refer to the LogisticRegressionWithLBFGS Scala docs and BinaryClassificationMetrics Scala docs for details on the API. -import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.util.MLUtils @@ -453,13 +453,13 @@ data, and evaluate the performance of the algorithm by several binary evaluation // Precision by threshold val precision = metrics.precisionByThreshold precision.foreach { case (t, p) = - println(sThreshold: $t, Precision: $p) + println(sThreshold: $t, Precision: $p) } // Recall by threshold val recall = metrics.recallByThreshold recall.foreach { case (t, r) = - println(sThreshold: $t, Recall: $r) + println(sThreshold: $t, Recall: $r) } // Precision-Recall Curve @@ -468,13 +468,13 @@ data, and evaluate the performance of the algorithm by several binary evaluation // F-measure val f1Score = metrics.fMeasureByThreshold f1Score.foreach { case (t, f) = - println(sThreshold: $t, F-score: $f, Beta = 1) + println(sThreshold: $t, F-score: $f, Beta = 1) } val beta = 0.5 val fScore = metrics.fMeasureByThreshold(beta) f1Score.foreach { case (t, f) = - println(sThreshold: $t, F-score: $f, Beta = 0.5) + println(sThreshold: $t, F-score: $f, Beta = 0.5) } // AUPRC @@ -498,7 +498,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation Refer to the LogisticRegressionModel Java docs and LogisticRegressionWithLBFGS Java docs for details on the API. -import scala.Tuple2; +import scala.Tuple2; import org.apache.spark.api.java.*; import org.apache.spark.api.java.function.Function; @@ -518,7 +518,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation JavaRDDLabeledPoint test = splits[1]; // Run training algorithm to build the model. -final LogisticRegressionModel model = new LogisticRegressionWithLBFGS() +final LogisticRegressionModel model = new LogisticRegressionWithLBFGS() .setNumClasses(2) .run(training.rdd()); @@ -538,7 +538,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation // Get evaluation metrics. 
BinaryClassificationMetrics metrics = - new BinaryClassificationMetrics(predictionAndLabels.rdd()); + new BinaryClassificationMetrics(predictionAndLabels.rdd()); // Precision by threshold JavaRDDTuple2Object, Object precision = metrics.precisionByThreshold().toJavaRDD(); @@ -564,7 +564,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation new FunctionTuple2Object, Object, Double() { @Override public Double call(Tuple2Object, Object t) { - return new Double(t._1().toString()); + return new Double(t._1().toString()); } } ); @@ -590,34 +590,34 @@ data, and evaluate the performance of the algorithm by several binary evaluation Refer to the BinaryClassificationMetrics Python docs and LogisticRegressionWithLBFGS Python docs for more details on the API. -from pyspark.mllib.classification import LogisticRegressionWithLBFGS +from pyspark.mllib.classification import LogisticRegressionWithLBFGS from pyspark.mllib.evaluation import BinaryClassificationMetrics from pyspark.mllib.regression import LabeledPoint -# Several of the methods available in scala are currently missing from pyspark -# Load training data in LIBSVM format +# Several of the methods available in scala are currently missing from pyspark +#
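[Editor's note] The threshold-based metric calls in the evaluation-metrics hunks above are garbled. Assuming a `predictionAndLabels` RDD of (score, label) pairs from a binary classifier, a cleaned Scala sketch is:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val metrics = new BinaryClassificationMetrics(predictionAndLabels)

// Precision, recall and F-measure at each candidate threshold.
metrics.precisionByThreshold.foreach { case (t, p) => println(s"Threshold: $t, Precision: $p") }
metrics.recallByThreshold.foreach { case (t, r) => println(s"Threshold: $t, Recall: $r") }
metrics.fMeasureByThreshold.foreach { case (t, f) => println(s"Threshold: $t, F-score: $f, Beta = 1") }
metrics.fMeasureByThreshold(0.5).foreach { case (t, f) => println(s"Threshold: $t, F-score: $f, Beta = 0.5") }

// Area under the precision-recall curve and the ROC curve.
println(s"Area under precision-recall curve = ${metrics.areaUnderPR()}")
println(s"Area under ROC = ${metrics.areaUnderROC()}")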
[08/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/quick-start.html -- diff --git a/site/docs/2.1.0/quick-start.html b/site/docs/2.1.0/quick-start.html index 76e67e1..9d5fad7 100644 --- a/site/docs/2.1.0/quick-start.html +++ b/site/docs/2.1.0/quick-start.html @@ -129,14 +129,14 @@ - Interactive Analysis with the Spark Shell - Basics - More on RDD Operations - Caching + Interactive Analysis with the Spark Shell + Basics + More on RDD Operations + Caching - Self-Contained Applications - Where to Go from Here + Self-Contained Applications + Where to Go from Here This tutorial provides a quick introduction to using Spark. We will first introduce the API through Sparks @@ -164,26 +164,26 @@ or Python. Start it by running the following in the Spark directory: Sparks primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Lets make a new RDD from the text of the README file in the Spark source directory: -scala val textFile = sc.textFile(README.md) -textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at console:25 +scala val textFile = sc.textFile(README.md) +textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at console:25 RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Lets start with a few actions: -scala textFile.count() // Number of items in this RDD +scala textFile.count() // Number of items in this RDD res0: Long = 126 // May be different from yours as README.md will change over time, similar to other outputs scala textFile.first() // First item in this RDD -res1: String = # Apache Spark +res1: String = # Apache Spark Now lets use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file. -scala val linesWithSpark = textFile.filter(line = line.contains(Spark)) -linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at console:27 +scala val linesWithSpark = textFile.filter(line = line.contains(Spark)) +linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at console:27 We can chain together transformations and actions: -scala textFile.filter(line = line.contains(Spark)).count() // How many lines contain Spark? -res3: Long = 15 +scala textFile.filter(line = line.contains(Spark)).count() // How many lines contain Spark? +res3: Long = 15 @@ -193,24 +193,24 @@ or Python. Start it by running the following in the Spark directory: Sparks primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Lets make a new RDD from the text of the README file in the Spark source directory: - textFile = sc.textFile(README.md) + textFile = sc.textFile(README.md) RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Lets start with a few actions: - textFile.count() # Number of items in this RDD + textFile.count() # Number of items in this RDD 126 - textFile.first() # First item in this RDD -u# Apache Spark + textFile.first() # First item in this RDD +u# Apache Spark Now lets use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file. 
- linesWithSpark = textFile.filter(lambda line: Spark in line) + linesWithSpark = textFile.filter(lambda line: Spark in line) We can chain together transformations and actions: - textFile.filter(lambda line: Spark in line).count() # How many lines contain Spark? -15 + textFile.filter(lambda line: Spark in line).count() # How many lines contain Spark? +15 @@ -221,38 +221,38 @@ or Python. Start it by running the following in the Spark directory: -scala textFile.map(line = line.split( ).size).reduce((a, b) = if (a b) a else b) -res4: Long = 15 +scala textFile.map(line = line.split( ).size).reduce((a, b) = if (a b) a else b) +res4: Long = 15 This first maps a line to an integer value, creating a new RDD. reduce is called on that RDD to find the largest line count. The arguments to map and reduce are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. Well use Math.max() function to make this code easier to understand: -scala import java.lang.Math +scala
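[Editor's note] The quick-start shell transcripts above lose their quotes and "=>" arrows. The same sequence of RDD actions and transformations, written as a plain Scala sketch (README.md is relative to the Spark directory), is:

val textFile = sc.textFile("README.md")

textFile.count()   // number of lines in this RDD
textFile.first()   // first line, e.g. "# Apache Spark"

// Transformation: keep only the lines mentioning Spark, then count them.
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.count()

// Find the largest number of words on a single line.
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)

// The same reduction expressed with java.lang.Math.
import java.lang.Math
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))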
[14/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-decision-tree.html -- diff --git a/site/docs/2.1.0/mllib-decision-tree.html b/site/docs/2.1.0/mllib-decision-tree.html index 1a3d865..991610e 100644 --- a/site/docs/2.1.0/mllib-decision-tree.html +++ b/site/docs/2.1.0/mllib-decision-tree.html @@ -307,23 +307,23 @@ - Basic algorithm - Node impurity and information gain - Split candidates - Stopping rule + Basic algorithm + Node impurity and information gain + Split candidates + Stopping rule - Usage tips - Problem specification parameters - Stopping criteria - Tunable parameters - Caching and checkpointing + Usage tips + Problem specification parameters + Stopping criteria + Tunable parameters + Caching and checkpointing - Scaling - Examples - Classification - Regression + Scaling + Examples + Classification + Regression @@ -548,7 +548,7 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a Refer to the DecisionTree Scala docs and DecisionTreeModel Scala docs for details on the API. -import org.apache.spark.mllib.tree.DecisionTree +import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel import org.apache.spark.mllib.util.MLUtils @@ -588,7 +588,7 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a Refer to the DecisionTree Java docs and DecisionTreeModel Java docs for details on the API. -import java.util.HashMap; +import java.util.HashMap; import java.util.Map; import scala.Tuple2; @@ -604,8 +604,8 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.util.MLUtils; -SparkConf sparkConf = new SparkConf().setAppName(JavaDecisionTreeClassificationExample); -JavaSparkContext jsc = new JavaSparkContext(sparkConf); +SparkConf sparkConf = new SparkConf().setAppName(JavaDecisionTreeClassificationExample); +JavaSparkContext jsc = new JavaSparkContext(sparkConf); // Load and parse the data file. String datapath = data/mllib/sample_libsvm_data.txt; @@ -657,30 +657,30 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a Refer to the DecisionTree Python docs and DecisionTreeModel Python docs for more details on the API. -from pyspark.mllib.tree import DecisionTree, DecisionTreeModel +from pyspark.mllib.tree import DecisionTree, DecisionTreeModel from pyspark.mllib.util import MLUtils -# Load and parse the data file into an RDD of LabeledPoint. -data = MLUtils.loadLibSVMFile(sc, data/mllib/sample_libsvm_data.txt) -# Split the data into training and test sets (30% held out for testing) +# Load and parse the data file into an RDD of LabeledPoint. +data = MLUtils.loadLibSVMFile(sc, data/mllib/sample_libsvm_data.txt) +# Split the data into training and test sets (30% held out for testing) (trainingData, testData) = data.randomSplit([0.7, 0.3]) -# Train a DecisionTree model. -# Empty categoricalFeaturesInfo indicates all features are continuous. +# Train a DecisionTree model. +# Empty categoricalFeaturesInfo indicates all features are continuous. 
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, - impurity=gini, maxDepth=5, maxBins=32) + impurity=gini, maxDepth=5, maxBins=32) -# Evaluate model on test instances and compute test error +# Evaluate model on test instances and compute test error predictions = model.predict(testData.map(lambda x: x.features)) labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions) testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count()) -print(Test Error = + str(testErr)) -print(Learned classification tree model:) +print(Test Error = + str(testErr)) +print(Learned classification tree model:) print(model.toDebugString()) -# Save and load model -model.save(sc, target/tmp/myDecisionTreeClassificationModel) -sameModel = DecisionTreeModel.load(sc, target/tmp/myDecisionTreeClassificationModel) +# Save and load model +model.save(sc, target/tmp/myDecisionTreeClassificationModel) +sameModel = DecisionTreeModel.load(sc, target/tmp/myDecisionTreeClassificationModel) Find full example code at "examples/src/main/python/mllib/decision_tree_classification_example.py" in the Spark repo. @@ -701,7 +701,7 @@ depth of 5. The Mean Squared Error (MSE) is computed at the end to evaluate Refer to the DecisionTree Scala docs and DecisionTreeModel Scala
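[Editor's note] The decision-tree classification example above is quote-stripped. A cleaned Scala sketch of training and evaluating a DecisionTree classifier on the sample LIBSVM data is:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// Load data and hold out 30% for testing.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

// Compute the test error.
val labelAndPreds = testData.map(point => (point.label, model.predict(point.features)))
val testErr = labelAndPreds.filter { case (v, p) => v != p }.count().toDouble / testData.count()
println(s"Test Error = $testErr")
println(s"Learned classification tree model:\n${model.toDebugString}")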
[20/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-features.html -- diff --git a/site/docs/2.1.0/ml-features.html b/site/docs/2.1.0/ml-features.html index 64463de..a2f102b 100644 --- a/site/docs/2.1.0/ml-features.html +++ b/site/docs/2.1.0/ml-features.html @@ -318,52 +318,52 @@ Table of Contents - Feature Extractors - TF-IDF - Word2Vec - CountVectorizer + Feature Extractors + TF-IDF + Word2Vec + CountVectorizer - Feature Transformers - Tokenizer - StopWordsRemover - $n$-gram - Binarizer - PCA - PolynomialExpansion - Discrete Cosine Transform (DCT) - StringIndexer - IndexToString - OneHotEncoder - VectorIndexer - Interaction - Normalizer - StandardScaler - MinMaxScaler - MaxAbsScaler - Bucketizer - ElementwiseProduct - SQLTransformer - VectorAssembler - QuantileDiscretizer + Feature Transformers + Tokenizer + StopWordsRemover + $n$-gram + Binarizer + PCA + PolynomialExpansion + Discrete Cosine Transform (DCT) + StringIndexer + IndexToString + OneHotEncoder + VectorIndexer + Interaction + Normalizer + StandardScaler + MinMaxScaler + MaxAbsScaler + Bucketizer + ElementwiseProduct + SQLTransformer + VectorAssembler + QuantileDiscretizer - Feature Selectors - VectorSlicer - RFormula - ChiSqSelector + Feature Selectors + VectorSlicer + RFormula + ChiSqSelector - Locality Sensitive Hashing - LSH Operations - Feature Transformation - Approximate Similarity Join - Approximate Nearest Neighbor Search + Locality Sensitive Hashing + LSH Operations + Feature Transformation + Approximate Similarity Join + Approximate Nearest Neighbor Search - LSH Algorithms - Bucketed Random Projection for Euclidean Distance - MinHash for Jaccard Distance + LSH Algorithms + Bucketed Random Projection for Euclidean Distance + MinHash for Jaccard Distance @@ -395,7 +395,7 @@ TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D). There are several variants on the definition of term frequency and document frequency. In MLlib, we separate TF and IDF to make them flexible. -TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors. +TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors. HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a set of terms might be a bag of words. @@ -437,7 +437,7 @@ when using text as features. Our feature vectors could then be passed to a lear Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API. -import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer} +import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer} val sentenceData = spark.createDataFrame(Seq( (0.0, Hi I heard about Spark), @@ -468,7 +468,7 @@ the IDF Scala doc Refer to the HashingTF Java docs and the IDF Java docs for more details on the API. 
-import java.util.Arrays; +import java.util.Arrays; import java.util.List; import org.apache.spark.ml.feature.HashingTF; @@ -489,17 +489,17 @@ the IDF Scala doc RowFactory.create(0.0, I wish Java could use case classes), RowFactory.create(1.0, Logistic regression models are neat) ); -StructType schema = new StructType(new StructField[]{ - new StructField(label, DataTypes.DoubleType, false, Metadata.empty()), - new StructField(sentence, DataTypes.StringType, false, Metadata.empty()) +StructType schema = new StructType(new StructField[]{ + new StructField(label, DataTypes.DoubleType, false, Metadata.empty()), + new StructField(sentence, DataTypes.StringType, false, Metadata.empty()) }); DatasetRow sentenceData = spark.createDataFrame(data, schema); -Tokenizer tokenizer = new Tokenizer().setInputCol(sentence).setOutputCol(words); +Tokenizer tokenizer = new Tokenizer().setInputCol(sentence).setOutputCol(words); DatasetRow wordsData = tokenizer.transform(sentenceData); int numFeatures = 20; -HashingTF hashingTF = new HashingTF() +HashingTF hashingTF = new HashingTF() .setInputCol(words) .setOutputCol(rawFeatures) .setNumFeatures(numFeatures); @@ -507,7 +507,7 @@ the IDF Scala doc DatasetRow featurizedData = hashingTF.transform(wordsData); // alternatively, CountVectorizer can also be used to get term frequency vectors -IDF idf = new IDF().setInputCol(rawFeatures).setOutputCol(features); +IDF
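[Editor's note] The TF-IDF feature-extraction example above (Scala and Java variants) is heavily quote-stripped. A cleaned Scala sketch using Tokenizer, HashingTF and IDF is below, assuming a SparkSession named `spark` is in scope.

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into words.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Hash the words into fixed-length term-frequency vectors.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)

// Rescale term frequencies by inverse document frequency.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show()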
[22/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-classification-regression.html -- diff --git a/site/docs/2.1.0/ml-classification-regression.html b/site/docs/2.1.0/ml-classification-regression.html index 1e0665b..0b264bb 100644 --- a/site/docs/2.1.0/ml-classification-regression.html +++ b/site/docs/2.1.0/ml-classification-regression.html @@ -329,58 +329,58 @@ discussing specific classes of algorithms, such as linear methods, trees, and en Table of Contents - Classification - Logistic regression - Binomial logistic regression - Multinomial logistic regression + Classification + Logistic regression + Binomial logistic regression + Multinomial logistic regression - Decision tree classifier - Random forest classifier - Gradient-boosted tree classifier - Multilayer perceptron classifier - One-vs-Rest classifier (a.k.a. One-vs-All) - Naive Bayes + Decision tree classifier + Random forest classifier + Gradient-boosted tree classifier + Multilayer perceptron classifier + One-vs-Rest classifier (a.k.a. One-vs-All) + Naive Bayes - Regression - Linear regression - Generalized linear regression - Available families + Regression + Linear regression + Generalized linear regression + Available families - Decision tree regression - Random forest regression - Gradient-boosted tree regression - Survival regression - Isotonic regression - Examples + Decision tree regression + Random forest regression + Gradient-boosted tree regression + Survival regression + Isotonic regression + Examples - Linear methods - Decision trees - Inputs and Outputs - Input Columns - Output Columns + Linear methods + Decision trees + Inputs and Outputs + Input Columns + Output Columns - Tree Ensembles - Random Forests - Inputs and Outputs - Input Columns - Output Columns (Predictions) + Tree Ensembles + Random Forests + Inputs and Outputs + Input Columns + Output Columns (Predictions) - Gradient-Boosted Trees (GBTs) - Inputs and Outputs - Input Columns - Output Columns (Predictions) + Gradient-Boosted Trees (GBTs) + Inputs and Outputs + Input Columns + Output Columns (Predictions) @@ -407,7 +407,7 @@ parameter to select between these two algorithms, or leave it unset and Spark wi Binomial logistic regression -For more background and more details about the implementation of binomial logistic regression, refer to the documentation of logistic regression in spark.mllib. +For more background and more details about the implementation of binomial logistic regression, refer to the documentation of logistic regression in spark.mllib. Example @@ -421,7 +421,7 @@ $\alpha$ and regParam corresponds to $\lambda$. More details on parameters can be found in the Scala API documentation. -import org.apache.spark.ml.classification.LogisticRegression +import org.apache.spark.ml.classification.LogisticRegression // Load training data val training = spark.read.format(libsvm).load(data/mllib/sample_libsvm_data.txt) @@ -435,7 +435,7 @@ $\alpha$ and regParam corresponds to $\lambda$. val lrModel = lr.fit(training) // Print the coefficients and intercept for logistic regression -println(sCoefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}) +println(sCoefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}) // We can also use the multinomial family for binary classification val mlr = new LogisticRegression() @@ -447,8 +447,8 @@ $\alpha$ and regParam corresponds to $\lambda$. 
val mlrModel = mlr.fit(training) // Print the coefficients and intercepts for logistic regression with multinomial family -println(sMultinomial coefficients: ${mlrModel.coefficientMatrix}) -println(sMultinomial intercepts: ${mlrModel.interceptVector}) +println(sMultinomial coefficients: ${mlrModel.coefficientMatrix}) +println(sMultinomial intercepts: ${mlrModel.interceptVector}) Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala" in the Spark repo. @@ -457,7 +457,7 @@ $\alpha$ and regParam corresponds to $\lambda$.
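[Editor's note] The binomial logistic regression example above drops its quotes and string interpolators. A cleaned Scala sketch of fitting an elastic-net regularized LogisticRegression on LIBSVM data, plus the multinomial-family variant, is:

import org.apache.spark.ml.classification.LogisticRegression

val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)        // corresponds to lambda
  .setElasticNetParam(0.8) // corresponds to alpha

val lrModel = lr.fit(training)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// The multinomial family can also be used for binary classification.
val mlr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setFamily("multinomial")

val mlrModel = mlr.fit(training)
println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}")
println(s"Multinomial intercepts: ${mlrModel.interceptVector}")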
[05/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/storage-openstack-swift.html -- diff --git a/site/docs/2.1.0/storage-openstack-swift.html b/site/docs/2.1.0/storage-openstack-swift.html index bbb3446..a20c67f 100644 --- a/site/docs/2.1.0/storage-openstack-swift.html +++ b/site/docs/2.1.0/storage-openstack-swift.html @@ -144,7 +144,7 @@ Current Swift driver requires Swift to use Keystone authentication method. The Spark application should include hadoop-openstack dependency. For example, for Maven support, add the following to the pom.xml file: -dependencyManagement +dependencyManagement ... dependency groupIdorg.apache.hadoop/groupId @@ -152,15 +152,15 @@ For example, for Maven support, add the following to the pom.xml fi version2.3.0/version /dependency ... -/dependencyManagement +/dependencyManagement Configuration Parameters Create core-site.xml and place it inside Sparks conf directory. There are two main categories of parameters that should to be configured: declaration of the -Swift driver and the parameters that are required by Keystone. +Swift driver and the parameters that are required by Keystone. -Configuration of Hadoop to use Swift File system achieved via +Configuration of Hadoop to use Swift File system achieved via Property NameValue @@ -221,7 +221,7 @@ contains a list of Keystone mandatory parameters. PROVIDER can be a For example, assume PROVIDER=SparkTest and Keystone contains user tester with password testing defined for tenant test. Then core-site.xml should include: -configuration +configuration property namefs.swift.impl/name valueorg.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem/value @@ -257,7 +257,7 @@ defined for tenant test. Then core-site.xml should inc namefs.swift.service.SparkTest.password/name valuetesting/value /property -/configuration +/configuration Notice that fs.swift.service.PROVIDER.tenant, http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/streaming-custom-receivers.html -- diff --git a/site/docs/2.1.0/streaming-custom-receivers.html b/site/docs/2.1.0/streaming-custom-receivers.html index d31647d..846c797 100644 --- a/site/docs/2.1.0/streaming-custom-receivers.html +++ b/site/docs/2.1.0/streaming-custom-receivers.html @@ -171,7 +171,7 @@ has any error connecting or receiving, the receiver is restarted to make another -class CustomReceiver(host: String, port: Int) +class CustomReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging { def onStart() { @@ -216,12 +216,12 @@ has any error connecting or receiving, the receiver is restarted to make another restart(Error receiving data, t) } } -} +} -public class JavaCustomReceiver extends ReceiverString { +public class JavaCustomReceiver extends ReceiverString { String host = null; int port = -1; @@ -234,7 +234,7 @@ has any error connecting or receiving, the receiver is restarted to make another public void onStart() { // Start the thread that receives data over a connection -new Thread() { +new Thread() { @Override public void run() { receive(); } @@ -253,10 +253,10 @@ has any error connecting or receiving, the receiver is restarted to make another try { // connect to the server - socket = new Socket(host, port); + socket = new Socket(host, port); - BufferedReader reader = new BufferedReader( -new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8)); + BufferedReader reader = new BufferedReader( +new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8)); // 
Until stopped or connection broken continue reading while (!isStopped() (userInput = reader.readLine()) != null) { @@ -276,7 +276,7 @@ has any error connecting or receiving, the receiver is restarted to make another restart(Error receiving data, t); } } -} +} @@ -290,20 +290,20 @@ an input DStream using data received by the instance of custom receiver, as show -// Assuming ssc is the StreamingContext +// Assuming ssc is the StreamingContext val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port)) val words = lines.flatMap(_.split( )) -... +... The full source code is in the example https://github.com/apache/spark/blob/v2.1.0/examples/src/main/scala/org/apache/spark/examples/streaming/CustomReceiver.scala;>CustomReceiver.scala. -// Assuming ssc is the JavaStreamingContext -JavaDStreamString customReceiverStream = ssc.receiverStream(new JavaCustomReceiver(host, port)); +// Assuming ssc is the JavaStreamingContext
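[Editor's note] The custom-receiver hunks above lose their generics and quotes. A minimal Scala sketch of a Receiver subclass and how it plugs into a StreamingContext is below; the host/port and the line-per-record socket protocol follow the guide's example, and the Logging mix-in from the original is omitted so the class compiles outside the Spark source tree.

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a thread that receives data over a connection.
    new Thread("Socket Receiver") {
      override def run() { receive() }
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do: receive() stops by itself once isStopped returns true.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var userInput = reader.readLine()
      // Until stopped or the connection breaks, keep reading and storing lines.
      while (!isStopped && userInput != null) {
        store(userInput)
        userInput = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

// Usage, assuming ssc is an existing StreamingContext:
// val lines = ssc.receiverStream(new CustomReceiver("localhost", 9999))
// val words = lines.flatMap(_.split(" "))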
[21/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-clustering.html -- diff --git a/site/docs/2.1.0/ml-clustering.html b/site/docs/2.1.0/ml-clustering.html index e225281..df38605 100644 --- a/site/docs/2.1.0/ml-clustering.html +++ b/site/docs/2.1.0/ml-clustering.html @@ -313,21 +313,21 @@ about these algorithms. Table of Contents - K-means - Input Columns - Output Columns - Example + K-means + Input Columns + Output Columns + Example - Latent Dirichlet allocation (LDA) - Bisecting k-means - Example + Latent Dirichlet allocation (LDA) + Bisecting k-means + Example - Gaussian Mixture Model (GMM) - Input Columns - Output Columns - Example + Gaussian Mixture Model (GMM) + Input Columns + Output Columns + Example @@ -391,7 +391,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea Refer to the Scala API docs for more details. -import org.apache.spark.ml.clustering.KMeans +import org.apache.spark.ml.clustering.KMeans // Loads data. val dataset = spark.read.format(libsvm).load(data/mllib/sample_kmeans_data.txt) @@ -402,7 +402,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea // Evaluate clustering by computing Within Set Sum of Squared Errors. val WSSSE = model.computeCost(dataset) -println(sWithin Set Sum of Squared Errors = $WSSSE) +println(sWithin Set Sum of Squared Errors = $WSSSE) // Shows the result. println(Cluster Centers: ) @@ -414,7 +414,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea Refer to the Java API docs for more details. -import org.apache.spark.ml.clustering.KMeansModel; +import org.apache.spark.ml.clustering.KMeansModel; import org.apache.spark.ml.clustering.KMeans; import org.apache.spark.ml.linalg.Vector; import org.apache.spark.sql.Dataset; @@ -424,7 +424,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea DatasetRow dataset = spark.read().format(libsvm).load(data/mllib/sample_kmeans_data.txt); // Trains a k-means model. -KMeans kmeans = new KMeans().setK(2).setSeed(1L); +KMeans kmeans = new KMeans().setK(2).setSeed(1L); KMeansModel model = kmeans.fit(dataset); // Evaluate clustering by computing Within Set Sum of Squared Errors. @@ -434,7 +434,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea // Shows the result. Vector[] centers = model.clusterCenters(); System.out.println(Cluster Centers: ); -for (Vector center: centers) { +for (Vector center: centers) { System.out.println(center); } @@ -444,22 +444,22 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea Refer to the Python API docs for more details. -from pyspark.ml.clustering import KMeans +from pyspark.ml.clustering import KMeans -# Loads data. -dataset = spark.read.format(libsvm).load(data/mllib/sample_kmeans_data.txt) +# Loads data. +dataset = spark.read.format(libsvm).load(data/mllib/sample_kmeans_data.txt) -# Trains a k-means model. +# Trains a k-means model. kmeans = KMeans().setK(2).setSeed(1) model = kmeans.fit(dataset) -# Evaluate clustering by computing Within Set Sum of Squared Errors. +# Evaluate clustering by computing Within Set Sum of Squared Errors. wssse = model.computeCost(dataset) -print(Within Set Sum of Squared Errors = + str(wssse)) +print(Within Set Sum of Squared Errors = + str(wssse)) -# Shows the result. +# Shows the result. 
centers = model.clusterCenters() -print(Cluster Centers: ) +print(Cluster Centers: ) for center in centers: print(center) @@ -470,7 +470,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea Refer to the R API docs for more details. -# Fit a k-means model with spark.kmeans +# Fit a k-means model with spark.kmeans irisDF - suppressWarnings(createDataFrame(iris)) kmeansDF - irisDF kmeansTestDF - irisDF @@ -504,7 +504,7 @@ and generates a LDAModel as the base model. Expert users may cast a Refer to the Scala API docs for more details. -import org.apache.spark.ml.clustering.LDA +import org.apache.spark.ml.clustering.LDA // Loads data. val dataset = spark.read.format(libsvm) @@ -516,8 +516,8 @@ and generates a LDAModel as the base model. Expert users may cast a val ll = model.logLikelihood(dataset) val lp = model.logPerplexity(dataset) -println(sThe lower bound on the log likelihood of the entire corpus: $ll) -println(sThe upper bound bound on perplexity: $lp) +println(sThe lower bound on the log likelihood of the entire corpus: $ll) +println(sThe upper bound bound on perplexity: $lp) // Describe topics. val topics = model.describeTopics(3) @@ -535,7 +535,7 @@ and generates a LDAModel as the base
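[Editor's note] The spark.ml k-means and LDA snippets above are similarly quote-stripped. A cleaned Scala sketch of the DataFrame-based KMeans as documented for 2.1 (where computeCost was the documented way to get WSSSE) is:

import org.apache.spark.ml.clustering.KMeans

// Load data stored in LIBSVM format.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Train a k-means model with two clusters.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $WSSSE")

// Show the resulting cluster centers.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)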
[02/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/structured-streaming-programming-guide.html -- diff --git a/site/docs/2.1.0/structured-streaming-programming-guide.html b/site/docs/2.1.0/structured-streaming-programming-guide.html index e54c101..3a1ac5f 100644 --- a/site/docs/2.1.0/structured-streaming-programming-guide.html +++ b/site/docs/2.1.0/structured-streaming-programming-guide.html @@ -127,45 +127,50 @@ - Overview - Quick Example - Programming Model - Basic Concepts - Handling Event-time and Late Data - Fault Tolerance Semantics + Overview + Quick Example + Programming Model + Basic Concepts + Handling Event-time and Late Data + Fault Tolerance Semantics - API using Datasets and DataFrames - Creating streaming DataFrames and streaming Datasets - Data Sources - Schema inference and partition of streaming DataFrames/Datasets + API using Datasets and DataFrames + Creating streaming DataFrames and streaming Datasets + Data Sources + Schema inference and partition of streaming DataFrames/Datasets - Operations on streaming DataFrames/Datasets - Basic Operations - Selection, Projection, Aggregation - Window Operations on Event Time - Join Operations - Unsupported Operations + Operations on streaming DataFrames/Datasets + Basic Operations - Selection, Projection, Aggregation + Window Operations on Event Time + Handling Late Data and Watermarking + Join Operations + Unsupported Operations - Starting Streaming Queries - Output Modes - Output Sinks - Using Foreach + Starting Streaming Queries + Output Modes + Output Sinks + Using Foreach - Managing Streaming Queries - Monitoring Streaming Queries - Recovering from Failures with Checkpointing + Managing Streaming Queries + Monitoring Streaming Queries + Interactive APIs + Asynchronous API + + + Recovering from Failures with Checkpointing - Where to go from here + Where to go from here Overview Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data.The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming. -Spark 2.0 is the ALPHA RELEASE of Structured Streaming and the APIs are still experimental. In this guide, we are going to walk you through the programming model and the APIs. First, lets start with a simple example - a streaming word count. +Structured Streaming is still ALPHA in Spark 2.1 and the APIs are still experimental. In this guide, we are going to walk you through the programming model and the APIs. First, lets start with a simple example - a streaming word count. Quick Example Letâs say you want to maintain a running word count of text data received from a data server listening on a TCP socket. Letâs see how you can express this using Structured Streaming. 
You can see the full code in @@ -175,7 +180,7 @@ And if you http://spark.apache.org/downloads.html;>download Spark, -import org.apache.spark.sql.functions._ +import org.apache.spark.sql.functions._ import org.apache.spark.sql.SparkSession val spark = SparkSession @@ -183,12 +188,12 @@ And if you http://spark.apache.org/downloads.html;>download Spark, .appName(StructuredNetworkWordCount) .getOrCreate() -import spark.implicits._ +import spark.implicits._ -import org.apache.spark.api.java.function.FlatMapFunction; +import org.apache.spark.api.java.function.FlatMapFunction; import org.apache.spark.sql.*; import org.apache.spark.sql.streaming.StreamingQuery; @@ -198,19 +203,19 @@ And if you http://spark.apache.org/downloads.html;>download Spark, SparkSession spark = SparkSession .builder() .appName(JavaStructuredNetworkWordCount) - .getOrCreate();
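For reference, the quick example sketched above amounts to the following minimal Scala program. This is only a sketch: the host and port values are placeholders, and it assumes the Spark 2.1 structured streaming API described in the guide.

```scala
import org.apache.spark.sql.SparkSession

object StructuredNetworkWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredNetworkWordCount")
      .getOrCreate()
    import spark.implicits._

    // Read lines streamed from a TCP socket (host/port are placeholders).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split the lines into words and keep a running count per word.
    val words = lines.as[String].flatMap(_.split(" "))
    val wordCounts = words.groupBy("value").count()

    // Print the complete set of running counts to the console after each trigger.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```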
[10/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-pmml-model-export.html -- diff --git a/site/docs/2.1.0/mllib-pmml-model-export.html b/site/docs/2.1.0/mllib-pmml-model-export.html index 30815e0..3f2fd91 100644 --- a/site/docs/2.1.0/mllib-pmml-model-export.html +++ b/site/docs/2.1.0/mllib-pmml-model-export.html @@ -307,8 +307,8 @@ - spark.mllib supported models - Examples + spark.mllib supported models + Examples spark.mllib supported models @@ -353,32 +353,31 @@ Refer to the KMeans Scala docs and Vectors Scala docs for details on the API. -Here a complete example of building a KMeansModel and print it out in PMML format: -import org.apache.spark.mllib.clustering.KMeans -import org.apache.spark.mllib.linalg.Vectors +Here a complete example of building a KMeansModel and print it out in PMML format: +div class="highlight"preimport org.apache.spark.mllib.clustering.KMeans +import org.apache.spark.mllib.linalg.Vectors -// Load and parse the data +// Load and parse the data val data = sc.textFile(data/mllib/kmeans_data.txt) -val parsedData = data.map(s = Vectors.dense(s.split( ).map(_.toDouble))).cache() +val parsedData = data.map(s = Vectors.dense(s.split( ).map(_.toDouble))).cache() -// Cluster the data into two classes using KMeans +// Cluster the data into two classes using KMeans val numClusters = 2 val numIterations = 20 -val clusters = KMeans.train(parsedData, numClusters, numIterations) +val clusters = KMeans.train(parsedData, numClusters, numIterations) -// Export to PMML to a String in PMML format -println(PMML Model:\n + clusters.toPMML) +// Export to PMML to a String in PMML format +println(PMML Model:\n + clusters.toPMML) -// Export the model to a local file in PMML format -clusters.toPMML(/tmp/kmeans.xml) +// Export the model to a local file in PMML format +clusters.toPMML(/tmp/kmeans.xml) -// Export the model to a directory on a distributed file system in PMML format -clusters.toPMML(sc, /tmp/kmeans) +// Export the model to a directory on a distributed file system in PMML format +clusters.toPMML(sc, /tmp/kmeans) -// Export the model to the OutputStream in PMML format +// Export the model to the OutputStream in PMML format clusters.toPMML(System.out) - -Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala" in the Spark repo. +/pre/divdivFind full example code at examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala in the Spark repo./div For unsupported models, either you will not find a .toPMML method or an IllegalArgumentException will be thrown. http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-statistics.html -- diff --git a/site/docs/2.1.0/mllib-statistics.html b/site/docs/2.1.0/mllib-statistics.html index 4485ecf..f04924c 100644 --- a/site/docs/2.1.0/mllib-statistics.html +++ b/site/docs/2.1.0/mllib-statistics.html @@ -358,15 +358,15 @@ - Summary statistics - Correlations - Stratified sampling - Hypothesis testing - Streaming Significance Testing + Summary statistics + Correlations + Stratified sampling + Hypothesis testing + Streaming Significance Testing - Random data generation - Kernel density estimation + Random data generation + Kernel density estimation \[ @@ -401,7 +401,7 @@ total count. Refer to the MultivariateStatisticalSummary Scala docs for details on the API. 
-import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics} val observations = sc.parallelize( @@ -430,7 +430,7 @@ total count. Refer to the MultivariateStatisticalSummary Java docs for details on the API. -import java.util.Arrays; +import java.util.Arrays; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.mllib.linalg.Vector; @@ -463,19 +463,19 @@ total count. Refer to the MultivariateStatisticalSummary Python docs for more details on the API. -import numpy as np +import numpy as np from pyspark.mllib.stat import Statistics mat = sc.parallelize( [np.array([1.0, 10.0, 100.0]), np.array([2.0, 20.0, 200.0]), np.array([3.0, 30.0, 300.0])] -) # an RDD of Vectors +) # an RDD of Vectors -# Compute column summary statistics. +# Compute column summary statistics. summary = Statistics.colStats(mat) -print(summary.mean()) # a dense vector containing the mean value for each column -print(summary.variance()) # column-wise variance -print(summary.numNonzeros()) # number of nonzeros in
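The KMeans-to-PMML example quoted in the diff above, restated as a clean Scala snippet. Paths are the ones used in the docs, and it assumes an existing SparkContext `sc`, as in spark-shell.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data (path is illustrative).
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans.
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Export the model as a PMML string.
println("PMML Model:\n" + clusters.toPMML)

// Export the model to a local file in PMML format.
clusters.toPMML("/tmp/kmeans.xml")

// Export the model to a directory on a distributed file system in PMML format.
clusters.toPMML(sc, "/tmp/kmeans")

// Export the model to an OutputStream in PMML format.
clusters.toPMML(System.out)
```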
[09/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/programming-guide.html -- diff --git a/site/docs/2.1.0/programming-guide.html b/site/docs/2.1.0/programming-guide.html index 12458af..0e06e86 100644 --- a/site/docs/2.1.0/programming-guide.html +++ b/site/docs/2.1.0/programming-guide.html @@ -129,50 +129,50 @@ - Overview - Linking with Spark - Initializing Spark - Using the Shell + Overview + Linking with Spark + Initializing Spark + Using the Shell - Resilient Distributed Datasets (RDDs) - Parallelized Collections - External Datasets - RDD Operations - Basics - Passing Functions to Spark - Understanding closures - Example - Local vs. cluster modes - Printing elements of an RDD + Resilient Distributed Datasets (RDDs) + Parallelized Collections + External Datasets + RDD Operations + Basics + Passing Functions to Spark + Understanding closures + Example + Local vs. cluster modes + Printing elements of an RDD - Working with Key-Value Pairs - Transformations - Actions - Shuffle operations - Background - Performance Impact + Working with Key-Value Pairs + Transformations + Actions + Shuffle operations + Background + Performance Impact - RDD Persistence - Which Storage Level to Choose? - Removing Data + RDD Persistence + Which Storage Level to Choose? + Removing Data - Shared Variables - Broadcast Variables - Accumulators + Shared Variables + Broadcast Variables + Accumulators - Deploying to a Cluster - Launching Spark jobs from Java / Scala - Unit Testing - Where to Go from Here + Deploying to a Cluster + Launching Spark jobs from Java / Scala + Unit Testing + Where to Go from Here Overview @@ -212,8 +212,8 @@ version = your-hdfs-version Finally, you need to import some Spark classes into your program. Add the following lines: -import org.apache.spark.SparkContext -import org.apache.spark.SparkConf +import org.apache.spark.SparkContext +import org.apache.spark.SparkConf (Before Spark 1.3.0, you need to explicitly import org.apache.spark.SparkContext._ to enable essential implicit conversions.) @@ -245,9 +245,9 @@ version = your-hdfs-version Finally, you need to import some Spark classes into your program. Add the following lines: -import org.apache.spark.api.java.JavaSparkContext +import org.apache.spark.api.java.JavaSparkContext import org.apache.spark.api.java.JavaRDD -import org.apache.spark.SparkConf +import org.apache.spark.SparkConf @@ -269,13 +269,13 @@ for common HDFS versions. Finally, you need to import some Spark classes into your program. Add the following line: -from pyspark import SparkContext, SparkConf +from pyspark import SparkContext, SparkConf PySpark requires the same minor version of Python in both driver and workers. It uses the default python version in PATH, you can specify which version of Python you want to use by PYSPARK_PYTHON, for example: -$ PYSPARK_PYTHON=python3.4 bin/pyspark -$ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/python/pi.py +$ PYSPARK_PYTHON=python3.4 bin/pyspark +$ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/python/pi.py @@ -293,8 +293,8 @@ that contains information about your application. Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. -val conf = new SparkConf().setAppName(appName).setMaster(master) -new SparkContext(conf) +val conf = new SparkConf().setAppName(appName).setMaster(master) +new SparkContext(conf) @@ -304,8 +304,8 @@ that contains information about your application. 
how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application. -SparkConf conf = new SparkConf().setAppName(appName).setMaster(master); -JavaSparkContext sc = new JavaSparkContext(conf); +SparkConf conf = new SparkConf().setAppName(appName).setMaster(master); +JavaSparkContext sc = new JavaSparkContext(conf); @@ -315,8 +315,8 @@ that contains information about your application. how to access a cluster. To create a SparkContext you first need to build a SparkConf object that
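Putting the linking and initialization steps from this excerpt together, a minimal Scala sketch; the application name and master URL are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// appName shows up in the cluster UI; master is e.g. "local[2]" or a cluster URL.
val conf = new SparkConf()
  .setAppName("MyApp")      // placeholder application name
  .setMaster("local[2]")    // placeholder master; usually supplied by spark-submit instead
val sc = new SparkContext(conf)

// Parallelize a local collection into an RDD and run a simple action on it.
val distData = sc.parallelize(1 to 1000)
println(distData.reduce(_ + _))

sc.stop()
```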
[07/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/sparkr.html -- diff --git a/site/docs/2.1.0/sparkr.html b/site/docs/2.1.0/sparkr.html index 0a1a347..e861a01 100644 --- a/site/docs/2.1.0/sparkr.html +++ b/site/docs/2.1.0/sparkr.html @@ -127,53 +127,53 @@ - Overview - SparkDataFrame - Starting Up: SparkSession - Starting Up from RStudio - Creating SparkDataFrames - From local data frames - From Data Sources - From Hive tables + Overview + SparkDataFrame + Starting Up: SparkSession + Starting Up from RStudio + Creating SparkDataFrames + From local data frames + From Data Sources + From Hive tables - SparkDataFrame Operations - Selecting rows, columns - Grouping, Aggregation - Operating on Columns - Applying User-Defined Function - Run a given function on a large dataset using dapply or dapplyCollect - dapply - dapplyCollect + SparkDataFrame Operations + Selecting rows, columns + Grouping, Aggregation + Operating on Columns + Applying User-Defined Function + Run a given function on a large dataset using dapply or dapplyCollect + dapply + dapplyCollect - Run a given function on a large dataset grouping by input column(s) and using gapply or gapplyCollect - gapply - gapplyCollect + Run a given function on a large dataset grouping by input column(s) and using gapply or gapplyCollect + gapply + gapplyCollect - Data type mapping between R and Spark - Run local R functions distributed using spark.lapply - spark.lapply + Data type mapping between R and Spark + Run local R functions distributed using spark.lapply + spark.lapply - Running SQL Queries from SparkR + Running SQL Queries from SparkR - Machine Learning - Algorithms - Model persistence + Machine Learning + Algorithms + Model persistence - R Function Name Conflicts - Migration Guide - Upgrading From SparkR 1.5.x to 1.6.x - Upgrading From SparkR 1.6.x to 2.0 - Upgrading to SparkR 2.1.0 + R Function Name Conflicts + Migration Guide + Upgrading From SparkR 1.5.x to 1.6.x + Upgrading From SparkR 1.6.x to 2.0 + Upgrading to SparkR 2.1.0 @@ -202,7 +202,7 @@ You can create a SparkSession using sparkR.session and -sparkR.session() +sparkR.session() @@ -223,11 +223,11 @@ them, pass them as you would other configuration properties in the sparkCo -if (nchar(Sys.getenv(SPARK_HOME)) 1) { +if (nchar(Sys.getenv(SPARK_HOME)) 1) { Sys.setenv(SPARK_HOME = /home/spark) } library(SparkR, lib.loc = c(file.path(Sys.getenv(SPARK_HOME), R, lib))) -sparkR.session(master = local[*], sparkConfig = list(spark.driver.memory = 2g)) +sparkR.session(master = local[*], sparkConfig = list(spark.driver.memory = 2g)) @@ -282,14 +282,14 @@ sparkR.session(master = - df - as.DataFrame(faithful) + df - as.DataFrame(faithful) # Displays the first part of the SparkDataFrame head(df) ## eruptions waiting ##1 3.600 79 ##2 1.800 54 -##3 3.333 74 +##3 3.333 74 @@ -303,7 +303,7 @@ specifying --packages with spark-submit or spark - sparkR.session(sparkPackages = com.databricks:spark-avro_2.11:3.0.0) + sparkR.session(sparkPackages = com.databricks:spark-avro_2.11:3.0.0) @@ -311,7 +311,7 @@ specifying --packages with spark-submit or spark - people - read.df(./examples/src/main/resources/people.json, json) + people - read.df(./examples/src/main/resources/people.json, json) head(people) ## agename ##1 NA Michael @@ -325,7 +325,7 @@ printSchema(people) # |-- name: string (nullable = true) # Similarly, multiple files can be read with read.json -people - read.json(c(./examples/src/main/resources/people.json, ./examples/src/main/resources/people2.json)) +people - 
read.json(c(./examples/src/main/resources/people.json, ./examples/src/main/resources/people2.json)) @@ -333,7 +333,7 @@ people - read.json( - df - read.df(csvPath, csv, header = true, inferSchema = true, na.strings = NA) + df - read.df(csvPath, csv, header
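For comparison with the SparkR `read.df`/`read.json` calls quoted above, a rough Scala sketch of the same data-source reads; paths and options are illustrative, and it assumes an existing SparkSession `spark`.

```scala
// Read a JSON file into a DataFrame (path is illustrative).
val people = spark.read.json("examples/src/main/resources/people.json")
people.printSchema()

// Read a CSV file, treating the first line as a header, inferring column types,
// and interpreting the string "NA" as null (mirrors na.strings = "NA" in SparkR).
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("nullValue", "NA")
  .csv("path/to/file.csv")   // illustrative path
df.show(5)
```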
[18/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-tuning.html -- diff --git a/site/docs/2.1.0/ml-tuning.html b/site/docs/2.1.0/ml-tuning.html index 0c36a98..2246cc2 100644 --- a/site/docs/2.1.0/ml-tuning.html +++ b/site/docs/2.1.0/ml-tuning.html @@ -329,13 +329,13 @@ Built-in Cross-Validation and other tooling allow users to optimize hyperparamet Table of contents - Model selection (a.k.a. hyperparameter tuning) - Cross-Validation - Example: model selection via cross-validation + Model selection (a.k.a. hyperparameter tuning) + Cross-Validation + Example: model selection via cross-validation - Train-Validation Split - Example: model selection via train validation split + Train-Validation Split + Example: model selection via train validation split @@ -396,7 +396,7 @@ However, it is also a well-established method for choosing parameters which is m Refer to the [`CrossValidator` Scala docs](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) for details on the API. -import org.apache.spark.ml.Pipeline +import org.apache.spark.ml.Pipeline import org.apache.spark.ml.classification.LogisticRegression import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator import org.apache.spark.ml.feature.{HashingTF, Tokenizer} @@ -467,7 +467,7 @@ Refer to the [`CrossValidator` Scala docs](api/scala/index.html#org.apache.spark .select(id, text, probability, prediction) .collect() .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) = -println(s($id, $text) -- prob=$prob, prediction=$prediction) +println(s($id, $text) -- prob=$prob, prediction=$prediction) } Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala" in the Spark repo. @@ -476,7 +476,7 @@ Refer to the [`CrossValidator` Scala docs](api/scala/index.html#org.apache.spark Refer to the [`CrossValidator` Java docs](api/java/org/apache/spark/ml/tuning/CrossValidator.html) for details on the API. -import java.util.Arrays; +import java.util.Arrays; import org.apache.spark.ml.Pipeline; import org.apache.spark.ml.PipelineStage; @@ -493,38 +493,38 @@ Refer to the [`CrossValidator` Java docs](api/java/org/apache/spark/ml/tuning/Cr // Prepare training documents, which are labeled. 
DatasetRow training = spark.createDataFrame(Arrays.asList( - new JavaLabeledDocument(0L, a b c d e spark, 1.0), - new JavaLabeledDocument(1L, b d, 0.0), - new JavaLabeledDocument(2L,spark f g h, 1.0), - new JavaLabeledDocument(3L, hadoop mapreduce, 0.0), - new JavaLabeledDocument(4L, b spark who, 1.0), - new JavaLabeledDocument(5L, g d a y, 0.0), - new JavaLabeledDocument(6L, spark fly, 1.0), - new JavaLabeledDocument(7L, was mapreduce, 0.0), - new JavaLabeledDocument(8L, e spark program, 1.0), - new JavaLabeledDocument(9L, a e c l, 0.0), - new JavaLabeledDocument(10L, spark compile, 1.0), - new JavaLabeledDocument(11L, hadoop software, 0.0) + new JavaLabeledDocument(0L, a b c d e spark, 1.0), + new JavaLabeledDocument(1L, b d, 0.0), + new JavaLabeledDocument(2L,spark f g h, 1.0), + new JavaLabeledDocument(3L, hadoop mapreduce, 0.0), + new JavaLabeledDocument(4L, b spark who, 1.0), + new JavaLabeledDocument(5L, g d a y, 0.0), + new JavaLabeledDocument(6L, spark fly, 1.0), + new JavaLabeledDocument(7L, was mapreduce, 0.0), + new JavaLabeledDocument(8L, e spark program, 1.0), + new JavaLabeledDocument(9L, a e c l, 0.0), + new JavaLabeledDocument(10L, spark compile, 1.0), + new JavaLabeledDocument(11L, hadoop software, 0.0) ), JavaLabeledDocument.class); // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr. -Tokenizer tokenizer = new Tokenizer() +Tokenizer tokenizer = new Tokenizer() .setInputCol(text) .setOutputCol(words); -HashingTF hashingTF = new HashingTF() +HashingTF hashingTF = new HashingTF() .setNumFeatures(1000) .setInputCol(tokenizer.getOutputCol()) .setOutputCol(features); -LogisticRegression lr = new LogisticRegression() +LogisticRegression lr = new LogisticRegression() .setMaxIter(10) .setRegParam(0.01); -Pipeline pipeline = new Pipeline() +Pipeline pipeline = new Pipeline() .setStages(new PipelineStage[] {tokenizer, hashingTF, lr}); // We use a ParamGridBuilder to construct a grid of parameters to search over. // With 3 values for hashingTF.numFeatures and 2 values for lr.regParam, // this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from. -ParamMap[] paramGrid = new ParamGridBuilder() +ParamMap[] paramGrid = new ParamGridBuilder() .addGrid(hashingTF.numFeatures(), new int[] {10, 100, 1000}) .addGrid(lr.regParam(), new double[] {0.1, 0.01}) .build(); @@ -534,9 +534,9 @@ Refer to the [`CrossValidator`
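A condensed Scala sketch of the cross-validation flow that this excerpt walks through. The toy documents and grid values are illustrative, and it assumes an existing SparkSession `spark`.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Toy labeled documents (id, text, label); real input would be much larger.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Pipeline: tokenizer -> hashingTF -> logistic regression.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// 3 values for numFeatures x 2 values for regParam = 6 settings to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // use 3+ folds in practice; 2 keeps this toy run cheap

val cvModel = cv.fit(training)
```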
[04/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/streaming-programming-guide.html -- diff --git a/site/docs/2.1.0/streaming-programming-guide.html b/site/docs/2.1.0/streaming-programming-guide.html index 9a87d23..b1ce1e1 100644 --- a/site/docs/2.1.0/streaming-programming-guide.html +++ b/site/docs/2.1.0/streaming-programming-guide.html @@ -129,32 +129,32 @@ - Overview - A Quick Example - Basic Concepts - Linking - Initializing StreamingContext - Discretized Streams (DStreams) - Input DStreams and Receivers - Transformations on DStreams - Output Operations on DStreams - DataFrame and SQL Operations - MLlib Operations - Caching / Persistence - Checkpointing - Accumulators, Broadcast Variables, and Checkpoints - Deploying Applications - Monitoring Applications + Overview + A Quick Example + Basic Concepts + Linking + Initializing StreamingContext + Discretized Streams (DStreams) + Input DStreams and Receivers + Transformations on DStreams + Output Operations on DStreams + DataFrame and SQL Operations + MLlib Operations + Caching / Persistence + Checkpointing + Accumulators, Broadcast Variables, and Checkpoints + Deploying Applications + Monitoring Applications - Performance Tuning - Reducing the Batch Processing Times - Setting the Right Batch Interval - Memory Tuning + Performance Tuning + Reducing the Batch Processing Times + Setting the Right Batch Interval + Memory Tuning - Fault-tolerance Semantics - Where to Go from Here + Fault-tolerance Semantics + Where to Go from Here Overview @@ -209,7 +209,7 @@ conversions from StreamingContext into our environment in order to add useful me other classes we need (like DStream). StreamingContext is the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads, and a batch interval of 1 second. -import org.apache.spark._ +import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3 @@ -217,33 +217,33 @@ main entry point for all streaming functionality. We create a local StreamingCon // The master requires 2 cores to prevent from a starvation scenario. val conf = new SparkConf().setMaster(local[2]).setAppName(NetworkWordCount) -val ssc = new StreamingContext(conf, Seconds(1)) +val ssc = new StreamingContext(conf, Seconds(1)) Using this context, we can create a DStream that represents streaming data from a TCP source, specified as hostname (e.g. localhost) and port (e.g. ). -// Create a DStream that will connect to hostname:port, like localhost: -val lines = ssc.socketTextStream(localhost, ) +// Create a DStream that will connect to hostname:port, like localhost: +val lines = ssc.socketTextStream(localhost, ) This lines DStream represents the stream of data that will be received from the data server. Each record in this DStream is a line of text. Next, we want to split the lines by space characters into words. -// Split each line into words -val words = lines.flatMap(_.split( )) +// Split each line into words +val words = lines.flatMap(_.split( )) flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new records from each record in the source DStream. In this case, each line will be split into multiple words and the stream of words is represented as the words DStream. Next, we want to count these words. 
-import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3 +import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3 // Count each word in each batch val pairs = words.map(word = (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) // Print the first ten elements of each RDD generated in this DStream to the console -wordCounts.print() +wordCounts.print() The words DStream is further mapped (one-to-one transformation) to a DStream of (word, 1) pairs, which is then reduced to get the frequency of words in each batch of data. @@ -253,8 +253,8 @@ Finally, wordCounts.print() will print a few of the counts generate will perform when it is started, and no real processing has started yet. To start the processing after all the transformations have been setup, we finally call -ssc.start() // Start the computation -ssc.awaitTermination() // Wait for the computation to terminate +ssc.start() // Start the computation +ssc.awaitTermination() // Wait for the computation to terminate The
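Assembled from the fragments above, a minimal sketch of the DStream word count; the port number is a placeholder, since it is elided in the excerpt.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Local StreamingContext with two worker threads and a batch interval of 1 second.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

// DStream of text lines received from a TCP source (host and port are placeholders).
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words, then count each word in each batch.
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console.
wordCounts.print()

ssc.start()             // start the computation
ssc.awaitTermination()  // wait for the computation to terminate
```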
[15/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-data-types.html -- diff --git a/site/docs/2.1.0/mllib-data-types.html b/site/docs/2.1.0/mllib-data-types.html index 546d921..f7b5358 100644 --- a/site/docs/2.1.0/mllib-data-types.html +++ b/site/docs/2.1.0/mllib-data-types.html @@ -307,14 +307,14 @@ - Local vector - Labeled point - Local matrix - Distributed matrix - RowMatrix - IndexedRowMatrix - CoordinateMatrix - BlockMatrix + Local vector + Labeled point + Local matrix + Distributed matrix + RowMatrix + IndexedRowMatrix + CoordinateMatrix + BlockMatrix @@ -347,14 +347,14 @@ using the factory methods implemented in Refer to the Vector Scala docs and Vectors Scala docs for details on the API. -import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.mllib.linalg.{Vector, Vectors} // Create a dense vector (1.0, 0.0, 3.0). val dv: Vector = Vectors.dense(1.0, 0.0, 3.0) // Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries. val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)) // Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries. -val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0))) +val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0))) Note: Scala imports scala.collection.immutable.Vector by default, so you have to import @@ -373,13 +373,13 @@ using the factory methods implemented in Refer to the Vector Java docs and Vectors Java docs for details on the API. -import org.apache.spark.mllib.linalg.Vector; +import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.linalg.Vectors; // Create a dense vector (1.0, 0.0, 3.0). Vector dv = Vectors.dense(1.0, 0.0, 3.0); // Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries. -Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}); +Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}); @@ -405,18 +405,18 @@ in Ve Refer to the Vectors Python docs for more details on the API. -import numpy as np +import numpy as np import scipy.sparse as sps from pyspark.mllib.linalg import Vectors -# Use a NumPy array as a dense vector. +# Use a NumPy array as a dense vector. dv1 = np.array([1.0, 0.0, 3.0]) -# Use a Python list as a dense vector. +# Use a Python list as a dense vector. dv2 = [1.0, 0.0, 3.0] -# Create a SparseVector. +# Create a SparseVector. sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0]) -# Use a single-column SciPy csc_matrix as a sparse vector. -sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1)) +# Use a single-column SciPy csc_matrix as a sparse vector. +sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1)) @@ -438,14 +438,14 @@ For multiclass classification, labels should be class indices starting from zero Refer to the LabeledPoint Scala docs for details on the API. -import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint // Create a labeled point with a positive label and a dense feature vector. val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)) // Create a labeled point with a negative label and a sparse feature vector. 
-val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))) +val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))) @@ -456,14 +456,14 @@ For multiclass classification, labels should be class indices starting from zero Refer to the LabeledPoint Java docs for details on the API. -import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.mllib.linalg.Vectors; import org.apache.spark.mllib.regression.LabeledPoint; // Create a labeled point with a positive label and a dense feature vector. -LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)); +LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)); // Create a labeled point with a negative label and a sparse feature vector. -LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0})); +LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0})); @@ -474,14 +474,14 @@ For multiclass classification, labels should be class indices starting from zero Refer to the LabeledPoint Python docs for more details on the API. -from pyspark.mllib.linalg import SparseVector +from pyspark.mllib.linalg
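The local vector and labeled point snippets from this excerpt, gathered into one self-contained Scala sketch.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Sparse vector (1.0, 0.0, 3.0): indices and values of the nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Sparse vector (1.0, 0.0, 3.0): sequence of (index, value) pairs.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))

// Labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
```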
[03/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/structured-streaming-kafka-integration.html -- diff --git a/site/docs/2.1.0/structured-streaming-kafka-integration.html b/site/docs/2.1.0/structured-streaming-kafka-integration.html index 5ca9259..7d2254f 100644 --- a/site/docs/2.1.0/structured-streaming-kafka-integration.html +++ b/site/docs/2.1.0/structured-streaming-kafka-integration.html @@ -144,7 +144,7 @@ application. See the Deploying subsection below. -// Subscribe to 1 topic +// Subscribe to 1 topic val ds1 = spark .readStream .format(kafka) @@ -172,12 +172,12 @@ application. See the Deploying subsection below. .option(subscribePattern, topic.*) .load() ds3.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) - .as[(String, String)] + .as[(String, String)] -// Subscribe to 1 topic +// Subscribe to 1 topic DatasetRow ds1 = spark .readStream() .format(kafka) @@ -202,43 +202,43 @@ application. See the Deploying subsection below. .option(kafka.bootstrap.servers, host1:port1,host2:port2) .option(subscribePattern, topic.*) .load() -ds3.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) +ds3.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) -# Subscribe to 1 topic +# Subscribe to 1 topic ds1 = spark .readStream() - .format(kafka) - .option(kafka.bootstrap.servers, host1:port1,host2:port2) - .option(subscribe, topic1) + .format(kafka) + .option(kafka.bootstrap.servers, host1:port1,host2:port2) + .option(subscribe, topic1) .load() -ds1.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) +ds1.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) -# Subscribe to multiple topics +# Subscribe to multiple topics ds2 = spark .readStream - .format(kafka) - .option(kafka.bootstrap.servers, host1:port1,host2:port2) - .option(subscribe, topic1,topic2) + .format(kafka) + .option(kafka.bootstrap.servers, host1:port1,host2:port2) + .option(subscribe, topic1,topic2) .load() -ds2.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) +ds2.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) -# Subscribe to a pattern +# Subscribe to a pattern ds3 = spark .readStream() - .format(kafka) - .option(kafka.bootstrap.servers, host1:port1,host2:port2) - .option(subscribePattern, topic.*) + .format(kafka) + .option(kafka.bootstrap.servers, host1:port1,host2:port2) + .option(subscribePattern, topic.*) .load() -ds3.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) +ds3.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) -Each row in the source has the following schema: - +Each row in the source has the following schema: +table class="table" ColumnType key @@ -268,7 +268,7 @@ application. See the Deploying subsection below. timestampType int - +/table The following options must be set for the Kafka source. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
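A readable Scala sketch of the "subscribe to one topic" pattern shown in the Kafka integration excerpt; the bootstrap servers and topic names are placeholders, and it assumes an existing SparkSession `spark`.

```scala
import spark.implicits._

// Subscribe to one topic; servers and topic name are placeholders.
val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

// Each row carries key/value as binary plus topic, partition, offset and timestamp columns;
// cast key and value to strings for downstream processing.
val kv = ds1
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Multiple topics can be matched with a pattern instead,
// e.g. .option("subscribePattern", "topic.*").
```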
[16/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-collaborative-filtering.html -- diff --git a/site/docs/2.1.0/mllib-collaborative-filtering.html b/site/docs/2.1.0/mllib-collaborative-filtering.html index e453032..b3f9e08 100644 --- a/site/docs/2.1.0/mllib-collaborative-filtering.html +++ b/site/docs/2.1.0/mllib-collaborative-filtering.html @@ -322,13 +322,13 @@ - Collaborative filtering - Explicit vs. implicit feedback - Scaling of the regularization parameter + Collaborative filtering + Explicit vs. implicit feedback + Scaling of the regularization parameter - Examples - Tutorial + Examples + Tutorial Collaborative filtering @@ -393,7 +393,7 @@ recommendation model by measuring the Mean Squared Error of rating prediction.Refer to the ALS Scala docs for more details on the API. -import org.apache.spark.mllib.recommendation.ALS +import org.apache.spark.mllib.recommendation.ALS import org.apache.spark.mllib.recommendation.MatrixFactorizationModel import org.apache.spark.mllib.recommendation.Rating @@ -434,9 +434,9 @@ recommendation model by measuring the Mean Squared Error of rating prediction.If the rating matrix is derived from another source of information (i.e. it is inferred from other signals), you can use the trainImplicit method to get better results. -val alpha = 0.01 +val alpha = 0.01 val lambda = 0.01 -val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha) +val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha) @@ -449,7 +449,7 @@ that is equivalent to the provided example in Scala is given below: Refer to the ALS Java docs for more details on the API. -import scala.Tuple2; +import scala.Tuple2; import org.apache.spark.api.java.*; import org.apache.spark.api.java.function.Function; @@ -458,8 +458,8 @@ that is equivalent to the provided example in Scala is given below: import org.apache.spark.mllib.recommendation.Rating; import org.apache.spark.SparkConf; -SparkConf conf = new SparkConf().setAppName(Java Collaborative Filtering Example); -JavaSparkContext jsc = new JavaSparkContext(conf); +SparkConf conf = new SparkConf().setAppName(Java Collaborative Filtering Example); +JavaSparkContext jsc = new JavaSparkContext(conf); // Load and parse the data String path = data/mllib/als/test.data; @@ -468,7 +468,7 @@ that is equivalent to the provided example in Scala is given below: new FunctionString, Rating() { public Rating call(String s) { String[] sarray = s.split(,); - return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]), + return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]), Double.parseDouble(sarray[2])); } } @@ -528,36 +528,36 @@ recommendation by measuring the Mean Squared Error of rating prediction. Refer to the ALS Python docs for more details on the API. 
-from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating +from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating -# Load and parse the data -data = sc.textFile(data/mllib/als/test.data) -ratings = data.map(lambda l: l.split(,))\ +# Load and parse the data +data = sc.textFile(data/mllib/als/test.data) +ratings = data.map(lambda l: l.split(,))\ .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) -# Build the recommendation model using Alternating Least Squares +# Build the recommendation model using Alternating Least Squares rank = 10 numIterations = 10 model = ALS.train(ratings, rank, numIterations) -# Evaluate the model on training data +# Evaluate the model on training data testdata = ratings.map(lambda p: (p[0], p[1])) predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2])) ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions) MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean() -print(Mean Squared Error = + str(MSE)) +print(Mean Squared Error = + str(MSE)) -# Save and load model -model.save(sc, target/tmp/myCollaborativeFilter) -sameModel = MatrixFactorizationModel.load(sc, target/tmp/myCollaborativeFilter) +# Save and load model +model.save(sc, target/tmp/myCollaborativeFilter) +sameModel = MatrixFactorizationModel.load(sc, target/tmp/myCollaborativeFilter) Find full example code at "examples/src/main/python/mllib/recommendation_example.py" in the Spark repo. If the rating matrix is derived from other source of information (i.e. it is inferred from other signals), you can use the trainImplicit method to get better results. -# Build the recommendation model using Alternating Least Squares based on implicit ratings -model = ALS.trainImplicit(ratings, rank,
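A Scala sketch of the ALS train-and-evaluate flow that this excerpt describes; the data path is the one from the docs, and it assumes an existing SparkContext `sc`.

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Load and parse ratings of the form "user,product,rating" (path is illustrative).
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using alternating least squares.
val rank = 10
val numIterations = 10
val lambda = 0.01
val model = ALS.train(ratings, rank, numIterations, lambda)

// Evaluate the model by the mean squared error of the predicted ratings.
val usersProducts = ratings.map { case Rating(user, product, _) => (user, product) }
val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}
val ratesAndPreds = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)
val mse = ratesAndPreds.map { case (_, (r, p)) => math.pow(r - p, 2) }.mean()
println(s"Mean Squared Error = $mse")

// For implicit-feedback data, ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)
// generally gives better results than ALS.train.
```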
[GitHub] spark issue #16424: [SPARK-19016][SQL][DOC] Document scalable partition hand...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16424 @ericl @mallman and @cloud-fan want to take a look?
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.1.0 [created] cd0a08361
[GitHub] spark pull request #16423: Update known_translations for contributor names a...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/16423 Update known_translations for contributor names and also fix a small issue in translate-contributors.py ## What changes were proposed in this pull request? This PR updates dev/create-release/known_translations to add more contributor name mapping. It also fixes a small issue in translate-contributors.py ## How was this patch tested? n/a You can merge this pull request into a Git repository by running: $ git pull https://github.com/yhuai/spark contributors Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16423.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16423 commit 9ce7ef88d98e3aa5d10d761af1967912517a97e1 Author: Yin Huai <yh...@databricks.com> Date: 2016-12-28T18:38:04Z Update contributor list and also fix a small issue in translate-contributors.py
[GitHub] spark issue #16391: [SPARK-18990][SQL] make DatasetBenchmark fairer for Data...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16391 Just reverted from master. btw, I think we can keep existing cases and then add new cases.
spark git commit: Revert "[SPARK-18990][SQL] make DatasetBenchmark fairer for Dataset"
Repository: spark Updated Branches: refs/heads/master a05cc425a -> 2404d8e54 Revert "[SPARK-18990][SQL] make DatasetBenchmark fairer for Dataset" This reverts commit a05cc425a0a7d18570b99883993a04ad175aa071. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2404d8e5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2404d8e5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2404d8e5 Branch: refs/heads/master Commit: 2404d8e54b6b2cfc78d892e7ebb31578457518a3 Parents: a05cc42 Author: Yin HuaiAuthored: Tue Dec 27 10:03:52 2016 -0800 Committer: Yin Huai Committed: Tue Dec 27 10:03:52 2016 -0800 -- .../org/apache/spark/sql/DatasetBenchmark.scala | 75 +--- 1 file changed, 33 insertions(+), 42 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2404d8e5/sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala index cd925e6..66d94d6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala @@ -17,6 +17,7 @@ package org.apache.spark.sql +import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.sql.expressions.scalalang.typed import org.apache.spark.sql.functions._ @@ -33,13 +34,11 @@ object DatasetBenchmark { def backToBackMap(spark: SparkSession, numRows: Long, numChains: Int): Benchmark = { import spark.implicits._ -val rdd = spark.sparkContext.range(0, numRows) -val ds = spark.range(0, numRows) -val df = ds.toDF("l") -val func = (l: Long) => l + 1 - +val df = spark.range(1, numRows).select($"id".as("l"), $"id".cast(StringType).as("s")) val benchmark = new Benchmark("back-to-back map", numRows) +val func = (d: Data) => Data(d.l + 1, d.s) +val rdd = spark.sparkContext.range(1, numRows).map(l => Data(l, l.toString)) benchmark.addCase("RDD") { iter => var res = rdd var i = 0 @@ -54,14 +53,14 @@ object DatasetBenchmark { var res = df var i = 0 while (i < numChains) { -res = res.select($"l" + 1 as "l") +res = res.select($"l" + 1 as "l", $"s") i += 1 } res.queryExecution.toRdd.foreach(_ => Unit) } benchmark.addCase("Dataset") { iter => - var res = ds.as[Long] + var res = df.as[Data] var i = 0 while (i < numChains) { res = res.map(func) @@ -76,14 +75,14 @@ object DatasetBenchmark { def backToBackFilter(spark: SparkSession, numRows: Long, numChains: Int): Benchmark = { import spark.implicits._ -val rdd = spark.sparkContext.range(0, numRows) -val ds = spark.range(0, numRows) -val df = ds.toDF("l") -val func = (l: Long, i: Int) => l % (100L + i) == 0L -val funcs = 0.until(numChains).map { i => (l: Long) => func(l, i) } - +val df = spark.range(1, numRows).select($"id".as("l"), $"id".cast(StringType).as("s")) val benchmark = new Benchmark("back-to-back filter", numRows) +val func = (d: Data, i: Int) => d.l % (100L + i) == 0L +val funcs = 0.until(numChains).map { i => + (d: Data) => func(d, i) +} +val rdd = spark.sparkContext.range(1, numRows).map(l => Data(l, l.toString)) benchmark.addCase("RDD") { iter => var res = rdd var i = 0 @@ -105,7 +104,7 @@ object DatasetBenchmark { } benchmark.addCase("Dataset") { iter => - var res = ds.as[Long] + var res = df.as[Data] var i = 0 while (i < numChains) { res = res.filter(funcs(i)) @@ -134,29 +133,24 @@ object DatasetBenchmark { def aggregate(spark: 
SparkSession, numRows: Long): Benchmark = { import spark.implicits._ -val rdd = spark.sparkContext.range(0, numRows) -val ds = spark.range(0, numRows) -val df = ds.toDF("l") - +val df = spark.range(1, numRows).select($"id".as("l"), $"id".cast(StringType).as("s")) val benchmark = new Benchmark("aggregate", numRows) +val rdd = spark.sparkContext.range(1, numRows).map(l => Data(l, l.toString)) benchmark.addCase("RDD sum") { iter => - rdd.map(l => (l % 10, l)).reduceByKey(_ + _).foreach(_ => Unit) + rdd.aggregate(0L)(_ + _.l, _ + _) } benchmark.addCase("DataFrame sum") { iter => - df.groupBy($"l" % 10).agg(sum($"l")).queryExecution.toRdd.foreach(_ => Unit) + df.select(sum($"l")).queryExecution.toRdd.foreach(_ => Unit) } benchmark.addCase("Dataset sum using
[GitHub] spark issue #16391: [SPARK-18990][SQL] make DatasetBenchmark fairer for Data...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16391 Seems there is no explicit LGTM. I am reverting this change from master.
[GitHub] spark pull request #16391: [SPARK-18990][SQL] make DatasetBenchmark fairer f...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16391#discussion_r93955613 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala --- @@ -133,24 +134,29 @@ object DatasetBenchmark { def aggregate(spark: SparkSession, numRows: Long): Benchmark = { import spark.implicits._ -val df = spark.range(1, numRows).select($"id".as("l"), $"id".cast(StringType).as("s")) +val rdd = spark.sparkContext.range(0, numRows) +val ds = spark.range(0, numRows) +val df = ds.toDF("l") + val benchmark = new Benchmark("aggregate", numRows) -val rdd = spark.sparkContext.range(1, numRows).map(l => Data(l, l.toString)) benchmark.addCase("RDD sum") { iter => - rdd.aggregate(0L)(_ + _.l, _ + _) + rdd.map(l => (l % 10, l)).reduceByKey(_ + _).foreach(_ => Unit) --- End diff -- Actually, I think we should also have a test case for aggregation without group by. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
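A rough sketch of what such an "aggregation without group by" case could look like, following the style of the surrounding DatasetBenchmark code. The names `rdd`, `ds`, `df` and `benchmark` are the values already defined in the `aggregate` method shown in the diff; this is a hypothetical illustration, not the change that was actually made.

```scala
// Hypothetical extra cases for DatasetBenchmark#aggregate, assuming the imports
// (spark.implicits._, functions._, scalalang.typed) and the rdd/ds/df/benchmark
// values already present in that method.
benchmark.addCase("RDD sum w/o group by") { iter =>
  rdd.fold(0L)(_ + _)
}

benchmark.addCase("DataFrame sum w/o group by") { iter =>
  df.select(sum($"l")).queryExecution.toRdd.foreach(_ => Unit)
}

benchmark.addCase("Dataset sum w/o group by") { iter =>
  ds.as[Long].select(typed.sumLong[Long]((l: Long) => l))
    .queryExecution.toRdd.foreach(_ => Unit)
}
```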