[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96566171 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { } } +class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists) +if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) => + InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists) + +case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => + // Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde + // tables yet. + if (mode == SaveMode.Append) { +throw new AnalysisException( + "CTAS for hive serde tables does not support append semantics.") + } + + val dbName = tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase) + CreateHiveTableAsSelectCommand( +tableDesc.copy(identifier = tableDesc.identifier.copy(database = Some(dbName))), +query, +mode == SaveMode.Ignore) + } + + /** + * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule + * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to + * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and --- End diff -- I think this version is good for now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96549456 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = None --- End diff -- ok. Let's throw an exception at here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
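[Editor's note] For readers following the thread, below is a minimal sketch of what "throw an exception here" could look like in the `inferSchema` method quoted above. The exception type and message are assumptions, not the PR's final code; `prepareWrite` is elided because it is unchanged from the diff.

```scala
class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat {

  // Fail fast instead of silently returning None: the read path for Hive serde
  // tables is not implemented by this FileFormat yet.
  override def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] = {
    throw new UnsupportedOperationException(
      "inferSchema is not supported for hive data source.")
  }

  // prepareWrite(...) stays as in the diff above.
}
```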
[GitHub] spark issue #16628: Update known_translations for contributor names
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16628 cc @lw-lin
[GitHub] spark pull request #16628: Update known_translations for contributor names
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/16628

Update known_translations for contributor names

## What changes were proposed in this pull request?

Update known_translations per https://github.com/apache/spark/pull/16423#issuecomment-269739634

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark known_translations

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16628.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16628

commit bf2f532a8b7ef60d1542415c0d6dacd8571e7bef
Author: Yin Huai <yh...@databricks.com>
Date: 2017-01-18T00:50:08Z

Update known_translations for contributor names
[GitHub] spark issue #16423: Update known_translations for contributor names and also...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16423 Sure.
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96316695 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { } } +class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists) +if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) => + InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists) + +case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => + // Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde + // tables yet. + if (mode == SaveMode.Append) { +throw new AnalysisException( + "CTAS for hive serde tables does not support append semantics.") + } + + val dbName = tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase) + CreateHiveTableAsSelectCommand( +tableDesc.copy(identifier = tableDesc.identifier.copy(database = Some(dbName))), +query, +mode == SaveMode.Ignore) + } + + /** + * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule + * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to + * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and --- End diff -- This rule is in the same batch with PreprocessTableInsertion, right? If so, we cannot guarantee that PreprocessTableInsertion will always fire first for a command before InsertIntoTable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
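[Editor's note] To make the ordering concern concrete: the guard `hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query)` is what keeps `HiveAnalysis` from firing before `PreprocessTableInsertion` has normalized the plan. The sketch below is a rough illustration of what such a guard can verify, inferred from the call site in the diff; it is not the PR's exact implementation.

```scala
// Rough illustration only; the real helper in the PR may differ.
private def hasBeenPreprocessed(
    tableOutput: Seq[Attribute],
    partSchema: StructType,
    partSpec: Map[String, Option[String]],
    query: LogicalPlan): Boolean = {
  val partColNames = partSchema.map(_.name).toSet
  query.resolved &&
    // Every key in the partition spec must already be a normalized partition column name.
    partSpec.keys.forall(partColNames.contains) && {
      // The query must produce exactly the columns that are not statically partitioned.
      val staticPartCols = partSpec.filter(_._2.isDefined).keySet
      val expectedColumns = tableOutput.filterNot(a => staticPartCols.contains(a.name))
      expectedColumns.map(_.dataType) == query.output.map(_.dataType)
    }
}
```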
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96317272 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = None + + override def prepareWrite( + sparkSession: SparkSession, + job: Job, + options: Map[String, String], + dataSchema: StructType): OutputWriterFactory = { +val conf = job.getConfiguration +val tableDesc = fileSinkConf.getTableInfo +conf.set("mapred.output.format.class", tableDesc.getOutputFileFormatClassName) + +// Add table properties from storage handler to hadoopConf, so any custom storage +// handler settings can be set to hadoopConf +HiveTableUtil.configureJobPropertiesForStorageHandler(tableDesc, conf, false) +Utilities.copyTableJobPropertiesToConf(tableDesc, conf) + +// Avoid referencing the outer object. 
+val fileSinkConfSer = fileSinkConf +new OutputWriterFactory { + private val jobConf = new SerializableJobConf(new JobConf(conf)) + @transient private lazy val outputFormat = + jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]] + + override def getFileExtension(context: TaskAttemptContext): String = { +Utilities.getFileExtension(jobConf.value, fileSinkConfSer.getCompressed, outputFormat) + } + + override def newInstance( + path: String, + dataSchema: StructType, + context: TaskAttemptContext): OutputWriter = { +new HiveOutputWriter(path, fileSinkConfSer, jobConf.value, dataSchema) + } +} --- End diff -- Should we just create a class instead of using an anonymous class? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
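[Editor's note] A sketch of the named-class alternative being suggested, obtained by lifting the anonymous `OutputWriterFactory` from the diff into its own class. The name `HiveOutputWriterFactory` and its constructor are assumptions; `FileSinkDesc` is the `ShimFileSinkDesc` alias from the imports above.

```scala
// Hypothetical named factory replacing the anonymous class. prepareWrite would build it as
//   new HiveOutputWriterFactory(new SerializableJobConf(new JobConf(conf)), fileSinkConf)
// which also avoids capturing the enclosing HiveFileFormat instance.
class HiveOutputWriterFactory(
    jobConf: SerializableJobConf,
    fileSinkConf: FileSinkDesc) extends OutputWriterFactory {

  @transient private lazy val outputFormat =
    jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]

  override def getFileExtension(context: TaskAttemptContext): String = {
    Utilities.getFileExtension(jobConf.value, fileSinkConf.getCompressed, outputFormat)
  }

  override def newInstance(
      path: String,
      dataSchema: StructType,
      context: TaskAttemptContext): OutputWriter = {
    new HiveOutputWriter(path, fileSinkConf, jobConf.value, dataSchema)
  }
}
```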
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96317211 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = None + + override def prepareWrite( + sparkSession: SparkSession, + job: Job, + options: Map[String, String], + dataSchema: StructType): OutputWriterFactory = { --- End diff -- Want to comment the original source of code in this function? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96316863 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { } } +class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists) +if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) => + InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists) + +case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => + // Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde + // tables yet. + if (mode == SaveMode.Append) { +throw new AnalysisException( + "CTAS for hive serde tables does not support append semantics.") + } + + val dbName = tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase) + CreateHiveTableAsSelectCommand( +tableDesc.copy(identifier = tableDesc.identifier.copy(database = Some(dbName))), +query, +mode == SaveMode.Ignore) + } + + /** + * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule + * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to + * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and --- End diff -- Or, you mean that we use this function to determine if PreprocessTableInsertion has fired? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96316302 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala --- @@ -108,35 +108,30 @@ class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) { /** - * Returns the result as a hive compatible sequence of strings. For native commands, the - * execution is simply passed back to Hive. + * Returns the result as a hive compatible sequence of strings. This is for testing only. */ def hiveResultString(): Seq[String] = executedPlan match { case ExecutedCommandExec(desc: DescribeTableCommand) => - SQLExecution.withNewExecutionId(sparkSession, this) { --- End diff -- Explain the reason that `SQLExecution.withNewExecutionId(sparkSession, this)` is not needed? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96316982 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { } } +class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists) +if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) => + InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists) + +case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => + // Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde + // tables yet. + if (mode == SaveMode.Append) { +throw new AnalysisException( + "CTAS for hive serde tables does not support append semantics.") + } + + val dbName = tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase) + CreateHiveTableAsSelectCommand( +tableDesc.copy(identifier = tableDesc.identifier.copy(database = Some(dbName))), +query, +mode == SaveMode.Ignore) + } + + /** + * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule + * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to + * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and --- End diff -- Should this function actually be part of the resolved method of InsertIntoTable? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96317150 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = None --- End diff -- Is it safe to return None? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16528: [SPARK-19148][SQL] do not expose the external table conc...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16528 looks good to me. If possible, I'd like to get https://github.com/apache/spark/pull/16528/files#r96314156 reverted.
[GitHub] spark pull request #16528: [SPARK-19148][SQL] do not expose the external tab...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16528#discussion_r96314156 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -131,17 +131,15 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR return 0; """ }, - foldFunctions = { funCalls => -funCalls.zipWithIndex.map { case (funCall, i) => - val comp = ctx.freshName("comp") - s""" -int $comp = $funCall; -if ($comp != 0) { - return $comp; -} - """ -}.mkString - }) + foldFunctions = _.map { funCall => --- End diff -- How about we revert this change? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16561: [SPARK-18801][SQL][FOLLOWUP] Alias the view with ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16561#discussion_r96313495 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -28,22 +28,60 @@ import org.apache.spark.sql.catalyst.rules.Rule */ /** - * Make sure that a view's child plan produces the view's output attributes. We wrap the child - * with a Project and add an alias for each output attribute. The attributes are resolved by - * name. This should be only done after the batch of Resolution, because the view attributes are - * not completely resolved during the batch of Resolution. + * Make sure that a view's child plan produces the view's output attributes. We try to wrap the + * child by: + * 1. Generate the `queryOutput` by: + *1.1. If the query column names are defined, map the column names to attributes in the child + * output by name(This is mostly for handling view queries like SELECT * FROM ..., the + * schema of the referenced table/view may change after the view has been created, so we + * have to save the output of the query to `viewQueryColumnNames`, and restore them during + * view resolution, in this way, we are able to get the correct view column ordering and + * omit the extra columns that we don't require); + *1.2. Else set the child output attributes to `queryOutput`. + * 2. Map the `queryQutput` to view output by index, if the corresponding attributes don't match, + *try to up cast and alias the attribute in `queryOutput` to the attribute in the view output. + * 3. Add a Project over the child, with the new output generated by the previous steps. + * If the view output doesn't have the same number of columns neither with the child output, nor + * with the query column names, throw an AnalysisException. + * + * This should be only done after the batch of Resolution, because the view attributes are not + * completely resolved during the batch of Resolution. */ case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { -case v @ View(_, output, child) if child.resolved => +case v @ View(desc, output, child) if child.resolved && output != child.output => val resolver = conf.resolver - val newOutput = output.map { attr => -val originAttr = findAttributeByName(attr.name, child.output, resolver) -// The dataType of the output attributes may be not the same with that of the view output, -// so we should cast the attribute to the dataType of the view output attribute. If the -// cast can't perform, will throw an AnalysisException. -Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = attr.exprId, - qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata)) + val queryColumnNames = desc.viewQueryColumnNames + val queryOutput = if (queryColumnNames.nonEmpty) { +// If the view output doesn't have the same number of columns with the query column names, +// throw an AnalysisException. +if (output.length != queryColumnNames.length) { + throw new AnalysisException( +s"The view output ${output.mkString("[", ",", "]")} doesn't have the same number of " + + s"columns with the query column names ${queryColumnNames.mkString("[", ",", "]")}") +} +desc.viewQueryColumnNames.map { colName => + findAttributeByName(colName, child.output, resolver) +} + } else { +// For view created before Spark 2.2.0, the view text is already fully qualified, the plan +// output is the same with the view output. 
+child.output + } + // Map the attributes in the query output to the attributes in the view output by index. + val newOutput = output.zip(queryOutput).map { --- End diff -- Seems we need to check the size of `output` and `queryOutput`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
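[Editor's note] A small sketch of the size check requested here, mirroring the error already raised when the view output and `queryColumnNames` disagree in the diff. The exact message wording is an assumption.

```scala
// Guard the zip with an explicit size check, otherwise mismatched schemas would be
// silently truncated instead of failing analysis.
if (output.length != queryOutput.length) {
  throw new AnalysisException(
    s"The view output ${output.mkString("[", ",", "]")} doesn't have the same number of " +
      s"columns as the query output ${queryOutput.mkString("[", ",", "]")}")
}
val newOutput = output.zip(queryOutput).map { case (attr, originAttr) =>
  // Up-cast and alias each query attribute to the corresponding view output attribute,
  // as the surrounding rule already does.
  Alias(Cast(originAttr, attr.dataType), attr.name)(
    exprId = attr.exprId, qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata))
}
```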
[GitHub] spark issue #16597: [SPARK-19240][SQL] SET LOCATION is not allowed for table...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16597 I am not sure if it is worth breaking this behavior. If the table is a managed table, it is possible that existing behavior allows users to move a table from one managed place to another managed place (e.g. the location of a database is changed). It is not clear that breaking this behavior can give us any benefit.
[GitHub] spark issue #16568: [SPARK-18971][Core]Upgrade Netty to 4.0.43.Final
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16568 test this please
[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16233 @jiangxb1987 Once jira is back, let's create jiras to address follow-up issues (probably you have already done that before jira went down).
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95896683 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the batch of Resolution, because the view attributes are + * not completely resolved during the batch of Resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = output.map { attr => +val originAttr = findAttributeByName(attr.name, child.output, resolver) +// The dataType of the output attributes may be not the same with that of the view output, +// so we should cast the attribute to the dataType of the view output attribute. If the +// cast can't perform, will throw an AnalysisException. +Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = attr.exprId, + qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } --- End diff -- btw, I talked with @hvanhovell, the tricky case is when the view definition query is `select *`. So, we need first result the query and put a mapping from the view column name to the query output. Then, when we read the view back, even the column ordering of `select *` query is changed, we can still get the correct view column ordering. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
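[Editor's note] A hypothetical spark-shell session illustrating the `SELECT *` case described in this comment; table and view names are made up and not from the PR.

```scala
// Hypothetical session: why the view's query output must be recorded at creation time.
spark.sql("CREATE TABLE base (a INT, b INT, c INT)")
spark.sql("CREATE VIEW v (x, y, z) AS SELECT * FROM base")   // records query output a, b, c

// Later the underlying table is re-created with a different column order.
spark.sql("DROP TABLE base")
spark.sql("CREATE TABLE base (c INT, a INT, b INT)")

// When v is re-resolved from its stored text "SELECT * FROM base", only the saved
// view-column-name -> query-output mapping (x -> a, y -> b, z -> c) lets the analyzer
// keep the original meaning of x, y, z instead of following the new physical order.
spark.table("v").printSchema()
```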
[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16233 left https://github.com/apache/spark/pull/16233/files#r95669988 and https://github.com/apache/spark/pull/16233/files#r95662299. I think we need to address them before we switch the code path to this new approach. But, we can still merge this patch and address those in follow-up prs.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95669988 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the batch of Resolution, because the view attributes are + * not completely resolved during the batch of Resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = output.map { attr => +val originAttr = findAttributeByName(attr.name, child.output, resolver) +// The dataType of the output attributes may be not the same with that of the view output, +// so we should cast the attribute to the dataType of the view output attribute. If the +// cast can't perform, will throw an AnalysisException. +Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = attr.exprId, + qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata)) --- End diff -- If the data type does not match, I feel we should throw an exception instead of just adding the cast at here. When the data type is changed, the meaning of the column may be changed. So, the user who defines the view may need to make some actions like modifying the applications using the view. I feel throwing an exception is better because it lets users know that the column data type is changed and some actions are required. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
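[Editor's note] A sketch of the alternative this comment proposes: reject an incompatible column type during analysis instead of hiding it behind a cast. The error message and the use of `sameType` (type comparison ignoring nullability) are assumptions, not code from the PR.

```scala
val newOutput = output.map { attr =>
  val originAttr = findAttributeByName(attr.name, child.output, resolver)
  if (!originAttr.dataType.sameType(attr.dataType)) {
    // The underlying query no longer produces the type recorded for this view column,
    // so tell the user to re-create the view rather than casting implicitly.
    throw new AnalysisException(
      s"The data type of view column '${attr.name}' (${attr.dataType.simpleString}) is " +
        s"different from that of the underlying query column " +
        s"(${originAttr.dataType.simpleString}). Please re-create the view.")
  }
  Alias(originAttr, attr.name)(
    exprId = attr.exprId, qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata))
}
```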
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95662299 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the batch of Resolution, because the view attributes are + * not completely resolved during the batch of Resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = output.map { attr => +val originAttr = findAttributeByName(attr.name, child.output, resolver) +// The dataType of the output attributes may be not the same with that of the view output, +// so we should cast the attribute to the dataType of the view output attribute. If the +// cast can't perform, will throw an AnalysisException. +Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = attr.exprId, + qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } --- End diff -- Sorry, I may miss something obvious. But, I am still not sure it is the right thing to do. I tried the following in the postgres ``` yhuai=# create table testbase (a int, b int, c int); CREATE TABLE yhuai=# insert into testbase values (1, 2, 3) ; INSERT 0 1 yhuai=# insert into testbase values (4, 5, 6); INSERT 0 1 yhuai=# create view testview (b, c, a) as select a, b, c from testbase; CREATE VIEW yhuai=# select * from testview; b | c | a ---+---+--- 1 | 2 | 3 4 | 5 | 6 (2 rows) yhuai=# create view testview1 (b, c, a) as select * from testbase; CREATE VIEW yhuai=# select * from testview1; b | c | a ---+---+--- 1 | 2 | 3 4 | 5 | 6 (2 rows) ``` I am not sure why we are resolving those columns by name. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark issue #16521: [SPARK-19139][core] New auth mechanism for transport lib...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16521 @vanzin I have not reviewed this PR yet. Just have two high-level questions. Is there any change to existing behaviors and settings (compared with Spark 2.1)? Also, does our doc have enough content to explain how to set those confs and how they work? Thanks!
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95427571 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`ã`view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`ã`view2`ã`view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. +// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1 +val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase, + nestedViewLevel = nestedViewLevel) +// Resolve all the UnresolvedRelations and Views in the child. 
+val newChild = AnalysisContext.withAnalysisContext(context) { + execute(child) +} +view.copy(child = newChild) + case p @ SubqueryAlias(_, view: View, _) => +val newChild = resolveRelation(view, defaultDatabase) +p.copy(child = newChild) + case _ => plan +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case u: UnresolvedRelation => resolveRelation(u) +} + +// Look up the table with the given name from catalog. The database we look up the table from +// is decided follow the steps: +// 1. If the database part is defined in the table identifier, use that database name; +// 2. Else If the defaultDatabase is defined, use the default database name(In this case, no +//temporary objects can be used, and the default database is only used to look up a view); +// 3. Else use the currentDb of the SessionCatalog. +private def lookupTableFromCatalog( +u: UnresolvedRelation, +defaultDatabase: Option[String] = None): LogicalPlan = { try { -catalog.lookupRelation(u.tableIdentifier, u.alias) +val tableIdentWithDb = u.tableIdentifier.withDataba
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95427132 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`ã`view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`ã`view2`ã`view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. +// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // The view's child should be a logical plan parsed from the `desc.viewText`, the variable + // `viewText` should be defined, or else we throw an error on the generation of the View + // operator. + case view @ View(desc, _, child) if !child.resolved => +val nestedViewLevel = AnalysisContext.get.nestedViewLevel + 1 +val context = AnalysisContext(defaultDatabase = desc.viewDefaultDatabase, + nestedViewLevel = nestedViewLevel) +// Resolve all the UnresolvedRelations and Views in the child. 
+val newChild = AnalysisContext.withAnalysisContext(context) { + execute(child) +} +view.copy(child = newChild) + case p @ SubqueryAlias(_, view: View, _) => +val newChild = resolveRelation(view, defaultDatabase) +p.copy(child = newChild) + case _ => plan +} + +def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { + case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved => +i.copy(table = EliminateSubqueryAliases(lookupTableFromCatalog(u))) + case u: UnresolvedRelation => resolveRelation(u) +} + +// Look up the table with the given name from catalog. The database we look up the table from +// is decided follow the steps: +// 1. If the database part is defined in the table identifier, use that database name; +// 2. Else If the defaultDatabase is defined, use the default database name(In this case, no +//temporary objects can be used, and the default database is only used to look up a view); +// 3. Else use the currentDb of the SessionCatalog. +private def lookupTableFromCatalog( +u: UnresolvedRelation, +defaultDatabase: Option[String] = None): LogicalPlan = { try { -catalog.lookupRelation(u.tableIdentifier, u.alias) +val tableIdentWithDb = u.tableIdentifier.withDataba
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95426835 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +542,90 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`ã`view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`ã`view2`ã`view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. +// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) => +val defaultDatabase = AnalysisContext.get.defaultDatabase +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) --- End diff -- We are calling resolveRelation again because `relation` may be a view? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95421315 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -767,19 +857,19 @@ class Analyzer( } } - /** - * In many dialects of SQL it is valid to use ordinal positions in order/sort by and group by - * clauses. This rule is to convert ordinal positions to the corresponding expressions in the - * select list. This support is introduced in Spark 2.0. - * - * - When the sort references or group by expressions are not integer but foldable expressions, - * just ignore them. - * - When spark.sql.orderByOrdinal/spark.sql.groupByOrdinal is set to false, ignore the position - * numbers too. - * - * Before the release of Spark 2.0, the literals in order/sort by and group by clauses - * have no effect on the results. - */ + /** + * In many dialects of SQL it is valid to use ordinal positions in order/sort by and group by + * clauses. This rule is to convert ordinal positions to the corresponding expressions in the + * select list. This support is introduced in Spark 2.0. + * + * - When the sort references or group by expressions are not integer but foldable expressions, + * just ignore them. + * - When spark.sql.orderByOrdinal/spark.sql.groupByOrdinal is set to false, ignore the position + * numbers too. + * + * Before the release of Spark 2.0, the literals in order/sort by and group by clauses + * have no effect on the results. + */ --- End diff -- what's going on with these lines? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95425724 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala --- @@ -619,7 +642,11 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat var table = inputTable -if (table.tableType != VIEW) { +if (table.tableType == VIEW) { + // Read view default database from table properties. + val viewDefaultDatabase = table.properties.get(VIEW_DEFAULT_DATABASE) --- End diff -- We may not have VIEW_DEFAULT_DATABASE in the properties for views that are defined in older versions of spark, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
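On the backward-compatibility question above, the essential point is that the property lookup has to stay optional. A minimal sketch, assuming `table` is the CatalogTable being restored; the property key string is written out as an assumption rather than quoting the constant from the PR:

    // Views written by older Spark versions simply do not carry the property, so the
    // read must produce None instead of failing; the fallback is the fully qualified
    // relation names already stored in the view text.
    val VIEW_DEFAULT_DATABASE = "view.default.database"   // assumed key
    val viewDefaultDatabase: Option[String] = table.properties.get(VIEW_DEFAULT_DATABASE)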
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95422202 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) --- End diff -- When will newAttr and attr have different data types? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95424456 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " + +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { --- End diff -- Since this function is used once, I'd inline it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
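A sketch of what the suggested inlining could look like, following the code quoted in the diff above (the error message text is an assumption, since the original body of checkDataType is cut off):

    val newOutput = child.output.map { attr =>
      val newAttr = findAttributeByName(attr.name, output, resolver)
      // Inlined type check: fail analysis if the child's type cannot be cast to the
      // view's declared type for this column.
      if (!Cast.canCast(attr.dataType, newAttr.dataType)) {
        throw new AnalysisException(
          s"Cannot cast ${attr.name} from ${attr.dataType} to ${newAttr.dataType}")
      }
      Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier,
        explicitMetadata = Some(newAttr.metadata))
    }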
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95424381 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => --- End diff -- Is there any chance that output and child have different sizes? Also, it looks like we are trying to match columns by name; can you explain the reason? Why are we not matching columns by position? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r95424897 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import org.apache.spark.sql.AnalysisException +import org.apache.spark.sql.catalyst.CatalystConf +import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, View} +import org.apache.spark.sql.catalyst.rules.Rule + +/** + * This file defines analysis rules related to views. + */ + +/** + * Make sure that a view's child plan produces the view's output attributes. We wrap the child + * with a Project and add an alias for each output attribute. The attributes are resolved by + * name. This should be only done after the resolution batch, because the view attributes are + * not stable during resolution. + */ +case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case v @ View(_, output, child) if child.resolved => + val resolver = conf.resolver + val newOutput = child.output.map { attr => +val newAttr = findAttributeByName(attr.name, output, resolver) +// Check the dataType of the output attributes, throw an AnalysisException if they don't +// match up. +checkDataType(attr, newAttr) +Alias(attr, attr.name)(exprId = newAttr.exprId, qualifier = newAttr.qualifier, + explicitMetadata = Some(newAttr.metadata)) + } + v.copy(child = Project(newOutput, child)) + } + + /** + * Find the attribute that has the expected attribute name from an attribute list, the names + * are compared using conf.resolver. + * If the expected attribute is not found, throw an AnalysisException. + */ + private def findAttributeByName( + name: String, + attrs: Seq[Attribute], + resolver: Resolver): Attribute = { +attrs.collectFirst { + case attr if resolver(attr.name, name) => attr +}.getOrElse(throw new AnalysisException( + s"Attribute with name '$name' is not found in " + +s"'${attrs.map(_.name).mkString("(", ",", ")")}'")) + } + + /** + * Check whether the dataType of `attr` could be casted to that of `other`, throw an + * AnalysisException if the both attributes don't match up. + */ + private def checkDataType(attr: Attribute, other: Attribute): Unit = { +if (!Cast.canCast(attr.dataType, other.dataType)) { --- End diff -- btw, I do not see where we actually do the cast. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
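To make the question above concrete: the rule only verifies Cast.canCast and then aliases the original attribute, so if the child ever produced a different (but castable) type, the alias would have to wrap the attribute in an explicit Cast to keep the view's declared schema. An illustrative fragment, not code from the PR:

    // Hypothetical variant that actually performs the cast the check allows for.
    Alias(Cast(attr, newAttr.dataType), attr.name)(
      exprId = newAttr.exprId,
      qualifier = newAttr.qualifier,
      explicitMetadata = Some(newAttr.metadata))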
[GitHub] spark issue #16487: [SPARK-19107][SQL] support creating hive table with Data...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16487 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16296 Merged to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables
Repository: spark Updated Branches: refs/heads/master f5d18af6a -> cca945b6a [SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables ## What changes were proposed in this pull request? Today we have different syntax to create data source or hive serde tables, we should unify them to not confuse users and step forward to make hive a data source. Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for details. TODO(for follow-up PRs): 1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later. 2. `SHOW CREATE TABLE` should be updated to use the new syntax. 3. we should decide if we wanna change the behavior of `SET LOCATION`. ## How was this patch tested? new tests Author: Wenchen FanCloses #16296 from cloud-fan/create-table. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cca945b6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cca945b6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cca945b6 Branch: refs/heads/master Commit: cca945b6aa679e61864c1cabae91e6ae7703362e Parents: f5d18af Author: Wenchen Fan Authored: Thu Jan 5 17:40:27 2017 -0800 Committer: Yin Huai Committed: Thu Jan 5 17:40:27 2017 -0800 -- docs/sql-programming-guide.md | 60 +-- .../examples/sql/hive/JavaSparkHiveExample.java | 2 +- examples/src/main/python/sql/hive.py| 2 +- examples/src/main/r/RSparkSQLExample.R | 2 +- .../examples/sql/hive/SparkHiveExample.scala| 2 +- .../apache/spark/sql/catalyst/parser/SqlBase.g4 | 10 +- .../sql/catalyst/util/CaseInsensitiveMap.scala | 2 + .../spark/sql/execution/SparkSqlParser.scala| 57 +- .../spark/sql/execution/SparkStrategies.scala | 6 +- .../spark/sql/execution/command/ddl.scala | 6 +- .../spark/sql/execution/datasources/rules.scala | 7 +- .../apache/spark/sql/internal/HiveSerDe.scala | 84 +-- .../sql/execution/SparkSqlParserSuite.scala | 3 +- .../sql/execution/command/DDLCommandSuite.scala | 79 +++--- .../spark/sql/hive/HiveExternalCatalog.scala| 4 +- .../spark/sql/hive/HiveSessionState.scala | 1 + .../apache/spark/sql/hive/HiveStrategies.scala | 73 +++-- .../spark/sql/hive/execution/HiveOptions.scala | 102 ++ .../spark/sql/hive/orc/OrcFileOperator.scala| 2 +- .../spark/sql/hive/HiveDDLCommandSuite.scala| 107 +-- .../sql/hive/HiveExternalCatalogSuite.scala | 2 +- .../sql/hive/MetastoreDataSourcesSuite.scala| 15 --- .../spark/sql/hive/execution/HiveDDLSuite.scala | 39 +++ .../sql/hive/orc/OrcHadoopFsRelationSuite.scala | 2 - 24 files changed, 526 insertions(+), 143 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cca945b6/docs/sql-programming-guide.md -- diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index 4cd21ae..0f6e344 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -522,14 +522,11 @@ Hive metastore. Persistent tables will still exist even after your Spark program long as you maintain your connection to the same metastore. A DataFrame for a persistent table can be created by calling the `table` method on a `SparkSession` with the name of the table. -By default `saveAsTable` will create a "managed table", meaning that the location of the data will -be controlled by the metastore. Managed tables will also have their data deleted automatically -when a table is dropped. - -Currently, `saveAsTable` does not expose an API supporting the creation of an "external table" from a `DataFrame`. 
-However, this functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key -and location of the external table as its value (a string) when saving the table with `saveAsTable`. When an External table -is dropped only its metadata is removed. +For file-based data source, e.g. text, parquet, json, etc. you can specify a custom table path via the +`path` option, e.g. `df.write.option("path", "/some/path").saveAsTable("t")`. When the table is dropped, +the custom table path will not be removed and the table data is still there. If no custom table path is +specifed, Spark will write data to a default table path under the warehouse directory. When the table is +dropped, the default table path will be removed too. Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits: @@ -954,6 +951,53 @@ adds
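The external-table behavior described in the updated guide comes down to whether a custom path is supplied. A short example of both cases, assuming an existing DataFrame `df` and illustrative table names:

    // Custom path: dropping the table later removes only the metadata, the files stay.
    df.write.option("path", "/some/path").saveAsTable("t_external")

    // No path: data goes to a default location under the warehouse directory and is
    // removed together with the table on DROP TABLE.
    df.write.saveAsTable("t_managed")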
[GitHub] spark issue #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for data so...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16296 LGTM. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94884267 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/HiveSerDe.scala --- @@ -77,4 +79,16 @@ object HiveSerDe { serdeMap.get(key) } + + def getDefaultStorage(conf: SQLConf): CatalogStorageFormat = { +val defaultStorageType = conf.getConfString("hive.default.fileformat", "textfile") +val defaultHiveSerde = sourceToSerDe(defaultStorageType) +CatalogStorageFormat.empty.copy( + inputFormat = defaultHiveSerde.flatMap(_.inputFormat) +.orElse(Some("org.apache.hadoop.mapred.TextInputFormat")), + outputFormat = defaultHiveSerde.flatMap(_.outputFormat) + .orElse(Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")), + serde = defaultHiveSerde.flatMap(_.serde) + .orElse(Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"))) --- End diff -- Does this version break any test? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
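For context on what getDefaultStorage reads: the default serde for new Hive serde tables follows hive.default.fileformat, falling back to the text format classes shown above when the setting is absent or unrecognized. A small usage sketch; whether a session-level SET is honored for this particular lookup is an assumption based on the conf.getConfString call in the diff, and the table name is illustrative.

    // With no setting (or "textfile") the text-format defaults above apply; changing
    // the conf changes the serde picked for subsequently created Hive serde tables.
    spark.sql("SET hive.default.fileformat=parquet")
    spark.sql("CREATE TABLE t_default_fmt(id INT) USING hive")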
[GitHub] spark issue #16479: [SPARK-19085][SQL] cleanup OutputWriterFactory and Outpu...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16479 What is the benefit of making these changes? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[1/3] spark-website git commit: Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted
Repository: spark-website Updated Branches: refs/heads/asf-site 426a68ba8 -> 46a7a8027 http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/screencasts/1-first-steps-with-spark.html -- diff --git a/site/screencasts/1-first-steps-with-spark.html b/site/screencasts/1-first-steps-with-spark.html index ac30748..8bcd8bb 100644 --- a/site/screencasts/1-first-steps-with-spark.html +++ b/site/screencasts/1-first-steps-with-spark.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/screencasts/2-spark-documentation-overview.html -- diff --git a/site/screencasts/2-spark-documentation-overview.html b/site/screencasts/2-spark-documentation-overview.html index b331b25..6d8f46e 100644 --- a/site/screencasts/2-spark-documentation-overview.html +++ b/site/screencasts/2-spark-documentation-overview.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/screencasts/3-transformations-and-caching.html -- diff --git a/site/screencasts/3-transformations-and-caching.html b/site/screencasts/3-transformations-and-caching.html index 7ab50f5..f7aa6b4 100644 --- a/site/screencasts/3-transformations-and-caching.html +++ b/site/screencasts/3-transformations-and-caching.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/screencasts/4-a-standalone-job-in-spark.html -- diff --git a/site/screencasts/4-a-standalone-job-in-spark.html b/site/screencasts/4-a-standalone-job-in-spark.html index 35cf6f0..d6cb311 100644 --- a/site/screencasts/4-a-standalone-job-in-spark.html +++ b/site/screencasts/4-a-standalone-job-in-spark.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/screencasts/index.html -- diff --git a/site/screencasts/index.html b/site/screencasts/index.html index bd9d33e..df951b8 100644 --- a/site/screencasts/index.html +++ b/site/screencasts/index.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/sitemap.xml -- diff --git a/site/sitemap.xml b/site/sitemap.xml index 47ed71f..1ed4c74 100644 --- a/site/sitemap.xml +++ b/site/sitemap.xml @@ -139,6 +139,10 @@ + http://spark.apache.org/news/spark-summit-east-2017-agenda-posted.html + weekly + + http://spark.apache.org/releases/spark-release-2-1-0.html weekly 
http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/sql/index.html -- diff --git a/site/sql/index.html
[3/3] spark-website git commit: Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted
Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/46a7a802 Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/46a7a802 Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/46a7a802 Branch: refs/heads/asf-site Commit: 46a7a802762fa2428265b422170821fc3fec3563 Parents: 426a68b Author: Yin HuaiAuthored: Wed Jan 4 18:27:47 2017 -0800 Committer: Yin Huai Committed: Wed Jan 4 18:27:47 2017 -0800 -- ...1-04-spark-summit-east-2017-agenda-posted.md | 15 ++ site/committers.html| 6 +- site/community.html | 6 +- site/contributing.html | 6 +- site/developer-tools.html | 6 +- site/documentation.html | 6 +- site/downloads.html | 6 +- site/examples.html | 6 +- site/faq.html | 6 +- site/graphx/index.html | 6 +- site/index.html | 6 +- site/mailing-lists.html | 6 +- site/mllib/index.html | 6 +- site/news/amp-camp-2013-registration-ope.html | 6 +- .../news/announcing-the-first-spark-summit.html | 6 +- .../news/fourth-spark-screencast-published.html | 6 +- site/news/index.html| 15 +- site/news/nsdi-paper.html | 6 +- site/news/one-month-to-spark-summit-2015.html | 6 +- .../proposals-open-for-spark-summit-east.html | 6 +- ...registration-open-for-spark-summit-east.html | 6 +- .../news/run-spark-and-shark-on-amazon-emr.html | 6 +- site/news/spark-0-6-1-and-0-5-2-released.html | 6 +- site/news/spark-0-6-2-released.html | 6 +- site/news/spark-0-7-0-released.html | 6 +- site/news/spark-0-7-2-released.html | 6 +- site/news/spark-0-7-3-released.html | 6 +- site/news/spark-0-8-0-released.html | 6 +- site/news/spark-0-8-1-released.html | 6 +- site/news/spark-0-9-0-released.html | 6 +- site/news/spark-0-9-1-released.html | 6 +- site/news/spark-0-9-2-released.html | 6 +- site/news/spark-1-0-0-released.html | 6 +- site/news/spark-1-0-1-released.html | 6 +- site/news/spark-1-0-2-released.html | 6 +- site/news/spark-1-1-0-released.html | 6 +- site/news/spark-1-1-1-released.html | 6 +- site/news/spark-1-2-0-released.html | 6 +- site/news/spark-1-2-1-released.html | 6 +- site/news/spark-1-2-2-released.html | 6 +- site/news/spark-1-3-0-released.html | 6 +- site/news/spark-1-4-0-released.html | 6 +- site/news/spark-1-4-1-released.html | 6 +- site/news/spark-1-5-0-released.html | 6 +- site/news/spark-1-5-1-released.html | 6 +- site/news/spark-1-5-2-released.html | 6 +- site/news/spark-1-6-0-released.html | 6 +- site/news/spark-1-6-1-released.html | 6 +- site/news/spark-1-6-2-released.html | 6 +- site/news/spark-1-6-3-released.html | 6 +- site/news/spark-2-0-0-released.html | 6 +- site/news/spark-2-0-1-released.html | 6 +- site/news/spark-2-0-2-released.html | 6 +- site/news/spark-2-1-0-released.html | 6 +- site/news/spark-2.0.0-preview.html | 6 +- .../spark-accepted-into-apache-incubator.html | 6 +- site/news/spark-and-shark-in-the-news.html | 6 +- site/news/spark-becomes-tlp.html| 6 +- site/news/spark-featured-in-wired.html | 6 +- .../spark-mailing-lists-moving-to-apache.html | 6 +- site/news/spark-meetups.html| 6 +- site/news/spark-screencasts-published.html | 6 +- site/news/spark-summit-2013-is-a-wrap.html | 6 +- site/news/spark-summit-2014-videos-posted.html | 6 +- site/news/spark-summit-2015-videos-posted.html | 6 +- site/news/spark-summit-agenda-posted.html | 6 +- .../spark-summit-east-2015-videos-posted.html | 6 +- .../spark-summit-east-2016-cfp-closing.html | 6 +- .../spark-summit-east-2017-agenda-posted.html | 220 +++ 
site/news/spark-summit-east-agenda-posted.html | 6 +- .../news/spark-summit-europe-agenda-posted.html | 6 +- site/news/spark-summit-europe.html | 6 +- .../spark-summit-june-2016-agenda-posted.html | 6 +- site/news/spark-tips-from-quantifind.html | 6 +-
[2/3] spark-website git commit: Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted
http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/news/spark-summit-2014-videos-posted.html -- diff --git a/site/news/spark-summit-2014-videos-posted.html b/site/news/spark-summit-2014-videos-posted.html index 3efd1db..03cbcd3 100644 --- a/site/news/spark-summit-2014-videos-posted.html +++ b/site/news/spark-summit-2014-videos-posted.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/news/spark-summit-2015-videos-posted.html -- diff --git a/site/news/spark-summit-2015-videos-posted.html b/site/news/spark-summit-2015-videos-posted.html index 8aed6ba..1a93256 100644 --- a/site/news/spark-summit-2015-videos-posted.html +++ b/site/news/spark-summit-2015-videos-posted.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/news/spark-summit-agenda-posted.html -- diff --git a/site/news/spark-summit-agenda-posted.html b/site/news/spark-summit-agenda-posted.html index 2697ece..354035f 100644 --- a/site/news/spark-summit-agenda-posted.html +++ b/site/news/spark-summit-agenda-posted.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/news/spark-summit-east-2015-videos-posted.html -- diff --git a/site/news/spark-summit-east-2015-videos-posted.html b/site/news/spark-summit-east-2015-videos-posted.html index 84771e8..962aa1e 100644 --- a/site/news/spark-summit-east-2015-videos-posted.html +++ b/site/news/spark-summit-east-2015-videos-posted.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/news/spark-summit-east-2016-cfp-closing.html -- diff --git a/site/news/spark-summit-east-2016-cfp-closing.html b/site/news/spark-summit-east-2016-cfp-closing.html index 45e6385..cc43c32 100644 --- a/site/news/spark-summit-east-2016-cfp-closing.html +++ b/site/news/spark-summit-east-2016-cfp-closing.html @@ -159,6 +159,9 @@ Latest News + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted + (Jan 04, 2017) + Spark 2.1.0 released (Dec 28, 2016) @@ -168,9 +171,6 @@ Spark 2.0.2 released (Nov 14, 2016) - Spark 1.6.3 released - (Nov 07, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/46a7a802/site/news/spark-summit-east-2017-agenda-posted.html -- diff --git a/site/news/spark-summit-east-2017-agenda-posted.html b/site/news/spark-summit-east-2017-agenda-posted.html new file mode 100644 index 000..58af016 --- /dev/null +++ b/site/news/spark-summit-east-2017-agenda-posted.html @@ -0,0 +1,220 @@ + + + + + + + + + Spark Summit East (Feb 7-9th, 2017, Boston) agenda posted | Apache Spark + 
+ + + + + + + + + + + + + + + + var _gaq = _gaq || []; + _gaq.push(['_setAccount', 'UA-32518208-2']); +
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94699866 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -18,14 +18,79 @@ package org.apache.spark.sql.hive import org.apache.spark.sql._ +import org.apache.spark.sql.catalyst.catalog.CatalogStorageFormat import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.planning._ import org.apache.spark.sql.catalyst.plans._ import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule import org.apache.spark.sql.execution._ -import org.apache.spark.sql.execution.command.ExecutedCommandExec +import org.apache.spark.sql.execution.command.{DDLUtils, ExecutedCommandExec} import org.apache.spark.sql.execution.datasources.CreateTable import org.apache.spark.sql.hive.execution._ +import org.apache.spark.sql.internal.{HiveSerDe, SQLConf} + + +/** + * Determine the serde/format of the Hive serde table, according to the storage properties. + */ +class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case c @ CreateTable(t, _, _) if DDLUtils.isHiveTable(t) && t.storage.inputFormat.isEmpty => + if (t.bucketSpec.nonEmpty) { +throw new AnalysisException("Cannot create bucketed Hive serde table.") + } + + val defaultStorage: CatalogStorageFormat = { +val defaultStorageType = conf.getConfString("hive.default.fileformat", "textfile") +val defaultHiveSerde = HiveSerDe.sourceToSerDe(defaultStorageType) +CatalogStorageFormat( + locationUri = None, + inputFormat = defaultHiveSerde.flatMap(_.inputFormat) +.orElse(Some("org.apache.hadoop.mapred.TextInputFormat")), + outputFormat = defaultHiveSerde.flatMap(_.outputFormat) + .orElse(Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")), + serde = defaultHiveSerde.flatMap(_.serde) + .orElse(Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")), + compressed = false, + properties = Map()) + } + + val options = new HiveOptions(t.storage.properties) + + val fileStorage = if (options.format.isDefined) { +HiveSerDe.sourceToSerDe(options.format.get) match { + case Some(s) => +CatalogStorageFormat.empty.copy( + inputFormat = s.inputFormat, + outputFormat = s.outputFormat, + serde = s.serde) + case None => +throw new IllegalArgumentException(s"invalid format: '${options.format.get}'") +} + } else if (options.inputFormat.isDefined) { --- End diff -- Maybe we should use a helper function to know if inputFormat and outputFormat are set? The current version assumes that the reader knows the internals of HiveOptions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
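One way to address the readability comment above is to let HiveOptions expose the check itself, so callers need not know that inputFormat and outputFormat always come as a pair. A simplified sketch of the suggested helper, not the merged code; case-insensitive key handling and the other options are omitted:

    class HiveOptions(parameters: Map[String, String]) extends Serializable {
      val inputFormat: Option[String] = parameters.get("inputFormat")
      val outputFormat: Option[String] = parameters.get("outputFormat")

      require(inputFormat.isDefined == outputFormat.isDefined,
        "Cannot specify only inputFormat or outputFormat, you have to specify both of them.")

      // Callers such as DetermineHiveSerde can branch on this instead of poking at
      // the individual fields.
      def hasInputOutputFormat: Boolean = inputFormat.isDefined
    }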
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94700062 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala --- @@ -592,4 +597,77 @@ class HiveDDLCommandSuite extends PlanTest with SQLTestUtils with TestHiveSingle val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client assert(hiveClient.getConf("hive.in.test", "") == "true") } + + test("create hive serde table with new syntax - basic") { +val sql = + """ +|CREATE TABLE t +|(id int, name string COMMENT 'blabla') +|USING hive +|OPTIONS (format 'parquet', my_prop 1) +|LOCATION '/tmp/file' +|COMMENT 'BLABLA' + """.stripMargin + +val table = analyzeCreateTable(sql) +assert(table.schema == new StructType() + .add("id", "int") + .add("name", "string", nullable = true, comment = "blabla")) +assert(table.provider == Some(DDLUtils.HIVE_PROVIDER)) +assert(table.storage.locationUri == Some("/tmp/file")) +assert(table.storage.properties == Map("my_prop" -> "1")) +assert(table.comment == Some("BLABLA")) + +assert(table.storage.inputFormat == + Some("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat")) +assert(table.storage.outputFormat == + Some("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat")) +assert(table.storage.serde == + Some("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")) + } + + test("create hive serde table with new syntax - with partition and bucketing") { +val v1 = "CREATE TABLE t (c1 int, c2 int) USING hive PARTITIONED BY (c2)" +val table = analyzeCreateTable(v1) +assert(table.schema == new StructType().add("c1", "int").add("c2", "int")) +assert(table.partitionColumnNames == Seq("c2")) +// check the default formats +assert(table.storage.serde == Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")) +assert(table.storage.inputFormat == Some("org.apache.hadoop.mapred.TextInputFormat")) +assert(table.storage.outputFormat == + Some("org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat")) + +val v2 = "CREATE TABLE t (c1 int, c2 int) USING hive CLUSTERED BY (c2) INTO 4 BUCKETS" +val e = intercept[AnalysisException](analyzeCreateTable(v2)) +assert(e.message.contains("Cannot create bucketed Hive serde table")) --- End diff -- Let's also have a test using both partitioning and bucketing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
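A sketch of the extra case the comment asks for, following the style of the surrounding test code quoted above (the expected message is assumed to be the same one checked for the bucketing-only case):

    // Bucketing should be rejected even when partitioning is also present.
    val v3 = "CREATE TABLE t (c1 int, c2 int, c3 int) USING hive " +
      "PARTITIONED BY (c3) CLUSTERED BY (c2) INTO 4 BUCKETS"
    val e2 = intercept[AnalysisException](analyzeCreateTable(v3))
    assert(e2.message.contains("Cannot create bucketed Hive serde table"))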
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94698030 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -342,42 +342,46 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder { } /** - * Create a data source table, returning a [[CreateTable]] logical plan. + * Create a table, returning a [[CreateTable]] logical plan. * * Expected format: * {{{ - * CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name + * CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name * USING table_provider * [OPTIONS table_property_list] * [PARTITIONED BY (col_name, col_name, ...)] * [CLUSTERED BY (col_name, col_name, ...) *[SORTED BY (col_name [ASC|DESC], ...)] *INTO num_buckets BUCKETS * ] + * [LOCATION path] + * [COMMENT table_comment] * [AS select_statement]; * }}} */ - override def visitCreateTableUsing(ctx: CreateTableUsingContext): LogicalPlan = withOrigin(ctx) { + override def visitCreateTable(ctx: CreateTableContext): LogicalPlan = withOrigin(ctx) { val (table, temp, ifNotExists, external) = visitCreateTableHeader(ctx.createTableHeader) if (external) { operationNotAllowed("CREATE EXTERNAL TABLE ... USING", ctx) } -val options = Option(ctx.tablePropertyList).map(visitPropertyKeyValues).getOrElse(Map.empty) +val options = Option(ctx.options).map(visitPropertyKeyValues).getOrElse(Map.empty) val provider = ctx.tableProvider.qualifiedName.getText -if (provider.toLowerCase == DDLUtils.HIVE_PROVIDER) { - throw new AnalysisException("Cannot create hive serde table with CREATE TABLE USING") -} val schema = Option(ctx.colTypeList()).map(createSchema) val partitionColumnNames = Option(ctx.partitionColumnNames) .map(visitIdentifierList(_).toArray) .getOrElse(Array.empty[String]) val bucketSpec = Option(ctx.bucketSpec()).map(visitBucketSpec) -// TODO: this may be wrong for non file-based data source like JDBC, which should be external -// even there is no `path` in options. We should consider allow the EXTERNAL keyword. +val location = Option(ctx.locationSpec).map(visitLocationSpec) val storage = DataSource.buildStorageFormatFromOptions(options) -val tableType = if (storage.locationUri.isDefined) { + +if (location.isDefined && storage.locationUri.isDefined) { + throw new ParseException("Cannot specify LOCATION when there is 'path' in OPTIONS.", ctx) --- End diff -- Let's be more specific here. These two approaches are the same and we only want users to use one, right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
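To spell out the redundancy that check guards against: the two statements below are equivalent ways to pin the table location, which is why giving both forms at once is rejected. Table names and paths are illustrative, and an active SparkSession `spark` is assumed.

    spark.sql("CREATE TABLE t1(id INT) USING parquet OPTIONS (path '/data/t')")
    spark.sql("CREATE TABLE t2(id INT) USING parquet LOCATION '/data/t'")

    // Combining them is ambiguous at best, so it fails with a ParseException:
    // CREATE TABLE t3(id INT) USING parquet OPTIONS (path '/data/t') LOCATION '/other/path'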
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94699288 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap + +/** + * Options for the Hive data source. + */ +class HiveOptions(@transient private val parameters: CaseInsensitiveMap) extends Serializable { + import HiveOptions._ + + def this(parameters: Map[String, String]) = this(new CaseInsensitiveMap(parameters)) + + val format = parameters.get(FORMAT).map(_.toLowerCase) --- End diff -- file format? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94699572 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap + +/** + * Options for the Hive data source. + */ +class HiveOptions(@transient private val parameters: CaseInsensitiveMap) extends Serializable { + import HiveOptions._ + + def this(parameters: Map[String, String]) = this(new CaseInsensitiveMap(parameters)) + + val format = parameters.get(FORMAT).map(_.toLowerCase) + val inputFormat = parameters.get(INPUT_FORMAT) + val outputFormat = parameters.get(OUTPUT_FORMAT) + + if (inputFormat.isDefined != outputFormat.isDefined) { +throw new IllegalArgumentException("Cannot specify only inputFormat or outputFormat, you " + + "have to specify both of them.") + } + + if (format.isDefined && inputFormat.isDefined) { +throw new IllegalArgumentException("Cannot specify format and inputFormat/outputFormat " + + "together for Hive data source.") + } + + val serde = parameters.get(SERDE) + + for (f <- format if serde.isDefined) { --- End diff -- maybe using if is easier to read? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
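The plain `if` the comment prefers would read roughly as follows; the loop body is cut off in the diff above, so only its shape is sketched here:

    if (format.isDefined && serde.isDefined) {
      val f = format.get
      // ... same validation the original `for` comprehension performs on `f` ...
    }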
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94699072 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveOptions.scala --- @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap + +/** + * Options for the Hive data source. + */ +class HiveOptions(@transient private val parameters: CaseInsensitiveMap) extends Serializable { --- End diff -- Let's also mention that DetermineHiveSerde will fill in default values based on the file format. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94700334 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1198,4 +1198,23 @@ class HiveDDLSuite assert(e.message.contains("unknown is not a valid partition column")) } } + + test("create hive serde table with new syntax") { +withTable("t", "t2") { + sql("CREATE TABLE t(id int) USING hive OPTIONS(format 'orc')") + val table = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t")) + assert(DDLUtils.isHiveTable(table)) + assert(table.storage.serde == Some("org.apache.hadoop.hive.ql.io.orc.OrcSerde")) + assert(spark.table("t").collect().isEmpty) + + sql("INSERT INTO t SELECT 1") + checkAnswer(spark.table("t"), Row(1)) + + sql("CREATE TABLE t2 USING HIVE AS SELECT 1 AS c1, 'a' AS c2") + val table2 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t2")) + assert(DDLUtils.isHiveTable(table2)) + assert(table2.storage.serde == Some("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")) + checkAnswer(spark.table("t2"), Row(1, "a")) +} + } --- End diff -- Let's also exercise partitioning. Let's also test an ORC option. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
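A sketch of the additional coverage requested above, in the style of the existing block; the ORC option name, whether it is passed through as a storage property, and the exact assertions are assumptions rather than the merged test:

    withTable("t3") {
      sql(
        """
          |CREATE TABLE t3(id int, p int)
          |USING hive
          |OPTIONS(format 'orc', orc.compress 'ZLIB')
          |PARTITIONED BY (p)
        """.stripMargin)
      val table3 = spark.sessionState.catalog.getTableMetadata(TableIdentifier("t3"))
      // The table should still be a Hive serde table, carry the partition column,
      // and pick up the ORC serde from the format option.
      assert(DDLUtils.isHiveTable(table3))
      assert(table3.partitionColumnNames == Seq("p"))
      assert(table3.storage.serde == Some("org.apache.hadoop.hive.ql.io.orc.OrcSerde"))
    }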
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94697902 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -342,42 +342,46 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder { } /** - * Create a data source table, returning a [[CreateTable]] logical plan. + * Create a table, returning a [[CreateTable]] logical plan. * * Expected format: * {{{ - * CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name + * CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name --- End diff -- oh it is a typo --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16296: [SPARK-18885][SQL] unify CREATE TABLE syntax for ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16296#discussion_r94697485 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -342,42 +342,46 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder { } /** - * Create a data source table, returning a [[CreateTable]] logical plan. + * Create a table, returning a [[CreateTable]] logical plan. * * Expected format: * {{{ - * CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name + * CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name --- End diff -- Was `external` a typo here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-19072][SQL] codegen of Literal should not output boxed value
Repository: spark
Updated Branches: refs/heads/master b67b35f76 -> cbd11d235

[SPARK-19072][SQL] codegen of Literal should not output boxed value

## What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/16402 we made a mistake that, when double/float is infinity, the `Literal` codegen will output boxed value and cause wrong result. This PR fixes this by special handling infinity to not output boxed value.

## How was this patch tested?

new regression test

Author: Wenchen Fan

Closes #16469 from cloud-fan/literal.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cbd11d23
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cbd11d23
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cbd11d23
Branch: refs/heads/master
Commit: cbd11d235752d0ab30cfdbf2351cb3e68a123606
Parents: b67b35f
Author: Wenchen Fan
Authored: Tue Jan 3 22:40:14 2017 -0800
Committer: Yin Huai
Committed: Tue Jan 3 22:40:14 2017 -0800

--
 .../sql/catalyst/expressions/literals.scala   | 30 +---
 .../catalyst/expressions/PredicateSuite.scala |  5
 2 files changed, 24 insertions(+), 11 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/cbd11d23/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
index ab45c41..cb0c4d3 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala
@@ -266,33 +266,41 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression {
   override def eval(input: InternalRow): Any = value

   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    val javaType = ctx.javaType(dataType)
     // change the isNull and primitive to consts, to inline them
     if (value == null) {
       ev.isNull = "true"
-      ev.copy(s"final ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};")
+      ev.copy(s"final $javaType ${ev.value} = ${ctx.defaultValue(dataType)};")
     } else {
       ev.isNull = "false"
-      ev.value = dataType match {
-        case BooleanType | IntegerType | DateType => value.toString
+      dataType match {
+        case BooleanType | IntegerType | DateType =>
+          ev.copy(code = "", value = value.toString)
         case FloatType =>
           val v = value.asInstanceOf[Float]
           if (v.isNaN || v.isInfinite) {
-            ctx.addReferenceMinorObj(v)
+            val boxedValue = ctx.addReferenceMinorObj(v)
+            val code = s"final $javaType ${ev.value} = ($javaType) $boxedValue;"
+            ev.copy(code = code)
           } else {
-            s"${value}f"
+            ev.copy(code = "", value = s"${value}f")
           }
         case DoubleType =>
           val v = value.asInstanceOf[Double]
           if (v.isNaN || v.isInfinite) {
-            ctx.addReferenceMinorObj(v)
+            val boxedValue = ctx.addReferenceMinorObj(v)
+            val code = s"final $javaType ${ev.value} = ($javaType) $boxedValue;"
+            ev.copy(code = code)
           } else {
-            s"${value}D"
+            ev.copy(code = "", value = s"${value}D")
           }
-        case ByteType | ShortType => s"(${ctx.javaType(dataType)})$value"
-        case TimestampType | LongType => s"${value}L"
-        case other => ctx.addReferenceMinorObj(value, ctx.javaType(dataType))
+        case ByteType | ShortType =>
+          ev.copy(code = "", value = s"($javaType)$value")
+        case TimestampType | LongType =>
+          ev.copy(code = "", value = s"${value}L")
+        case other =>
+          ev.copy(code = "", value = ctx.addReferenceMinorObj(value, ctx.javaType(dataType)))
       }
-      ev.copy("")
     }
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/cbd11d23/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala

diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala
index 6fc3de1..6fe295c 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/PredicateSuite.scala
+++
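A minimal end-to-end sketch of the behavior this commit protects. This is illustrative only, not the regression test added to PredicateSuite; the local SparkSession setup and the column name are assumptions.

    import org.apache.spark.sql.SparkSession

    object InfinityLiteralCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[2]").appName("inf-literal").getOrCreate()
        import spark.implicits._

        // The infinite double below ends up as a Literal in the generated predicate.
        val df = Seq(Double.PositiveInfinity, 1.0).toDF("v")

        // Before the fix, codegen for an infinite literal emitted the boxed reference
        // object where a primitive was expected, which could give a wrong result;
        // with the fix this filter keeps exactly the infinite row.
        df.filter($"v" === Double.PositiveInfinity).show()

        spark.stop()
      }
    }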
[GitHub] spark issue #16469: [SPARK-19072][SQL] codegen of Literal should not output ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16469 LGTM. Merging to master.
[GitHub] spark issue #15829: [SPARK-18379][SQL] Make the parallelism of parallelParti...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15829 Sure. Thanks!
spark git commit: Update known_translations for contributor names and also fix a small issue in translate-contributors.py
Repository: spark Updated Branches: refs/heads/master dba81e1dc -> 63036aee2 Update known_translations for contributor names and also fix a small issue in translate-contributors.py ## What changes were proposed in this pull request? This PR updates dev/create-release/known_translations to add more contributor name mapping. It also fixes a small issue in translate-contributors.py ## How was this patch tested? manually tested Author: Yin Huai <yh...@databricks.com> Closes #16423 from yhuai/contributors. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/63036aee Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/63036aee Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/63036aee Branch: refs/heads/master Commit: 63036aee2271cdbb7032b51b2ac67edbcb82389e Parents: dba81e1 Author: Yin Huai <yh...@databricks.com> Authored: Thu Dec 29 14:20:56 2016 -0800 Committer: Yin Huai <yh...@databricks.com> Committed: Thu Dec 29 14:20:56 2016 -0800 -- dev/create-release/known_translations| 37 +++ dev/create-release/translate-contributors.py | 4 ++- 2 files changed, 40 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/63036aee/dev/create-release/known_translations -- diff --git a/dev/create-release/known_translations b/dev/create-release/known_translations index 3563fe3..0f30990 100644 --- a/dev/create-release/known_translations +++ b/dev/create-release/known_translations @@ -165,3 +165,40 @@ stanzhai - Stan Zhai tien-dungle - Tien-Dung Le xuchenCN - Xu Chen zhangjiajin - Zhang JiaJin +ClassNotFoundExp - Fu Xing +KevinGrealish - Kevin Grealish +MasterDDT - Mitesh Patel +VinceShieh - Vincent Xie +WeichenXu123 - Weichen Xu +Yunni - Yun Ni +actuaryzhang - Wayne Zhang +alicegugu - Gu Huiqin Alice +anabranch - Bill Chambers +ashangit - Nicolas Fraison +avulanov - Alexander Ulanov +biglobster - Liang Ke +cenyuhai - Cen Yu Hai +codlife - Jianfei Wang +david-weiluo-ren - Weiluo (David) Ren +dding3 - Ding Ding +fidato13 - Tarun Kumar +frreiss - Fred Reiss +gatorsmile - Xiao Li +hayashidac - Chie Hayashida +invkrh - Hao Ren +jagadeesanas2 - Jagadeesan A S +jiangxb1987 - Jiang Xingbo +jisookim0513 - Jisoo Kim +junyangq - Junyang Qian +krishnakalyan3 - Krishna Kalyan +linbojin - Linbo Jin +mpjlu - Peng Meng +neggert - Nic Eggert +petermaxlee - Peter Lee +phalodi - Sandeep Purohit +pkch - pkch +priyankagargnitk - Priyanka Garg +sharkdtu - Sharkd Tu +shenh062326 - Shen Hong +aokolnychyi - Anton Okolnychyi +linbojin - Linbo Jin http://git-wip-us.apache.org/repos/asf/spark/blob/63036aee/dev/create-release/translate-contributors.py -- diff --git a/dev/create-release/translate-contributors.py b/dev/create-release/translate-contributors.py index 86fa02d..2cc64e4 100755 --- a/dev/create-release/translate-contributors.py +++ b/dev/create-release/translate-contributors.py @@ -147,7 +147,9 @@ print "\n== Translating contributor list === lines = contributors_file.readlines() contributions = [] for i, line in enumerate(lines): -temp_author = line.strip(" * ").split(" -- ")[0] +# It is possible that a line in the contributor file only has the github name, e.g. yhuai. +# So, we need a strip() to remove the newline. 
+    temp_author = line.strip(" * ").split(" -- ")[0].strip()
     print "Processing author %s (%d/%d)" % (temp_author, i + 1, len(lines))
     if not temp_author:
         error_msg = "ERROR: Expected the following format \" * -- \"\n"
[GitHub] spark issue #16423: Update known_translations for contributor names and also...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16423 Thanks. Merging to master.
spark-website git commit: Fix the list of previous spark summits.
Repository: spark-website Updated Branches: refs/heads/asf-site e10180e67 -> 426a68ba8 Fix the list of previous spark summits. Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/426a68ba Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/426a68ba Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/426a68ba Branch: refs/heads/asf-site Commit: 426a68ba8cd6efeaffdacaa0d2b645c5c5ac6a5e Parents: e10180e Author: Yin HuaiAuthored: Thu Dec 29 13:07:47 2016 -0800 Committer: Yin Huai Committed: Thu Dec 29 13:07:47 2016 -0800 -- community.md| 19 ++- site/community.html | 19 ++- 2 files changed, 28 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/426a68ba/community.md -- diff --git a/community.md b/community.md index d887d31..d8ae250 100644 --- a/community.md +++ b/community.md @@ -88,19 +88,28 @@ Chat rooms are great for quick questions or discussions on specialized topics. T Conferences -https://spark-summit.org/;>Spark Summit Europe 2015. Oct 27 - Oct 29 in Amsterdam. +https://spark-summit.org/eu-2016/;>Spark Summit Europe 2016. Oct 25 - 27 in Brussels. -http://spark-summit.org/2015;>Spark Summit 2015. June 15 - 17 in San Francisco. +https://spark-summit.org/2016/;>Spark Summit 2016. June 6 - 8 in San Francisco. -http://spark-summit.org/east;>Spark Summit East 2015. March 18 - 19 in New York City. +https://spark-summit.org/east-2016/;>Spark Summit East 2016. Feb 16 - 18 in New York City. -http://spark-summit.org/2014;>Spark Summit 2014. June 30 - July 1 2014 in San Francisco. +https://spark-summit.org/eu-2015/;>Spark Summit Europe 2015. Oct 27 - 29 in Amsterdam. -http://spark-summit.org/2013;>Spark Summit 2013. December 2013 in San Francisco. +https://spark-summit.org/2015;>Spark Summit 2015. June 15 - 17 in San Francisco. + + +https://spark-summit.org/east-2015/;>Spark Summit East 2015. March 18 - 19 in New York City. + + +https://spark-summit.org/2014;>Spark Summit 2014. June 30 - July 1 2014 in San Francisco. + + +https://spark-summit.org/2013;>Spark Summit 2013. December 2013 in San Francisco. http://git-wip-us.apache.org/repos/asf/spark-website/blob/426a68ba/site/community.html -- diff --git a/site/community.html b/site/community.html index 7a38701..79604fb 100644 --- a/site/community.html +++ b/site/community.html @@ -286,19 +286,28 @@ and include only a few lines of the pertinent code / log within the email. Conferences -https://spark-summit.org/;>Spark Summit Europe 2015. Oct 27 - Oct 29 in Amsterdam. +https://spark-summit.org/eu-2016/;>Spark Summit Europe 2016. Oct 25 - 27 in Brussels. -http://spark-summit.org/2015;>Spark Summit 2015. June 15 - 17 in San Francisco. +https://spark-summit.org/2016/;>Spark Summit 2016. June 6 - 8 in San Francisco. -http://spark-summit.org/east;>Spark Summit East 2015. March 18 - 19 in New York City. +https://spark-summit.org/east-2016/;>Spark Summit East 2016. Feb 16 - 18 in New York City. -http://spark-summit.org/2014;>Spark Summit 2014. June 30 - July 1 2014 in San Francisco. +https://spark-summit.org/eu-2015/;>Spark Summit Europe 2015. Oct 27 - 29 in Amsterdam. -http://spark-summit.org/2013;>Spark Summit 2013. December 2013 in San Francisco. +https://spark-summit.org/2015;>Spark Summit 2015. June 15 - 17 in San Francisco. + + +https://spark-summit.org/east-2015/;>Spark Summit East 2015. March 18 - 19 in New York City. + + +https://spark-summit.org/2014;>Spark Summit 2014. 
June 30 - July 1 2014 in San Francisco. + + +https://spark-summit.org/2013;>Spark Summit 2013. December 2013 in San Francisco. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[2/5] spark-website git commit: Update Spark website for the release of Apache Spark 2.1.0
http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/releases/spark-release-0-9-1.html -- diff --git a/site/releases/spark-release-0-9-1.html b/site/releases/spark-release-0-9-1.html index 80401c4..5b08a0b 100644 --- a/site/releases/spark-release-0-9-1.html +++ b/site/releases/spark-release-0-9-1.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -168,9 +171,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive @@ -210,9 +210,9 @@ Fixed hash collision bug in external spilling [https://issues.apache.org/jira/browse/SPARK-1113;>SPARK-1113] Fixed conflict with Sparkâs log4j for users relying on other logging backends [https://issues.apache.org/jira/browse/SPARK-1190;>SPARK-1190] Fixed Graphx missing from Spark assembly jar in maven builds - Fixed silent failures due to map output status exceeding Akka frame size [https://issues.apache.org/jira/browse/SPARK-1244;>SPARK-1244] - Removed Sparkâs unnecessary direct dependency on ASM [https://issues.apache.org/jira/browse/SPARK-782;>SPARK-782] - Removed metrics-ganglia from default build due to LGPL license conflict [https://issues.apache.org/jira/browse/SPARK-1167;>SPARK-1167] + Fixed silent failures due to map output status exceeding Akka frame size [https://issues.apache.org/jira/browse/SPARK-1244;>SPARK-1244] + Removed Sparkâs unnecessary direct dependency on ASM [https://issues.apache.org/jira/browse/SPARK-782;>SPARK-782] + Removed metrics-ganglia from default build due to LGPL license conflict [https://issues.apache.org/jira/browse/SPARK-1167;>SPARK-1167] Fixed bug in distribution tarball not containing spark assembly jar [https://issues.apache.org/jira/browse/SPARK-1184;>SPARK-1184] Fixed bug causing infinite NullPointerException failures due to a null in map output locations [https://issues.apache.org/jira/browse/SPARK-1124;>SPARK-1124] Fixed bugs in post-job cleanup of schedulerâs data structures @@ -228,7 +228,7 @@ Fixed bug making Spark application stall when YARN registration fails [https://issues.apache.org/jira/browse/SPARK-1032;>SPARK-1032] Race condition in getting HDFS delegation tokens in yarn-client mode [https://issues.apache.org/jira/browse/SPARK-1203;>SPARK-1203] Fixed bug in yarn-client mode not exiting properly [https://issues.apache.org/jira/browse/SPARK-1049;>SPARK-1049] - Fixed regression bug in ADD_JAR environment variable not correctly adding custom jars [https://issues.apache.org/jira/browse/SPARK-1089;>SPARK-1089] + Fixed regression bug in ADD_JAR environment variable not correctly adding custom jars [https://issues.apache.org/jira/browse/SPARK-1089;>SPARK-1089] Improvements to other deployment scenarios @@ -239,19 +239,19 @@ Optimizations to MLLib - Optimized memory usage of ALS [https://issues.apache.org/jira/browse/MLLIB-25;>MLLIB-25] + Optimized memory usage of ALS [https://issues.apache.org/jira/browse/MLLIB-25;>MLLIB-25] Optimized computation of YtY for implicit ALS [https://issues.apache.org/jira/browse/SPARK-1237;>SPARK-1237] Support for negative implicit input in ALS [https://issues.apache.org/jira/browse/MLLIB-22;>MLLIB-22] Setting of a random seed in ALS [https://issues.apache.org/jira/browse/SPARK-1238;>SPARK-1238] - Faster construction of features with intercept 
[https://issues.apache.org/jira/browse/SPARK-1260;>SPARK-1260] + Faster construction of features with intercept [https://issues.apache.org/jira/browse/SPARK-1260;>SPARK-1260] Check for intercept and weight in GLMâs addIntercept [https://issues.apache.org/jira/browse/SPARK-1327;>SPARK-1327] Bug fixes and better API parity for PySpark Fixed bug in Python de-pickling [https://issues.apache.org/jira/browse/SPARK-1135;>SPARK-1135] - Fixed bug in serialization of strings longer than 64K [https://issues.apache.org/jira/browse/SPARK-1043;>SPARK-1043] - Fixed bug that made jobs hang when base file is not available [https://issues.apache.org/jira/browse/SPARK-1025;>SPARK-1025] + Fixed bug in serialization of strings longer than 64K [https://issues.apache.org/jira/browse/SPARK-1043;>SPARK-1043] + Fixed bug that made jobs hang when base file is not available [https://issues.apache.org/jira/browse/SPARK-1025;>SPARK-1025] Added Missing RDD operations to PySpark - top, zip, foldByKey, repartition, coalesce, getStorageLevel, setName and
[4/5] spark-website git commit: Update Spark website for the release of Apache Spark 2.1.0
http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/mailing-lists.html -- diff --git a/site/mailing-lists.html b/site/mailing-lists.html index c113cdd..3e2334f 100644 --- a/site/mailing-lists.html +++ b/site/mailing-lists.html @@ -109,7 +109,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -162,6 +162,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -171,9 +174,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/mllib/index.html -- diff --git a/site/mllib/index.html b/site/mllib/index.html index e29228b..08f5dc4 100644 --- a/site/mllib/index.html +++ b/site/mllib/index.html @@ -109,7 +109,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -162,6 +162,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -171,9 +174,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/amp-camp-2013-registration-ope.html -- diff --git a/site/news/amp-camp-2013-registration-ope.html b/site/news/amp-camp-2013-registration-ope.html index b9d1aba..88d6d7d 100644 --- a/site/news/amp-camp-2013-registration-ope.html +++ b/site/news/amp-camp-2013-registration-ope.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -168,9 +171,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/announcing-the-first-spark-summit.html -- diff --git a/site/news/announcing-the-first-spark-summit.html b/site/news/announcing-the-first-spark-summit.html index 2215895..0c013dc 100644 --- a/site/news/announcing-the-first-spark-summit.html +++ b/site/news/announcing-the-first-spark-summit.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -168,9 +171,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/fourth-spark-screencast-published.html -- diff --git a/site/news/fourth-spark-screencast-published.html b/site/news/fourth-spark-screencast-published.html index fe28ecf..efa74d0 100644 --- a/site/news/fourth-spark-screencast-published.html +++ b/site/news/fourth-spark-screencast-published.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the
[1/5] spark-website git commit: Update Spark website for the release of Apache Spark 2.1.0
Repository: spark-website Updated Branches: refs/heads/asf-site d2bcf1854 -> e10180e67 http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/releases/spark-release-2-1-0.html -- diff --git a/site/releases/spark-release-2-1-0.html b/site/releases/spark-release-2-1-0.html new file mode 100644 index 000..53017ff --- /dev/null +++ b/site/releases/spark-release-2-1-0.html @@ -0,0 +1,370 @@ + + + + + + + + + Spark Release 2.1.0 | Apache Spark + + + + + + + + + + + + + + + + + var _gaq = _gaq || []; + _gaq.push(['_setAccount', 'UA-32518208-2']); + _gaq.push(['_trackPageview']); + (function() { +var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; +ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; +var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); + })(); + + + function trackOutboundLink(link, category, action) { +try { + _gaq.push(['_trackEvent', category , action]); +} catch(err){} + +setTimeout(function() { + document.location.href = link.href; +}, 100); + } + + + + + + + + +https://code.jquery.com/jquery.js"> +https://netdna.bootstrapcdn.com/bootstrap/3.0.3/js/bootstrap.min.js"> + + + + + + + + + + + Lightning-fast cluster computing + + + + + + + + + + Toggle navigation + + + + + + + + + + Download + + + Libraries + + + SQL and DataFrames + Spark Streaming + MLlib (machine learning) + GraphX (graph) + + Third-Party Projects + + + + + Documentation + + + Latest Release (Spark 2.1.0) + Older Versions and Other Resources + Frequently Asked Questions + + + Examples + + + Community + + + Mailing Lists Resources + Contributing to Spark + https://issues.apache.org/jira/browse/SPARK;>Issue Tracker + Powered By + Project Committers + + + + + Developers + + + Useful Developer Tools + Versioning Policy + Release Process + + + + + +http://www.apache.org/; class="dropdown-toggle" data-toggle="dropdown"> + Apache Software Foundation + + http://www.apache.org/;>Apache Homepage + http://www.apache.org/licenses/;>License + http://www.apache.org/foundation/sponsorship.html;>Sponsorship + http://www.apache.org/foundation/thanks.html;>Thanks + http://www.apache.org/security/;>Security + + + + + + + + + + + + Latest News + + + Spark 2.1.0 released + (Dec 28, 2016) + + Spark wins CloudSort Benchmark as the most efficient engine + (Nov 15, 2016) + + Spark 2.0.2 released + (Nov 14, 2016) + + Spark 1.6.3 released + (Nov 07, 2016) + + + Archive + + + +Download Spark + + +Built-in Libraries: + + +SQL and DataFrames +Spark Streaming +MLlib (machine learning) +GraphX (graph) + + Third-Party Projects + + + + +Spark Release 2.1.0 + + +Apache Spark 2.1.0 is the second release on the 2.x line. This release makes significant strides in the production readiness of Structured Streaming, with added support for event time watermarks and Kafka 0.10 support. In addition, this release focuses more on usability, stability, and polish, resolving over 1200 tickets. + +To download Apache Spark 2.1.0, visit the downloads page. You can consult JIRA for the https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420version=12335644;>detailed changes. We have curated a list of high level changes here, grouped by major modules. + + + Core and Spark SQL + Structured Streaming + MLlib + SparkR + GraphX + Deprecations + Changes of behavior + Known Issues + Credits + + +Core and Spark SQL + + + API updates + + SPARK-17864: Data type APIs are stable APIs. 
+ SPARK-18351: from_json and to_json for parsing JSON for string columns + SPARK-16700: When creating a DataFrame in PySpark, Python dictionaries can be used as values of a StructType. + + + Performance and stability + +
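Among the 2.1.0 changes listed in the release notes above, SPARK-18351 adds from_json and to_json for string columns. A small usage sketch follows; the schema, sample JSON, and object name are invented for illustration.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{from_json, to_json}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

    object JsonColumnFunctions {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[2]").appName("json-cols").getOrCreate()
        import spark.implicits._

        val schema = new StructType().add("a", IntegerType).add("b", StringType)
        val df = Seq("""{"a": 1, "b": "x"}""").toDF("json")

        // Parse the JSON string column into a struct column with the given schema...
        val parsed = df.select(from_json($"json", schema).as("parsed"))
        parsed.show(false)

        // ...and serialize the struct column back to a JSON string.
        parsed.select(to_json($"parsed")).show(false)

        spark.stop()
      }
    }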
[5/5] spark-website git commit: Update Spark website for the release of Apache Spark 2.1.0
Update Spark website for the release of Apache Spark 2.1.0 Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/e10180e6 Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/e10180e6 Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/e10180e6 Branch: refs/heads/asf-site Commit: e10180e6784e1d9d8771ef42481687aec0a423a2 Parents: d2bcf18 Author: Yin HuaiAuthored: Wed Dec 28 18:00:05 2016 -0800 Committer: Yin Huai Committed: Thu Dec 29 07:46:11 2016 -0800 -- _layouts/global.html| 2 +- documentation.md| 1 + downloads.md| 6 +- js/downloads.js | 1 + news/_posts/2016-12-28-spark-2-1-0-released.md | 14 + .../_posts/2016-12-28-spark-release-2-1-0.md| 120 ++ site/committers.html| 48 ++- site/community.html | 16 +- site/contributing.html | 28 +- site/developer-tools.html | 22 +- site/docs/latest| 2 +- site/documentation.html | 14 +- site/downloads.html | 14 +- site/examples.html | 100 ++--- site/faq.html | 8 +- site/graphx/index.html | 8 +- site/index.html | 8 +- site/js/downloads.js| 1 + site/mailing-lists.html | 8 +- site/mllib/index.html | 8 +- site/news/amp-camp-2013-registration-ope.html | 8 +- .../news/announcing-the-first-spark-summit.html | 8 +- .../news/fourth-spark-screencast-published.html | 8 +- site/news/index.html| 27 +- site/news/nsdi-paper.html | 8 +- site/news/one-month-to-spark-summit-2015.html | 8 +- .../proposals-open-for-spark-summit-east.html | 8 +- ...registration-open-for-spark-summit-east.html | 8 +- .../news/run-spark-and-shark-on-amazon-emr.html | 8 +- site/news/spark-0-6-1-and-0-5-2-released.html | 8 +- site/news/spark-0-6-2-released.html | 8 +- site/news/spark-0-7-0-released.html | 8 +- site/news/spark-0-7-2-released.html | 8 +- site/news/spark-0-7-3-released.html | 8 +- site/news/spark-0-8-0-released.html | 8 +- site/news/spark-0-8-1-released.html | 8 +- site/news/spark-0-9-0-released.html | 8 +- site/news/spark-0-9-1-released.html | 10 +- site/news/spark-0-9-2-released.html | 10 +- site/news/spark-1-0-0-released.html | 8 +- site/news/spark-1-0-1-released.html | 8 +- site/news/spark-1-0-2-released.html | 8 +- site/news/spark-1-1-0-released.html | 10 +- site/news/spark-1-1-1-released.html | 8 +- site/news/spark-1-2-0-released.html | 8 +- site/news/spark-1-2-1-released.html | 8 +- site/news/spark-1-2-2-released.html | 10 +- site/news/spark-1-3-0-released.html | 8 +- site/news/spark-1-4-0-released.html | 8 +- site/news/spark-1-4-1-released.html | 8 +- site/news/spark-1-5-0-released.html | 8 +- site/news/spark-1-5-1-released.html | 8 +- site/news/spark-1-5-2-released.html | 8 +- site/news/spark-1-6-0-released.html | 8 +- site/news/spark-1-6-1-released.html | 8 +- site/news/spark-1-6-2-released.html | 8 +- site/news/spark-1-6-3-released.html | 8 +- site/news/spark-2-0-0-released.html | 8 +- site/news/spark-2-0-1-released.html | 8 +- site/news/spark-2-0-2-released.html | 8 +- site/news/spark-2-1-0-released.html | 220 +++ site/news/spark-2.0.0-preview.html | 8 +- .../spark-accepted-into-apache-incubator.html | 8 +- site/news/spark-and-shark-in-the-news.html | 10 +- site/news/spark-becomes-tlp.html| 8 +- site/news/spark-featured-in-wired.html | 8 +- .../spark-mailing-lists-moving-to-apache.html | 8 +- site/news/spark-meetups.html| 8 +- site/news/spark-screencasts-published.html | 8 +- site/news/spark-summit-2013-is-a-wrap.html | 8 +- site/news/spark-summit-2014-videos-posted.html | 8 +- site/news/spark-summit-2015-videos-posted.html | 8 +- 
site/news/spark-summit-agenda-posted.html | 8 +- .../spark-summit-east-2015-videos-posted.html | 10 +-
[3/5] spark-website git commit: Update Spark website for the release of Apache Spark 2.1.0
http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/spark-2.0.0-preview.html -- diff --git a/site/news/spark-2.0.0-preview.html b/site/news/spark-2.0.0-preview.html index 64acf16..f135bf2 100644 --- a/site/news/spark-2.0.0-preview.html +++ b/site/news/spark-2.0.0-preview.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -168,9 +171,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/spark-accepted-into-apache-incubator.html -- diff --git a/site/news/spark-accepted-into-apache-incubator.html b/site/news/spark-accepted-into-apache-incubator.html index 57e4881..257e17e 100644 --- a/site/news/spark-accepted-into-apache-incubator.html +++ b/site/news/spark-accepted-into-apache-incubator.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -168,9 +171,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/spark-and-shark-in-the-news.html -- diff --git a/site/news/spark-and-shark-in-the-news.html b/site/news/spark-and-shark-in-the-news.html index 3994fe0..730a667 100644 --- a/site/news/spark-and-shark-in-the-news.html +++ b/site/news/spark-and-shark-in-the-news.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released + (Dec 28, 2016) + Spark wins CloudSort Benchmark as the most efficient engine (Nov 15, 2016) @@ -168,9 +171,6 @@ Spark 1.6.3 released (Nov 07, 2016) - Spark 2.0.1 released - (Oct 03, 2016) - Archive @@ -205,7 +205,7 @@ http://data-informed.com/spark-an-open-source-engine-for-iterative-data-mining/;>DataInformed interviewed two Spark users and wrote about their applications in anomaly detection, predictive analytics and data mining. -In other news, there will be a full day of tutorials on Spark and Shark at the http://strataconf.com/strata2013;>OReilly Strata conference in February. They include a three-hour http://strataconf.com/strata2013/public/schedule/detail/27438;>introduction to Spark, Shark and BDAS Tuesday morning, and a three-hour http://strataconf.com/strata2013/public/schedule/detail/27440;>hands-on exercise session. +In other news, there will be a full day of tutorials on Spark and Shark at the http://strataconf.com/strata2013;>OReilly Strata conference in February. They include a three-hour http://strataconf.com/strata2013/public/schedule/detail/27438;>introduction to Spark, Shark and BDAS Tuesday morning, and a three-hour http://strataconf.com/strata2013/public/schedule/detail/27440;>hands-on exercise session. 
http://git-wip-us.apache.org/repos/asf/spark-website/blob/e10180e6/site/news/spark-becomes-tlp.html -- diff --git a/site/news/spark-becomes-tlp.html b/site/news/spark-becomes-tlp.html index 803c919..7f6d730 100644 --- a/site/news/spark-becomes-tlp.html +++ b/site/news/spark-becomes-tlp.html @@ -106,7 +106,7 @@ Documentation - Latest Release (Spark 2.0.2) + Latest Release (Spark 2.1.0) Older Versions and Other Resources Frequently Asked Questions @@ -159,6 +159,9 @@ Latest News + Spark 2.1.0 released +
[GitHub] spark issue #16233: [SPARK-18801][SQL] Add `View` operator to help resolve a...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16233 Seems it is good to know what issues need to be addressed before we can switch to this new approach.
[GitHub] spark issue #16233: [SPARK-18801][SQL] Add `View` operator to help resolve a...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16233 Changes look good to me. @gatorsmile @hvanhovell what do you think?
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...
Github user yhuai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16233#discussion_r94104622

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
    @@ -510,32 +510,93 @@ class Analyzer(
        * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog.
        */
       object ResolveRelations extends Rule[LogicalPlan] {
    -    private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = {
    +
    +    // If the unresolved relation is running directly on files, we just return the original
    +    // UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog
    +    // and change the default database name if it is a view.
    +    // We usually look up a table from the default database if the table identifier has an empty
    +    // database part, for a view the default database should be the currentDb when the view was
    +    // created. When the case comes to resolving a nested view, the view may have different default
    +    // database with that the referenced view has, so we need to use the variable `defaultDatabase`
    +    // to track the current default database.
    +    // When the relation we resolve is a view, we fetch the view.desc (which is a CatalogTable), and
    +    // then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`,
    +    // we look up the relations that the view references using the default database.
    +    // For example:
    +    // |- view1 (defaultDatabase = db1)
    +    //   |- operator
    +    //     |- table2 (defaultDatabase = db1)
    +    //     |- view2 (defaultDatabase = db2)
    +    //        |- view3 (defaultDatabase = db3)
    +    //   |- view4 (defaultDatabase = db4)
    +    // In this case, the view `view1` is a nested view, it directly references `table2`, `view2`
    +    // and `view4`, the view `view2` references `view3`. On resolving the table, we look up the
    +    // relations `table2`, `view2`, `view4` using the default database `db1`, and look up the
    +    // relation `view3` using the default database `db2`.
    +    //
    +    // Note this is compatible with the views defined by older versions of Spark (before 2.2), which
    +    // have empty defaultDatabase and all the relations in viewText have database part defined.
    +    def resolveRelation(
    +        plan: LogicalPlan,
    +        defaultDatabase: Option[String] = None): LogicalPlan = plan match {
    +      case u @ UnresolvedRelation(table: TableIdentifier, _) if isRunningDirectlyOnFiles(table) =>
    +        u
    +      case u: UnresolvedRelation =>
    +        val relation = lookupTableFromCatalog(u, defaultDatabase)
    +        resolveRelation(relation, defaultDatabase)
    +      // Hive support is required to resolve a persistent view, the logical plan returned by
    +      // catalog.lookupRelation() should be:
    --- End diff --

    oh, we are not parsing view text when we do not have hive support.
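To make the review comment above concrete, here is a small SQL-level sketch of why the analyzer has to track a per-view default database. The database, table, and view names are invented, and persistent views are assumed to require Hive support, as the diff notes.

    import org.apache.spark.sql.SparkSession

    object NestedViewDefaultDatabase {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[2]")
          .appName("nested-views")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql("CREATE DATABASE IF NOT EXISTS db1")
        spark.sql("CREATE DATABASE IF NOT EXISTS db2")
        spark.sql("CREATE TABLE IF NOT EXISTS db2.t (id INT)")

        // v2 is created while db2 is current, so the unqualified `t` in its text means db2.t.
        spark.sql("USE db2")
        spark.sql("CREATE VIEW v2 AS SELECT id FROM t")

        // v1 is created in db1 and references db2.v2. When the analyzer later expands v2's
        // definition while resolving v1, the unqualified `t` inside v2 must still be looked
        // up against db2 (v2's default database), not against the caller's current database.
        spark.sql("USE db1")
        spark.sql("CREATE VIEW v1 AS SELECT id FROM db2.v2")
        spark.sql("SELECT * FROM v1").show()

        spark.stop()
      }
    }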
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r94103444 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +510,93 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default --- End diff -- oh, nvm, even if we have `database.viewname`, dbname and viewname will be inside the table identifier. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r94103404 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +510,93 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default --- End diff -- btw, will we allow cases like `CREATE VIEW database.viewname`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelectCommand
Repository: spark Updated Branches: refs/heads/master 93f35569f -> 7d19b6ab7 [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelectCommand ## What changes were proposed in this pull request? The `CreateDataSourceTableAsSelectCommand` is quite complex now, as it has a lot of work to do if the table already exists: 1. throw exception if we don't want to ignore it. 2. do some check and adjust the schema if we want to append data. 3. drop the table and create it again if we want to overwrite. The work 2 and 3 should be done by analyzer, so that we can also apply it to hive tables. ## How was this patch tested? existing tests. Author: Wenchen FanCloses #15996 from cloud-fan/append. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7d19b6ab Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7d19b6ab Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7d19b6ab Branch: refs/heads/master Commit: 7d19b6ab7d75b95d9eb1c7e1f228d23fd482306e Parents: 93f3556 Author: Wenchen Fan Authored: Wed Dec 28 21:50:21 2016 -0800 Committer: Yin Huai Committed: Wed Dec 28 21:50:21 2016 -0800 -- .../org/apache/spark/sql/DataFrameWriter.scala | 78 + .../command/createDataSourceTables.scala| 167 +-- .../spark/sql/execution/datasources/rules.scala | 164 +- .../sql/hive/MetastoreDataSourcesSuite.scala| 2 +- 4 files changed, 213 insertions(+), 198 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/7d19b6ab/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala index 9c5660a..405f38a 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala @@ -23,11 +23,12 @@ import scala.collection.JavaConverters._ import org.apache.spark.annotation.InterfaceStability import org.apache.spark.sql.catalyst.TableIdentifier -import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation +import org.apache.spark.sql.catalyst.analysis.{EliminateSubqueryAliases, UnresolvedRelation} import org.apache.spark.sql.catalyst.catalog.{BucketSpec, CatalogTable, CatalogTableType} import org.apache.spark.sql.catalyst.plans.logical.InsertIntoTable import org.apache.spark.sql.execution.command.DDLUtils -import org.apache.spark.sql.execution.datasources.{CreateTable, DataSource} +import org.apache.spark.sql.execution.datasources.{CreateTable, DataSource, LogicalRelation} +import org.apache.spark.sql.sources.BaseRelation import org.apache.spark.sql.types.StructType /** @@ -364,7 +365,11 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { throw new AnalysisException("Cannot create hive serde table with saveAsTable API") } -val tableExists = df.sparkSession.sessionState.catalog.tableExists(tableIdent) +val catalog = df.sparkSession.sessionState.catalog +val tableExists = catalog.tableExists(tableIdent) +val db = tableIdent.database.getOrElse(catalog.getCurrentDatabase) +val tableIdentWithDB = tableIdent.copy(database = Some(db)) +val tableName = tableIdentWithDB.unquotedString (tableExists, mode) match { case (true, SaveMode.Ignore) => @@ -373,39 +378,48 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) { case (true, SaveMode.ErrorIfExists) => throw new AnalysisException(s"Table $tableIdent already exists.") - case _ => -val existingTable = if (tableExists) { - 
Some(df.sparkSession.sessionState.catalog.getTableMetadata(tableIdent)) -} else { - None + case (true, SaveMode.Overwrite) => +// Get all input data source relations of the query. +val srcRelations = df.logicalPlan.collect { + case LogicalRelation(src: BaseRelation, _, _) => src } -val storage = if (tableExists) { - existingTable.get.storage -} else { - DataSource.buildStorageFormatFromOptions(extraOptions.toMap) -} -val tableType = if (tableExists) { - existingTable.get.tableType -} else if (storage.locationUri.isDefined) { - CatalogTableType.EXTERNAL -} else { - CatalogTableType.MANAGED +EliminateSubqueryAliases(catalog.lookupRelation(tableIdentWithDB)) match { + // Only do the check if the table is a data source table (the relation is a BaseRelation). + case LogicalRelation(dest: BaseRelation, _, _) if
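As a usage-level sketch of the saveAsTable behaviors the commit message above describes (the table name and data are invented; this is not code from the patch):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object SaveAsTableModes {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[2]").appName("ctas-modes").getOrCreate()
        import spark.implicits._

        val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

        // Default mode is ErrorIfExists: creating the same table twice throws.
        df.write.saveAsTable("t")

        // Append must be compatible with the existing table's schema.
        df.write.mode(SaveMode.Append).saveAsTable("t")

        // Overwrite drops the existing table and recreates it from the query result.
        df.write.mode(SaveMode.Overwrite).saveAsTable("t")

        // Overwriting a table that the query also reads from is rejected with an
        // AnalysisException, e.g.:
        // spark.table("t").write.mode(SaveMode.Overwrite).saveAsTable("t")

        spark.stop()
      }
    }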
[GitHub] spark issue #15996: [SPARK-18567][SQL] Simplify CreateDataSourceTableAsSelec...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15996 LGTM. Merging to master.
[GitHub] spark pull request #16233: [SPARK-18801][SQL] Add `View` operator to help re...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16233#discussion_r94102704 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -510,32 +510,93 @@ class Analyzer( * Replaces [[UnresolvedRelation]]s with concrete relations from the catalog. */ object ResolveRelations extends Rule[LogicalPlan] { -private def lookupTableFromCatalog(u: UnresolvedRelation): LogicalPlan = { + +// If the unresolved relation is running directly on files, we just return the original +// UnresolvedRelation, the plan will get resolved later. Else we look up the table from catalog +// and change the default database name if it is a view. +// We usually look up a table from the default database if the table identifier has an empty +// database part, for a view the default database should be the currentDb when the view was +// created. When the case comes to resolving a nested view, the view may have different default +// database with that the referenced view has, so we need to use the variable `defaultDatabase` +// to track the current default database. +// When the relation we resolve is a view, we fetch the view.desc(which is a CatalogTable), and +// then set the value of `CatalogTable.viewDefaultDatabase` to the variable `defaultDatabase`, +// we look up the relations that the view references using the default database. +// For example: +// |- view1 (defaultDatabase = db1) +// |- operator +// |- table2 (defaultDatabase = db1) +// |- view2 (defaultDatabase = db2) +//|- view3 (defaultDatabase = db3) +// |- view4 (defaultDatabase = db4) +// In this case, the view `view1` is a nested view, it directly references `table2`ã`view2` +// and `view4`, the view `view2` references `view3`. On resolving the table, we look up the +// relations `table2`ã`view2`ã`view4` using the default database `db1`, and look up the +// relation `view3` using the default database `db2`. +// +// Note this is compatible with the views defined by older versions of Spark(before 2.2), which +// have empty defaultDatabase and all the relations in viewText have database part defined. +def resolveRelation( +plan: LogicalPlan, +defaultDatabase: Option[String] = None): LogicalPlan = plan match { + case u @ UnresolvedRelation(table: TableIdentifier, _) if isRunningDirectlyOnFiles(table) => +u + case u: UnresolvedRelation => +val relation = lookupTableFromCatalog(u, defaultDatabase) +resolveRelation(relation, defaultDatabase) + // Hive support is required to resolve a persistent view, the logical plan returned by + // catalog.lookupRelation() should be: --- End diff -- Sorry. I probably missed something. A persistent view is stored in the external catalog. So, we can always have persistent views, right? (we have a InMemoryCatalog, which is another external catalog. It is not very useful though. But it is still an external catalog) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[06/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/sql-programming-guide.html -- diff --git a/site/docs/2.1.0/sql-programming-guide.html b/site/docs/2.1.0/sql-programming-guide.html index 17f5981..4534a98 100644 --- a/site/docs/2.1.0/sql-programming-guide.html +++ b/site/docs/2.1.0/sql-programming-guide.html @@ -127,95 +127,95 @@ - Overview - SQL - Datasets and DataFrames + Overview + SQL + Datasets and DataFrames - Getting Started - Starting Point: SparkSession - Creating DataFrames - Untyped Dataset Operations (aka DataFrame Operations) - Running SQL Queries Programmatically - Global Temporary View - Creating Datasets - Interoperating with RDDs - Inferring the Schema Using Reflection - Programmatically Specifying the Schema + Getting Started + Starting Point: SparkSession + Creating DataFrames + Untyped Dataset Operations (aka DataFrame Operations) + Running SQL Queries Programmatically + Global Temporary View + Creating Datasets + Interoperating with RDDs + Inferring the Schema Using Reflection + Programmatically Specifying the Schema - Data Sources - Generic Load/Save Functions - Manually Specifying Options - Run SQL on files directly - Save Modes - Saving to Persistent Tables + Data Sources + Generic Load/Save Functions + Manually Specifying Options + Run SQL on files directly + Save Modes + Saving to Persistent Tables - Parquet Files - Loading Data Programmatically - Partition Discovery - Schema Merging - Hive metastore Parquet table conversion - Hive/Parquet Schema Reconciliation - Metadata Refreshing + Parquet Files + Loading Data Programmatically + Partition Discovery + Schema Merging + Hive metastore Parquet table conversion + Hive/Parquet Schema Reconciliation + Metadata Refreshing - Configuration + Configuration - JSON Datasets - Hive Tables - Interacting with Different Versions of Hive Metastore + JSON Datasets + Hive Tables + Interacting with Different Versions of Hive Metastore - JDBC To Other Databases - Troubleshooting + JDBC To Other Databases + Troubleshooting - Performance Tuning - Caching Data In Memory - Other Configuration Options + Performance Tuning + Caching Data In Memory + Other Configuration Options - Distributed SQL Engine - Running the Thrift JDBC/ODBC server - Running the Spark SQL CLI + Distributed SQL Engine + Running the Thrift JDBC/ODBC server + Running the Spark SQL CLI - Migration Guide - Upgrading From Spark SQL 2.0 to 2.1 - Upgrading From Spark SQL 1.6 to 2.0 - Upgrading From Spark SQL 1.5 to 1.6 - Upgrading From Spark SQL 1.4 to 1.5 - Upgrading from Spark SQL 1.3 to 1.4 - DataFrame data reader/writer interface - DataFrame.groupBy retains grouping columns - Behavior change on DataFrame.withColumn + Migration Guide + Upgrading From Spark SQL 2.0 to 2.1 + Upgrading From Spark SQL 1.6 to 2.0 + Upgrading From Spark SQL 1.5 to 1.6 + Upgrading From Spark SQL 1.4 to 1.5 + Upgrading from Spark SQL 1.3 to 1.4 + DataFrame data reader/writer interface + DataFrame.groupBy retains grouping columns + Behavior change on DataFrame.withColumn - Upgrading from Spark SQL 1.0-1.2 to 1.3 - Rename of SchemaRDD to DataFrame - Unification of the Java and Scala APIs - Isolation of Implicit Conversions and Removal of dsl Package (Scala-only) - Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only) - UDF Registration Moved to sqlContext.udf (Java Scala) - Python DataTypes No Longer Singletons + Upgrading from Spark SQL 1.0-1.2 to 1.3 + Rename of SchemaRDD to DataFrame + Unification of the Java and Scala APIs + Isolation of 
Implicit Conversions and Removal of dsl Package (Scala-only) + Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only) + UDF Registration Moved to sqlContext.udf (Java Scala) + Python DataTypes No Longer
[19/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-migration-guides.html -- diff --git a/site/docs/2.1.0/ml-migration-guides.html b/site/docs/2.1.0/ml-migration-guides.html index 5e8a913..24dfc31 100644 --- a/site/docs/2.1.0/ml-migration-guides.html +++ b/site/docs/2.1.0/ml-migration-guides.html @@ -344,21 +344,21 @@ for converting to mllib.linalg types. -import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.mllib.util.MLUtils // convert DataFrame columns val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF) val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF) // convert a single vector or matrix val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML -val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML +val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML Refer to the MLUtils Scala docs for further detail. -import org.apache.spark.mllib.util.MLUtils; +import org.apache.spark.mllib.util.MLUtils; import org.apache.spark.sql.Dataset; // convert DataFrame columns @@ -366,21 +366,21 @@ for converting to mllib.linalg types. DatasetRow convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF); // convert a single vector or matrix org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML(); -org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML(); +org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML(); Refer to the MLUtils Java docs for further detail. -from pyspark.mllib.util import MLUtils +from pyspark.mllib.util import MLUtils -# convert DataFrame columns +# convert DataFrame columns convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF) convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF) -# convert a single vector or matrix +# convert a single vector or matrix mlVec = mllibVec.asML() -mlMat = mllibMat.asML() +mlMat = mllibMat.asML() Refer to the MLUtils Python docs for further detail. http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-pipeline.html -- diff --git a/site/docs/2.1.0/ml-pipeline.html b/site/docs/2.1.0/ml-pipeline.html index fe17564..b57afde 100644 --- a/site/docs/2.1.0/ml-pipeline.html +++ b/site/docs/2.1.0/ml-pipeline.html @@ -331,27 +331,27 @@ machine learning pipelines. Table of Contents - Main concepts in Pipelines - DataFrame - Pipeline components - Transformers - Estimators - Properties of pipeline components + Main concepts in Pipelines + DataFrame + Pipeline components + Transformers + Estimators + Properties of pipeline components - Pipeline - How it works - Details + Pipeline + How it works + Details - Parameters - Saving and Loading Pipelines + Parameters + Saving and Loading Pipelines - Code examples - Example: Estimator, Transformer, and Param - Example: Pipeline - Model selection (hyperparameter tuning) + Code examples + Example: Estimator, Transformer, and Param + Example: Pipeline + Model selection (hyperparameter tuning) @@ -541,7 +541,7 @@ Refer to the [`Estimator` Scala docs](api/scala/index.html#org.apache.spark.ml.E the [`Transformer` Scala docs](api/scala/index.html#org.apache.spark.ml.Transformer) and the [`Params` Scala docs](api/scala/index.html#org.apache.spark.ml.param.Params) for details on the API. 
-import org.apache.spark.ml.classification.LogisticRegression +import org.apache.spark.ml.classification.LogisticRegression import org.apache.spark.ml.linalg.{Vector, Vectors} import org.apache.spark.ml.param.ParamMap import org.apache.spark.sql.Row @@ -601,7 +601,7 @@ the [`Params` Scala docs](api/scala/index.html#org.apache.spark.ml.param.Params) .select(features, label, myProbability, prediction) .collect() .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) = -println(s($features, $label) - prob=$prob, prediction=$prediction) +println(s($features, $label) - prob=$prob, prediction=$prediction) } Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala" in the Spark repo. @@ -612,7 +612,7 @@ Refer to the [`Estimator` Java docs](api/java/org/apache/spark/ml/Estimator.html the [`Transformer` Java docs](api/java/org/apache/spark/ml/Transformer.html) and the [`Params` Java docs](api/java/org/apache/spark/ml/param/Params.html) for details on the API. -import java.util.Arrays; +import java.util.Arrays; import java.util.List; import
[01/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
Repository: spark-website Updated Branches: refs/heads/asf-site ecf94f284 -> d2bcf1854 http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/submitting-applications.html -- diff --git a/site/docs/2.1.0/submitting-applications.html b/site/docs/2.1.0/submitting-applications.html index fc18fa9..0c91739 100644 --- a/site/docs/2.1.0/submitting-applications.html +++ b/site/docs/2.1.0/submitting-applications.html @@ -151,14 +151,14 @@ packaging them into a .zip or .egg. This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports: -./bin/spark-submit \ +./bin/spark-submit \ --class main-class \ --master master-url \ --deploy-mode deploy-mode \ --conf key=value \ - ... # other options + ... # other options application-jar \ - [application-arguments] + [application-arguments] Some of the commonly used options are: @@ -194,23 +194,23 @@ you can also specify --supervise to make sure that the driver is au fails with non-zero exit code. To enumerate all such options available to spark-submit, run it with --help. Here are a few examples of common options: -# Run application locally on 8 cores +# Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ - --master local[8] \ + --master local[8] \ /path/to/examples.jar \ - 100 + 100 -# Run on a Spark standalone cluster in client deploy mode +# Run on a Spark standalone cluster in client deploy mode ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master spark://207.184.161.138:7077 \ --executor-memory 20G \ --total-executor-cores 100 \ /path/to/examples.jar \ - 1000 + 1000 -# Run on a Spark standalone cluster in cluster deploy mode with supervise +# Run on a Spark standalone cluster in cluster deploy mode with supervise ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master spark://207.184.161.138:7077 \ @@ -219,26 +219,26 @@ run it with --help. Here are a few examples of common options: --executor-memory 20G \ --total-executor-cores 100 \ /path/to/examples.jar \ - 1000 + 1000 -# Run on a YARN cluster -export HADOOP_CONF_DIR=XXX +# Run on a YARN cluster +export HADOOP_CONF_DIR=XXX ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master yarn \ - --deploy-mode cluster \ # can be client for client mode + --deploy-mode cluster \ # can be client for client mode --executor-memory 20G \ --num-executors 50 \ /path/to/examples.jar \ - 1000 + 1000 -# Run a Python application on a Spark standalone cluster +# Run a Python application on a Spark standalone cluster ./bin/spark-submit \ --master spark://207.184.161.138:7077 \ examples/src/main/python/pi.py \ - 1000 + 1000 -# Run on a Mesos cluster in cluster deploy mode with supervise +# Run on a Mesos cluster in cluster deploy mode with supervise ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master mesos://207.184.161.138:7077 \ @@ -247,7 +247,7 @@ run it with --help. 
Here are a few examples of common options: --executor-memory 20G \ --total-executor-cores 100 \ http://path/to/examples.jar \ - 1000 + 1000 Master URLs http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/tuning.html -- diff --git a/site/docs/2.1.0/tuning.html b/site/docs/2.1.0/tuning.html index ca4ad9f..33a6316 100644 --- a/site/docs/2.1.0/tuning.html +++ b/site/docs/2.1.0/tuning.html @@ -129,23 +129,23 @@ - Data Serialization - Memory Tuning - Memory Management Overview - Determining Memory Consumption - Tuning Data Structures - Serialized RDD Storage - Garbage Collection Tuning + Data Serialization + Memory Tuning + Memory Management Overview + Determining Memory Consumption + Tuning Data Structures + Serialized RDD Storage + Garbage Collection Tuning - Other Considerations - Level of Parallelism - Memory Usage of Reduce Tasks - Broadcasting Large Variables - Data Locality + Other Considerations + Level of Parallelism + Memory Usage of Reduce Tasks + Broadcasting Large Variables + Data Locality - Summary + Summary Because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked @@ -194,9 +194,9 @@ in the AllScalaRegistrar from the https://github.com/twitter/chill;>Twi To register your own custom classes with Kryo, use the registerKryoClasses method. -val conf = new SparkConf().setMaster(...).setAppName(...) +val conf = new SparkConf().setMaster(...).setAppName(...)
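[Editor's note] The Kryo snippet in the tuning.html hunk above is missing its quotes and class references. A cleaned-up sketch of registering custom classes with Kryo is below; MyClass1 and MyClass2 are placeholder types standing in for your own application classes.

import org.apache.spark.{SparkConf, SparkContext}

case class MyClass1(x: Int)
case class MyClass2(s: String)

val conf = new SparkConf().setMaster("local[*]").setAppName("KryoExample")
// Register application classes so Kryo does not have to store full class names with each object.
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)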
[25/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294 This version is built from the docs source code generated by applying https://github.com/apache/spark/pull/16294 to v2.1.0 (so, other changes in branch 2.1 will not affect the doc). Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/d2bcf185 Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/d2bcf185 Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/d2bcf185 Branch: refs/heads/asf-site Commit: d2bcf1854b0e0409495e2f1d3c6beaad923f6e6b Parents: ecf94f2 Author: Yin HuaiAuthored: Wed Dec 28 14:32:43 2016 -0800 Committer: Yin Huai Committed: Wed Dec 28 14:32:43 2016 -0800 -- site/docs/2.1.0/building-spark.html | 46 +- site/docs/2.1.0/building-with-maven.html| 14 +- site/docs/2.1.0/configuration.html | 52 +- site/docs/2.1.0/ec2-scripts.html| 174 site/docs/2.1.0/graphx-programming-guide.html | 198 ++--- site/docs/2.1.0/hadoop-provided.html| 14 +- .../img/structured-streaming-watermark.png | Bin 0 -> 252000 bytes site/docs/2.1.0/img/structured-streaming.pptx | Bin 1105413 -> 1113902 bytes site/docs/2.1.0/job-scheduling.html | 40 +- site/docs/2.1.0/ml-advanced.html| 10 +- .../2.1.0/ml-classification-regression.html | 838 +- site/docs/2.1.0/ml-clustering.html | 124 +-- site/docs/2.1.0/ml-collaborative-filtering.html | 56 +- site/docs/2.1.0/ml-features.html| 764 site/docs/2.1.0/ml-migration-guides.html| 16 +- site/docs/2.1.0/ml-pipeline.html| 178 ++-- site/docs/2.1.0/ml-tuning.html | 172 ++-- site/docs/2.1.0/mllib-clustering.html | 186 ++-- .../2.1.0/mllib-collaborative-filtering.html| 48 +- site/docs/2.1.0/mllib-data-types.html | 208 ++--- site/docs/2.1.0/mllib-decision-tree.html| 94 +- .../2.1.0/mllib-dimensionality-reduction.html | 28 +- site/docs/2.1.0/mllib-ensembles.html| 182 ++-- site/docs/2.1.0/mllib-evaluation-metrics.html | 302 +++ site/docs/2.1.0/mllib-feature-extraction.html | 122 +-- .../2.1.0/mllib-frequent-pattern-mining.html| 28 +- site/docs/2.1.0/mllib-isotonic-regression.html | 38 +- site/docs/2.1.0/mllib-linear-methods.html | 174 ++-- site/docs/2.1.0/mllib-naive-bayes.html | 24 +- site/docs/2.1.0/mllib-optimization.html | 50 +- site/docs/2.1.0/mllib-pmml-model-export.html| 35 +- site/docs/2.1.0/mllib-statistics.html | 180 ++-- site/docs/2.1.0/programming-guide.html | 302 +++ site/docs/2.1.0/quick-start.html| 166 ++-- site/docs/2.1.0/running-on-mesos.html | 52 +- site/docs/2.1.0/running-on-yarn.html| 27 +- site/docs/2.1.0/spark-standalone.html | 30 +- site/docs/2.1.0/sparkr.html | 145 ++-- site/docs/2.1.0/sql-programming-guide.html | 819 +- site/docs/2.1.0/storage-openstack-swift.html| 12 +- site/docs/2.1.0/streaming-custom-receivers.html | 26 +- .../2.1.0/streaming-kafka-0-10-integration.html | 52 +- .../docs/2.1.0/streaming-programming-guide.html | 416 - .../structured-streaming-kafka-integration.html | 44 +- .../structured-streaming-programming-guide.html | 864 --- site/docs/2.1.0/submitting-applications.html| 36 +- site/docs/2.1.0/tuning.html | 30 +- 47 files changed, 3926 insertions(+), 3490 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/building-spark.html -- diff --git a/site/docs/2.1.0/building-spark.html b/site/docs/2.1.0/building-spark.html index b3a720c..5c20245 100644 --- a/site/docs/2.1.0/building-spark.html +++ b/site/docs/2.1.0/building-spark.html @@ -127,33 +127,33 @@ - Building Apache Spark - Apache Maven - Setting up Mavens Memory 
Usage - build/mvn + Building Apache Spark + Apache Maven + Setting up Maven's Memory Usage + build/mvn - Building a Runnable Distribution - Specifying the Hadoop Version - Building With Hive and JDBC Support - Packaging without Hadoop Dependencies for YARN - Building with Mesos support - Building for Scala 2.10 - Building submodules individually - Continuous Compilation - Speeding up Compilation with Zinc - Building with SBT - Encrypted
[11/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-linear-methods.html -- diff --git a/site/docs/2.1.0/mllib-linear-methods.html b/site/docs/2.1.0/mllib-linear-methods.html index 46a1a25..428d778 100644 --- a/site/docs/2.1.0/mllib-linear-methods.html +++ b/site/docs/2.1.0/mllib-linear-methods.html @@ -307,23 +307,23 @@ - Mathematical formulation - Loss functions - Regularizers - Optimization + Mathematical formulation + Loss functions + Regularizers + Optimization - Classification - Linear Support Vector Machines (SVMs) - Logistic regression + Classification + Linear Support Vector Machines (SVMs) + Logistic regression - Regression - Linear least squares, Lasso, and ridge regression - Streaming linear regression + Regression + Linear least squares, Lasso, and ridge regression + Streaming linear regression - Implementation (developer) + Implementation (developer) \[ @@ -489,7 +489,7 @@ error. Refer to the SVMWithSGD Scala docs and SVMModel Scala docs for details on the API. -import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD} +import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD} import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics import org.apache.spark.mllib.util.MLUtils @@ -534,14 +534,14 @@ this way as well. For example, the following code produces an L1 regularized variant of SVMs with regularization parameter set to 0.1, and runs the training algorithm for 200 iterations. -import org.apache.spark.mllib.optimization.L1Updater +import org.apache.spark.mllib.optimization.L1Updater val svmAlg = new SVMWithSGD() svmAlg.optimizer .setNumIterations(200) .setRegParam(0.1) .setUpdater(new L1Updater) -val modelL1 = svmAlg.run(training) +val modelL1 = svmAlg.run(training) @@ -554,7 +554,7 @@ that is equivalent to the provided example in Scala is given below: Refer to the SVMWithSGD Java docs and SVMModel Java docs for details on the API. -import scala.Tuple2; +import scala.Tuple2; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.function.Function; @@ -591,7 +591,7 @@ that is equivalent to the provided example in Scala is given below: // Get evaluation metrics. BinaryClassificationMetrics metrics = - new BinaryClassificationMetrics(JavaRDD.toRDD(scoreAndLabels)); + new BinaryClassificationMetrics(JavaRDD.toRDD(scoreAndLabels)); double auROC = metrics.areaUnderROC(); System.out.println(Area under ROC = + auROC); @@ -610,14 +610,14 @@ this way as well. For example, the following code produces an L1 regularized variant of SVMs with regularization parameter set to 0.1, and runs the training algorithm for 200 iterations. -import org.apache.spark.mllib.optimization.L1Updater; +import org.apache.spark.mllib.optimization.L1Updater; -SVMWithSGD svmAlg = new SVMWithSGD(); +SVMWithSGD svmAlg = new SVMWithSGD(); svmAlg.optimizer() .setNumIterations(200) .setRegParam(0.1) - .setUpdater(new L1Updater()); -final SVMModel modelL1 = svmAlg.run(training.rdd()); + .setUpdater(new L1Updater()); +final SVMModel modelL1 = svmAlg.run(training.rdd()); In order to run the above application, follow the instructions provided in the Self-Contained @@ -632,28 +632,28 @@ and make predictions with the resulting model to compute the training error. Refer to the SVMWithSGD Python docs and SVMModel Python docs for more details on the API. 
-from pyspark.mllib.classification import SVMWithSGD, SVMModel +from pyspark.mllib.classification import SVMWithSGD, SVMModel from pyspark.mllib.regression import LabeledPoint -# Load and parse the data +# Load and parse the data def parsePoint(line): -values = [float(x) for x in line.split( )] +values = [float(x) for x in line.split( )] return LabeledPoint(values[0], values[1:]) -data = sc.textFile(data/mllib/sample_svm_data.txt) +data = sc.textFile(data/mllib/sample_svm_data.txt) parsedData = data.map(parsePoint) -# Build the model +# Build the model model = SVMWithSGD.train(parsedData, iterations=100) -# Evaluating the model on training data +# Evaluating the model on training data labelsAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features))) trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() / float(parsedData.count()) -print(Training Error = + str(trainErr)) +print(Training Error = + str(trainErr)) -# Save and load model -model.save(sc, target/tmp/pythonSVMWithSGDModel) -sameModel = SVMModel.load(sc, target/tmp/pythonSVMWithSGDModel) +# Save and load model +model.save(sc, target/tmp/pythonSVMWithSGDModel) +sameModel =
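[Editor's note] The SVM examples in this hunk (Scala, Java and Python) have their string literals stripped. A cleaned Scala sketch of training a linear SVM with SGD, then swapping in L1 regularization via the optimizer, is below; it assumes the LIBSVM sample data path used throughout the docs.

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.optimization.L1Updater
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
training.cache()

// Default L2-regularized hinge-loss SVM trained with SGD.
val model = SVMWithSGD.train(training, 100)
// Clear the default threshold so predict() returns raw scores.
model.clearThreshold()

// L1-regularized variant: regParam 0.1, 200 iterations.
val svmAlg = new SVMWithSGD()
svmAlg.optimizer
  .setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new L1Updater)
val modelL1 = svmAlg.run(training)

// Evaluate the default model on the held-out split.
val scoreAndLabels = test.map(point => (model.predict(point.features), point.label))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"Area under ROC = ${metrics.areaUnderROC()}")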
[23/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/hadoop-provided.html -- diff --git a/site/docs/2.1.0/hadoop-provided.html b/site/docs/2.1.0/hadoop-provided.html index ff7afb7..9d77cf0 100644 --- a/site/docs/2.1.0/hadoop-provided.html +++ b/site/docs/2.1.0/hadoop-provided.html @@ -133,16 +133,16 @@ Apache Hadoop For Apache distributions, you can use Hadoops classpath command. For instance: -### in conf/spark-env.sh ### +### in conf/spark-env.sh ### -# If hadoop binary is on your PATH -export SPARK_DIST_CLASSPATH=$(hadoop classpath) +# If hadoop binary is on your PATH +export SPARK_DIST_CLASSPATH=$(hadoop classpath) -# With explicit path to hadoop binary -export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath) +# With explicit path to hadoop binary +export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath) -# Passing a Hadoop configuration directory -export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath) +# Passing a Hadoop configuration directory +export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath) http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/img/structured-streaming-watermark.png -- diff --git a/site/docs/2.1.0/img/structured-streaming-watermark.png b/site/docs/2.1.0/img/structured-streaming-watermark.png new file mode 100644 index 000..f21fbda Binary files /dev/null and b/site/docs/2.1.0/img/structured-streaming-watermark.png differ http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/img/structured-streaming.pptx -- diff --git a/site/docs/2.1.0/img/structured-streaming.pptx b/site/docs/2.1.0/img/structured-streaming.pptx index 6aad2ed..f5bdfc0 100644 Binary files a/site/docs/2.1.0/img/structured-streaming.pptx and b/site/docs/2.1.0/img/structured-streaming.pptx differ http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/job-scheduling.html -- diff --git a/site/docs/2.1.0/job-scheduling.html b/site/docs/2.1.0/job-scheduling.html index 53161c2..9651607 100644 --- a/site/docs/2.1.0/job-scheduling.html +++ b/site/docs/2.1.0/job-scheduling.html @@ -127,24 +127,24 @@ - Overview - Scheduling Across Applications - Dynamic Resource Allocation - Configuration and Setup - Resource Allocation Policy - Request Policy - Remove Policy + Overview + Scheduling Across Applications + Dynamic Resource Allocation + Configuration and Setup + Resource Allocation Policy + Request Policy + Remove Policy - Graceful Decommission of Executors + Graceful Decommission of Executors - Scheduling Within an Application - Fair Scheduler Pools - Default Behavior of Pools - Configuring Pool Properties + Scheduling Within an Application + Fair Scheduler Pools + Default Behavior of Pools + Configuring Pool Properties @@ -321,9 +321,9 @@ mode is best for multi-user settings. To enable the fair scheduler, simply set the spark.scheduler.mode property to FAIR when configuring a SparkContext: -val conf = new SparkConf().setMaster(...).setAppName(...) +val conf = new SparkConf().setMaster(...).setAppName(...) conf.set(spark.scheduler.mode, FAIR) -val sc = new SparkContext(conf) +val sc = new SparkContext(conf) Fair Scheduler Pools @@ -337,15 +337,15 @@ many concurrent jobs they have instead of giving jobs equal shares. Thi adding the spark.scheduler.pool local property to the SparkContext in the thread thats submitting them. 
This is done as follows: -// Assuming sc is your SparkContext variable -sc.setLocalProperty("spark.scheduler.pool", "pool1") +// Assuming sc is your SparkContext variable +sc.setLocalProperty("spark.scheduler.pool", "pool1") After setting this local property, all jobs submitted within this thread (by calls in this thread to RDD.save, count, collect, etc) will use this pool name. The setting is per-thread to make it easy to have a thread run multiple jobs on behalf of the same user. If you'd like to clear the pool that a thread is associated with, simply call: -sc.setLocalProperty("spark.scheduler.pool", null) +sc.setLocalProperty("spark.scheduler.pool", null) Default Behavior of Pools @@ -379,12 +379,12 @@ of the cluster. By default, each pool's minShare is 0. and setting a spark.scheduler.allocation.file property in your SparkConf. -conf.set("spark.scheduler.allocation.file",
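[Editor's note] Putting the job-scheduling.html snippets above back together, a minimal Scala sketch of enabling FAIR scheduling and assigning jobs to a pool is below; the pool name "pool1" and the allocation-file path are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("FairSchedulingExample")
conf.set("spark.scheduler.mode", "FAIR")
// Optional: point at an XML file describing pool weights and minimum shares.
conf.set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
val sc = new SparkContext(conf)

// Jobs submitted from this thread go to pool1; pass null to clear the association.
sc.setLocalProperty("spark.scheduler.pool", "pool1")
// ... run jobs on behalf of this user ...
sc.setLocalProperty("spark.scheduler.pool", null)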
[12/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-feature-extraction.html -- diff --git a/site/docs/2.1.0/mllib-feature-extraction.html b/site/docs/2.1.0/mllib-feature-extraction.html index 4726b37..f8cd98e 100644 --- a/site/docs/2.1.0/mllib-feature-extraction.html +++ b/site/docs/2.1.0/mllib-feature-extraction.html @@ -307,32 +307,32 @@ - TF-IDF - Word2Vec - Model - Example + TF-IDF + Word2Vec + Model + Example - StandardScaler - Model Fitting - Example + StandardScaler + Model Fitting + Example - Normalizer - Example + Normalizer + Example - ChiSqSelector - Model Fitting - Example + ChiSqSelector + Model Fitting + Example - ElementwiseProduct - Example + ElementwiseProduct + Example - PCA - Example + PCA + Example @@ -390,7 +390,7 @@ Each record could be an iterable of strings or other types. Refer to the HashingTF Scala docs for details on the API. -import org.apache.spark.mllib.feature.{HashingTF, IDF} +import org.apache.spark.mllib.feature.{HashingTF, IDF} import org.apache.spark.mllib.linalg.Vector import org.apache.spark.rdd.RDD @@ -424,24 +424,24 @@ Each record could be an iterable of strings or other types. Refer to the HashingTF Python docs for details on the API. -from pyspark.mllib.feature import HashingTF, IDF +from pyspark.mllib.feature import HashingTF, IDF -# Load documents (one per line). -documents = sc.textFile(data/mllib/kmeans_data.txt).map(lambda line: line.split( )) +# Load documents (one per line). +documents = sc.textFile(data/mllib/kmeans_data.txt).map(lambda line: line.split( )) hashingTF = HashingTF() tf = hashingTF.transform(documents) -# While applying HashingTF only needs a single pass to the data, applying IDF needs two passes: -# First to compute the IDF vector and second to scale the term frequencies by IDF. +# While applying HashingTF only needs a single pass to the data, applying IDF needs two passes: +# First to compute the IDF vector and second to scale the term frequencies by IDF. tf.cache() idf = IDF().fit(tf) tfidf = idf.transform(tf) -# spark.mllibs IDF implementation provides an option for ignoring terms -# which occur in less than a minimum number of documents. -# In such cases, the IDF for these terms is set to 0. -# This feature can be used by passing the minDocFreq value to the IDF constructor. +# spark.mllibs IDF implementation provides an option for ignoring terms +# which occur in less than a minimum number of documents. +# In such cases, the IDF for these terms is set to 0. +# This feature can be used by passing the minDocFreq value to the IDF constructor. idfIgnore = IDF(minDocFreq=2).fit(tf) tfidfIgnore = idfIgnore.transform(tf) @@ -467,7 +467,7 @@ skip-gram model is to maximize the average log-likelihood \[ \frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t) \] -where $k$ is the size of the training window. +where $k$ is the size of the training window. In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$ which are vector representations of $w$ as word and context respectively. The probability of correctly @@ -475,7 +475,7 @@ predicting word $w_i$ given word $w_j$ is determined by the softmax model, which \[ p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})} \] -where $V$ is the vocabulary size. +where $V$ is the vocabulary size. The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$ is proportional to $V$, which can be easily in order of millions. 
To speed up training of Word2Vec, @@ -488,13 +488,13 @@ $O(\log(V))$ construct a Word2Vec instance and then fit a Word2VecModel with the input data. Finally, we display the top 40 synonyms of the specified word. To run the example, first download the http://mattmahoney.net/dc/text8.zip;>text8 data and extract it to your preferred directory. -Here we assume the extracted file is text8 and in same directory as you run the spark shell. +Here we assume the extracted file is text8 and in same directory as you run the spark shell. Refer to the Word2Vec Scala docs for details on the API. -import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel} +import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel} val input = sc.textFile(data/mllib/sample_lda_data.txt).map(line = line.split( ).toSeq) @@ -505,7 +505,7 @@ Here we assume the extracted file is text8 and in same directory as val synonyms = model.findSynonyms(1, 5) for((synonym, cosineSimilarity) - synonyms) { -
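[Editor's note] The Word2Vec example in this hunk lost its quoting. A cleaned Scala sketch is below; the text8 path, the query word "china", and the save path are the illustrative values used by the docs.

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val input = sc.textFile("text8").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()
val model = word2vec.fit(input)

// Print the most similar words to a query word with their cosine similarities.
val synonyms = model.findSynonyms("china", 40)
for ((synonym, cosineSimilarity) <- synonyms) {
  println(s"$synonym $cosineSimilarity")
}

// Models can be saved and reloaded.
model.save(sc, "myModelPath")
val sameModel = Word2VecModel.load(sc, "myModelPath")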
[17/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-clustering.html -- diff --git a/site/docs/2.1.0/mllib-clustering.html b/site/docs/2.1.0/mllib-clustering.html index 9667606..1b50dab 100644 --- a/site/docs/2.1.0/mllib-clustering.html +++ b/site/docs/2.1.0/mllib-clustering.html @@ -366,12 +366,12 @@ models are trained for each cluster). The spark.mllib package supports the following models: - K-means - Gaussian mixture - Power iteration clustering (PIC) - Latent Dirichlet allocation (LDA) - Bisecting k-means - Streaming k-means + K-means + Gaussian mixture + Power iteration clustering (PIC) + Latent Dirichlet allocation (LDA) + Bisecting k-means + Streaming k-means K-means @@ -408,7 +408,7 @@ optimal k is usually one where there is an elbow in the W Refer to the KMeans Scala docs and KMeansModel Scala docs for details on the API. -import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} +import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} import org.apache.spark.mllib.linalg.Vectors // Load and parse the data @@ -440,7 +440,7 @@ that is equivalent to the provided example in Scala is given below: Refer to the KMeans Java docs and KMeansModel Java docs for details on the API. -import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.function.Function; import org.apache.spark.mllib.clustering.KMeans; import org.apache.spark.mllib.clustering.KMeansModel; @@ -470,7 +470,7 @@ that is equivalent to the provided example in Scala is given below: KMeansModel clusters = KMeans.train(parsedData.rdd(), numClusters, numIterations); System.out.println(Cluster centers:); -for (Vector center: clusters.clusterCenters()) { +for (Vector center: clusters.clusterCenters()) { System.out.println( + center); } double cost = clusters.computeCost(parsedData.rdd()); @@ -498,29 +498,29 @@ fact the optimal k is usually one where there is an elbow Refer to the KMeans Python docs and KMeansModel Python docs for more details on the API. 
-from numpy import array +from numpy import array from math import sqrt from pyspark.mllib.clustering import KMeans, KMeansModel -# Load and parse the data -data = sc.textFile(data/mllib/kmeans_data.txt) -parsedData = data.map(lambda line: array([float(x) for x in line.split( )])) +# Load and parse the data +data = sc.textFile(data/mllib/kmeans_data.txt) +parsedData = data.map(lambda line: array([float(x) for x in line.split( )])) -# Build the model (cluster the data) -clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode=random) +# Build the model (cluster the data) +clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode=random) -# Evaluate clustering by computing Within Set Sum of Squared Errors +# Evaluate clustering by computing Within Set Sum of Squared Errors def error(point): center = clusters.centers[clusters.predict(point)] return sqrt(sum([x**2 for x in (point - center)])) WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y) -print(Within Set Sum of Squared Error = + str(WSSSE)) +print(Within Set Sum of Squared Error = + str(WSSSE)) -# Save and load model -clusters.save(sc, target/org/apache/spark/PythonKMeansExample/KMeansModel) -sameModel = KMeansModel.load(sc, target/org/apache/spark/PythonKMeansExample/KMeansModel) +# Save and load model +clusters.save(sc, target/org/apache/spark/PythonKMeansExample/KMeansModel) +sameModel = KMeansModel.load(sc, target/org/apache/spark/PythonKMeansExample/KMeansModel) Find full example code at "examples/src/main/python/mllib/k_means_example.py" in the Spark repo. @@ -554,7 +554,7 @@ to the algorithm. We then output the parameters of the mixture model. Refer to the GaussianMixture Scala docs and GaussianMixtureModel Scala docs for details on the API. -import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel} +import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel} import org.apache.spark.mllib.linalg.Vectors // Load and parse the data @@ -587,7 +587,7 @@ that is equivalent to the provided example in Scala is given below: Refer to the GaussianMixture Java docs and GaussianMixtureModel Java docs for details on the API. -import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.function.Function; import org.apache.spark.mllib.clustering.GaussianMixture; import org.apache.spark.mllib.clustering.GaussianMixtureModel; @@ -612,7 +612,7 @@ that is equivalent to the provided example in Scala is given below: parsedData.cache(); // Cluster the data into two classes using GaussianMixture -GaussianMixtureModel
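[Editor's note] The k-means snippets in the mllib-clustering.html hunks above are missing their quotes. A minimal Scala sketch of clustering with spark.mllib's KMeans and computing the Within Set Sum of Squared Errors is below, using the sample data path from the docs.

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Load and parse whitespace-separated numeric data.
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes.
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = clusters.computeCost(parsedData)
println(s"Within Set Sum of Squared Errors = $WSSSE")

// Save and reload the model (target path is illustrative).
clusters.save(sc, "target/KMeansModel")
val sameModel = KMeansModel.load(sc, "target/KMeansModel")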
[13/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-evaluation-metrics.html -- diff --git a/site/docs/2.1.0/mllib-evaluation-metrics.html b/site/docs/2.1.0/mllib-evaluation-metrics.html index 4bc636d..0d5bb3b 100644 --- a/site/docs/2.1.0/mllib-evaluation-metrics.html +++ b/site/docs/2.1.0/mllib-evaluation-metrics.html @@ -307,20 +307,20 @@ - Classification model evaluation - Binary classification - Threshold tuning + Classification model evaluation + Binary classification + Threshold tuning - Multiclass classification - Label based metrics + Multiclass classification + Label based metrics - Multilabel classification - Ranking systems + Multilabel classification + Ranking systems - Regression model evaluation + Regression model evaluation spark.mllib comes with a number of machine learning algorithms that can be used to learn from and make predictions @@ -421,7 +421,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation Refer to the LogisticRegressionWithLBFGS Scala docs and BinaryClassificationMetrics Scala docs for details on the API. -import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS +import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.util.MLUtils @@ -453,13 +453,13 @@ data, and evaluate the performance of the algorithm by several binary evaluation // Precision by threshold val precision = metrics.precisionByThreshold precision.foreach { case (t, p) = - println(sThreshold: $t, Precision: $p) + println(sThreshold: $t, Precision: $p) } // Recall by threshold val recall = metrics.recallByThreshold recall.foreach { case (t, r) = - println(sThreshold: $t, Recall: $r) + println(sThreshold: $t, Recall: $r) } // Precision-Recall Curve @@ -468,13 +468,13 @@ data, and evaluate the performance of the algorithm by several binary evaluation // F-measure val f1Score = metrics.fMeasureByThreshold f1Score.foreach { case (t, f) = - println(sThreshold: $t, F-score: $f, Beta = 1) + println(sThreshold: $t, F-score: $f, Beta = 1) } val beta = 0.5 val fScore = metrics.fMeasureByThreshold(beta) f1Score.foreach { case (t, f) = - println(sThreshold: $t, F-score: $f, Beta = 0.5) + println(sThreshold: $t, F-score: $f, Beta = 0.5) } // AUPRC @@ -498,7 +498,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation Refer to the LogisticRegressionModel Java docs and LogisticRegressionWithLBFGS Java docs for details on the API. -import scala.Tuple2; +import scala.Tuple2; import org.apache.spark.api.java.*; import org.apache.spark.api.java.function.Function; @@ -518,7 +518,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation JavaRDDLabeledPoint test = splits[1]; // Run training algorithm to build the model. -final LogisticRegressionModel model = new LogisticRegressionWithLBFGS() +final LogisticRegressionModel model = new LogisticRegressionWithLBFGS() .setNumClasses(2) .run(training.rdd()); @@ -538,7 +538,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation // Get evaluation metrics. 
BinaryClassificationMetrics metrics = - new BinaryClassificationMetrics(predictionAndLabels.rdd()); + new BinaryClassificationMetrics(predictionAndLabels.rdd()); // Precision by threshold JavaRDDTuple2Object, Object precision = metrics.precisionByThreshold().toJavaRDD(); @@ -564,7 +564,7 @@ data, and evaluate the performance of the algorithm by several binary evaluation new FunctionTuple2Object, Object, Double() { @Override public Double call(Tuple2Object, Object t) { - return new Double(t._1().toString()); + return new Double(t._1().toString()); } } ); @@ -590,34 +590,34 @@ data, and evaluate the performance of the algorithm by several binary evaluation Refer to the BinaryClassificationMetrics Python docs and LogisticRegressionWithLBFGS Python docs for more details on the API. -from pyspark.mllib.classification import LogisticRegressionWithLBFGS +from pyspark.mllib.classification import LogisticRegressionWithLBFGS from pyspark.mllib.evaluation import BinaryClassificationMetrics from pyspark.mllib.regression import LabeledPoint -# Several of the methods available in scala are currently missing from pyspark -# Load training data in LIBSVM format +# Several of the methods available in scala are currently missing from pyspark +#
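[Editor's note] The threshold-based metric calls in the evaluation-metrics hunks above are garbled. Assuming a `predictionAndLabels` RDD of (score, label) pairs from a binary classifier, a cleaned Scala sketch is:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val metrics = new BinaryClassificationMetrics(predictionAndLabels)

// Precision, recall and F-measure at each candidate threshold.
metrics.precisionByThreshold.foreach { case (t, p) => println(s"Threshold: $t, Precision: $p") }
metrics.recallByThreshold.foreach { case (t, r) => println(s"Threshold: $t, Recall: $r") }
metrics.fMeasureByThreshold.foreach { case (t, f) => println(s"Threshold: $t, F-score: $f, Beta = 1") }
metrics.fMeasureByThreshold(0.5).foreach { case (t, f) => println(s"Threshold: $t, F-score: $f, Beta = 0.5") }

// Area under the precision-recall curve and the ROC curve.
println(s"Area under precision-recall curve = ${metrics.areaUnderPR()}")
println(s"Area under ROC = ${metrics.areaUnderROC()}")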
[08/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/quick-start.html -- diff --git a/site/docs/2.1.0/quick-start.html b/site/docs/2.1.0/quick-start.html index 76e67e1..9d5fad7 100644 --- a/site/docs/2.1.0/quick-start.html +++ b/site/docs/2.1.0/quick-start.html @@ -129,14 +129,14 @@ - Interactive Analysis with the Spark Shell - Basics - More on RDD Operations - Caching + Interactive Analysis with the Spark Shell + Basics + More on RDD Operations + Caching - Self-Contained Applications - Where to Go from Here + Self-Contained Applications + Where to Go from Here This tutorial provides a quick introduction to using Spark. We will first introduce the API through Sparks @@ -164,26 +164,26 @@ or Python. Start it by running the following in the Spark directory: Sparks primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Lets make a new RDD from the text of the README file in the Spark source directory: -scala val textFile = sc.textFile(README.md) -textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at console:25 +scala val textFile = sc.textFile(README.md) +textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at console:25 RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Lets start with a few actions: -scala textFile.count() // Number of items in this RDD +scala textFile.count() // Number of items in this RDD res0: Long = 126 // May be different from yours as README.md will change over time, similar to other outputs scala textFile.first() // First item in this RDD -res1: String = # Apache Spark +res1: String = # Apache Spark Now lets use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file. -scala val linesWithSpark = textFile.filter(line = line.contains(Spark)) -linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at console:27 +scala val linesWithSpark = textFile.filter(line = line.contains(Spark)) +linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at console:27 We can chain together transformations and actions: -scala textFile.filter(line = line.contains(Spark)).count() // How many lines contain Spark? -res3: Long = 15 +scala textFile.filter(line = line.contains(Spark)).count() // How many lines contain Spark? +res3: Long = 15 @@ -193,24 +193,24 @@ or Python. Start it by running the following in the Spark directory: Sparks primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Lets make a new RDD from the text of the README file in the Spark source directory: - textFile = sc.textFile(README.md) + textFile = sc.textFile(README.md) RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Lets start with a few actions: - textFile.count() # Number of items in this RDD + textFile.count() # Number of items in this RDD 126 - textFile.first() # First item in this RDD -u# Apache Spark + textFile.first() # First item in this RDD +u# Apache Spark Now lets use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file. 
- linesWithSpark = textFile.filter(lambda line: Spark in line) + linesWithSpark = textFile.filter(lambda line: Spark in line) We can chain together transformations and actions: - textFile.filter(lambda line: Spark in line).count() # How many lines contain Spark? -15 + textFile.filter(lambda line: Spark in line).count() # How many lines contain Spark? +15 @@ -221,38 +221,38 @@ or Python. Start it by running the following in the Spark directory: -scala textFile.map(line = line.split( ).size).reduce((a, b) = if (a b) a else b) -res4: Long = 15 +scala textFile.map(line = line.split( ).size).reduce((a, b) = if (a b) a else b) +res4: Long = 15 This first maps a line to an integer value, creating a new RDD. reduce is called on that RDD to find the largest line count. The arguments to map and reduce are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. Well use Math.max() function to make this code easier to understand: -scala import java.lang.Math +scala
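[Editor's note] The quick-start shell transcripts above lose their quotes and "=>" arrows. The same sequence of RDD actions and transformations, written as a plain Scala sketch (README.md is relative to the Spark directory), is:

val textFile = sc.textFile("README.md")

textFile.count()   // number of lines in this RDD
textFile.first()   // first line, e.g. "# Apache Spark"

// Transformation: keep only the lines mentioning Spark, then count them.
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.count()

// Find the largest number of words on a single line.
textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)

// The same reduction expressed with java.lang.Math.
import java.lang.Math
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))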
[14/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-decision-tree.html -- diff --git a/site/docs/2.1.0/mllib-decision-tree.html b/site/docs/2.1.0/mllib-decision-tree.html index 1a3d865..991610e 100644 --- a/site/docs/2.1.0/mllib-decision-tree.html +++ b/site/docs/2.1.0/mllib-decision-tree.html @@ -307,23 +307,23 @@ - Basic algorithm - Node impurity and information gain - Split candidates - Stopping rule + Basic algorithm + Node impurity and information gain + Split candidates + Stopping rule - Usage tips - Problem specification parameters - Stopping criteria - Tunable parameters - Caching and checkpointing + Usage tips + Problem specification parameters + Stopping criteria + Tunable parameters + Caching and checkpointing - Scaling - Examples - Classification - Regression + Scaling + Examples + Classification + Regression @@ -548,7 +548,7 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a Refer to the DecisionTree Scala docs and DecisionTreeModel Scala docs for details on the API. -import org.apache.spark.mllib.tree.DecisionTree +import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel import org.apache.spark.mllib.util.MLUtils @@ -588,7 +588,7 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a Refer to the DecisionTree Java docs and DecisionTreeModel Java docs for details on the API. -import java.util.HashMap; +import java.util.HashMap; import java.util.Map; import scala.Tuple2; @@ -604,8 +604,8 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a import org.apache.spark.mllib.tree.model.DecisionTreeModel; import org.apache.spark.mllib.util.MLUtils; -SparkConf sparkConf = new SparkConf().setAppName(JavaDecisionTreeClassificationExample); -JavaSparkContext jsc = new JavaSparkContext(sparkConf); +SparkConf sparkConf = new SparkConf().setAppName(JavaDecisionTreeClassificationExample); +JavaSparkContext jsc = new JavaSparkContext(sparkConf); // Load and parse the data file. String datapath = data/mllib/sample_libsvm_data.txt; @@ -657,30 +657,30 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a Refer to the DecisionTree Python docs and DecisionTreeModel Python docs for more details on the API. -from pyspark.mllib.tree import DecisionTree, DecisionTreeModel +from pyspark.mllib.tree import DecisionTree, DecisionTreeModel from pyspark.mllib.util import MLUtils -# Load and parse the data file into an RDD of LabeledPoint. -data = MLUtils.loadLibSVMFile(sc, data/mllib/sample_libsvm_data.txt) -# Split the data into training and test sets (30% held out for testing) +# Load and parse the data file into an RDD of LabeledPoint. +data = MLUtils.loadLibSVMFile(sc, data/mllib/sample_libsvm_data.txt) +# Split the data into training and test sets (30% held out for testing) (trainingData, testData) = data.randomSplit([0.7, 0.3]) -# Train a DecisionTree model. -# Empty categoricalFeaturesInfo indicates all features are continuous. +# Train a DecisionTree model. +# Empty categoricalFeaturesInfo indicates all features are continuous. 
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, - impurity=gini, maxDepth=5, maxBins=32) + impurity=gini, maxDepth=5, maxBins=32) -# Evaluate model on test instances and compute test error +# Evaluate model on test instances and compute test error predictions = model.predict(testData.map(lambda x: x.features)) labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions) testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count()) -print(Test Error = + str(testErr)) -print(Learned classification tree model:) +print(Test Error = + str(testErr)) +print(Learned classification tree model:) print(model.toDebugString()) -# Save and load model -model.save(sc, target/tmp/myDecisionTreeClassificationModel) -sameModel = DecisionTreeModel.load(sc, target/tmp/myDecisionTreeClassificationModel) +# Save and load model +model.save(sc, target/tmp/myDecisionTreeClassificationModel) +sameModel = DecisionTreeModel.load(sc, target/tmp/myDecisionTreeClassificationModel) Find full example code at "examples/src/main/python/mllib/decision_tree_classification_example.py" in the Spark repo. @@ -701,7 +701,7 @@ depth of 5. The Mean Squared Error (MSE) is computed at the end to evaluate Refer to the DecisionTree Scala docs and DecisionTreeModel Scala
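[Editor's note] The decision-tree classification example above is quote-stripped. A cleaned Scala sketch of training and evaluating a DecisionTree classifier on the sample LIBSVM data is:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

// Load data and hold out 30% for testing.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

// Compute the test error.
val labelAndPreds = testData.map(point => (point.label, model.predict(point.features)))
val testErr = labelAndPreds.filter { case (v, p) => v != p }.count().toDouble / testData.count()
println(s"Test Error = $testErr")
println(s"Learned classification tree model:\n${model.toDebugString}")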
[20/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-features.html -- diff --git a/site/docs/2.1.0/ml-features.html b/site/docs/2.1.0/ml-features.html index 64463de..a2f102b 100644 --- a/site/docs/2.1.0/ml-features.html +++ b/site/docs/2.1.0/ml-features.html @@ -318,52 +318,52 @@ Table of Contents - Feature Extractors - TF-IDF - Word2Vec - CountVectorizer + Feature Extractors + TF-IDF + Word2Vec + CountVectorizer - Feature Transformers - Tokenizer - StopWordsRemover - $n$-gram - Binarizer - PCA - PolynomialExpansion - Discrete Cosine Transform (DCT) - StringIndexer - IndexToString - OneHotEncoder - VectorIndexer - Interaction - Normalizer - StandardScaler - MinMaxScaler - MaxAbsScaler - Bucketizer - ElementwiseProduct - SQLTransformer - VectorAssembler - QuantileDiscretizer + Feature Transformers + Tokenizer + StopWordsRemover + $n$-gram + Binarizer + PCA + PolynomialExpansion + Discrete Cosine Transform (DCT) + StringIndexer + IndexToString + OneHotEncoder + VectorIndexer + Interaction + Normalizer + StandardScaler + MinMaxScaler + MaxAbsScaler + Bucketizer + ElementwiseProduct + SQLTransformer + VectorAssembler + QuantileDiscretizer - Feature Selectors - VectorSlicer - RFormula - ChiSqSelector + Feature Selectors + VectorSlicer + RFormula + ChiSqSelector - Locality Sensitive Hashing - LSH Operations - Feature Transformation - Approximate Similarity Join - Approximate Nearest Neighbor Search + Locality Sensitive Hashing + LSH Operations + Feature Transformation + Approximate Similarity Join + Approximate Nearest Neighbor Search - LSH Algorithms - Bucketed Random Projection for Euclidean Distance - MinHash for Jaccard Distance + LSH Algorithms + Bucketed Random Projection for Euclidean Distance + MinHash for Jaccard Distance @@ -395,7 +395,7 @@ TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D). There are several variants on the definition of term frequency and document frequency. In MLlib, we separate TF and IDF to make them flexible. -TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors. +TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors. HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a set of terms might be a bag of words. @@ -437,7 +437,7 @@ when using text as features. Our feature vectors could then be passed to a lear Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API. -import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer} +import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer} val sentenceData = spark.createDataFrame(Seq( (0.0, Hi I heard about Spark), @@ -468,7 +468,7 @@ the IDF Scala doc Refer to the HashingTF Java docs and the IDF Java docs for more details on the API. 
-import java.util.Arrays; +import java.util.Arrays; import java.util.List; import org.apache.spark.ml.feature.HashingTF; @@ -489,17 +489,17 @@ the IDF Scala doc RowFactory.create(0.0, I wish Java could use case classes), RowFactory.create(1.0, Logistic regression models are neat) ); -StructType schema = new StructType(new StructField[]{ - new StructField(label, DataTypes.DoubleType, false, Metadata.empty()), - new StructField(sentence, DataTypes.StringType, false, Metadata.empty()) +StructType schema = new StructType(new StructField[]{ + new StructField(label, DataTypes.DoubleType, false, Metadata.empty()), + new StructField(sentence, DataTypes.StringType, false, Metadata.empty()) }); DatasetRow sentenceData = spark.createDataFrame(data, schema); -Tokenizer tokenizer = new Tokenizer().setInputCol(sentence).setOutputCol(words); +Tokenizer tokenizer = new Tokenizer().setInputCol(sentence).setOutputCol(words); DatasetRow wordsData = tokenizer.transform(sentenceData); int numFeatures = 20; -HashingTF hashingTF = new HashingTF() +HashingTF hashingTF = new HashingTF() .setInputCol(words) .setOutputCol(rawFeatures) .setNumFeatures(numFeatures); @@ -507,7 +507,7 @@ the IDF Scala doc DatasetRow featurizedData = hashingTF.transform(wordsData); // alternatively, CountVectorizer can also be used to get term frequency vectors -IDF idf = new IDF().setInputCol(rawFeatures).setOutputCol(features); +IDF
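[Editor's note] The TF-IDF feature-extraction example above (Scala and Java variants) is heavily quote-stripped. A cleaned Scala sketch using Tokenizer, HashingTF and IDF is below, assuming a SparkSession named `spark` is in scope.

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into words.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Hash the words into fixed-length term-frequency vectors.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)

// Rescale term frequencies by inverse document frequency.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show()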
[22/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-classification-regression.html -- diff --git a/site/docs/2.1.0/ml-classification-regression.html b/site/docs/2.1.0/ml-classification-regression.html index 1e0665b..0b264bb 100644 --- a/site/docs/2.1.0/ml-classification-regression.html +++ b/site/docs/2.1.0/ml-classification-regression.html @@ -329,58 +329,58 @@ discussing specific classes of algorithms, such as linear methods, trees, and en Table of Contents - Classification - Logistic regression - Binomial logistic regression - Multinomial logistic regression + Classification + Logistic regression + Binomial logistic regression + Multinomial logistic regression - Decision tree classifier - Random forest classifier - Gradient-boosted tree classifier - Multilayer perceptron classifier - One-vs-Rest classifier (a.k.a. One-vs-All) - Naive Bayes + Decision tree classifier + Random forest classifier + Gradient-boosted tree classifier + Multilayer perceptron classifier + One-vs-Rest classifier (a.k.a. One-vs-All) + Naive Bayes - Regression - Linear regression - Generalized linear regression - Available families + Regression + Linear regression + Generalized linear regression + Available families - Decision tree regression - Random forest regression - Gradient-boosted tree regression - Survival regression - Isotonic regression - Examples + Decision tree regression + Random forest regression + Gradient-boosted tree regression + Survival regression + Isotonic regression + Examples - Linear methods - Decision trees - Inputs and Outputs - Input Columns - Output Columns + Linear methods + Decision trees + Inputs and Outputs + Input Columns + Output Columns - Tree Ensembles - Random Forests - Inputs and Outputs - Input Columns - Output Columns (Predictions) + Tree Ensembles + Random Forests + Inputs and Outputs + Input Columns + Output Columns (Predictions) - Gradient-Boosted Trees (GBTs) - Inputs and Outputs - Input Columns - Output Columns (Predictions) + Gradient-Boosted Trees (GBTs) + Inputs and Outputs + Input Columns + Output Columns (Predictions) @@ -407,7 +407,7 @@ parameter to select between these two algorithms, or leave it unset and Spark wi Binomial logistic regression -For more background and more details about the implementation of binomial logistic regression, refer to the documentation of logistic regression in spark.mllib. +For more background and more details about the implementation of binomial logistic regression, refer to the documentation of logistic regression in spark.mllib. Example @@ -421,7 +421,7 @@ $\alpha$ and regParam corresponds to $\lambda$. More details on parameters can be found in the Scala API documentation. -import org.apache.spark.ml.classification.LogisticRegression +import org.apache.spark.ml.classification.LogisticRegression // Load training data val training = spark.read.format(libsvm).load(data/mllib/sample_libsvm_data.txt) @@ -435,7 +435,7 @@ $\alpha$ and regParam corresponds to $\lambda$. val lrModel = lr.fit(training) // Print the coefficients and intercept for logistic regression -println(sCoefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}) +println(sCoefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}) // We can also use the multinomial family for binary classification val mlr = new LogisticRegression() @@ -447,8 +447,8 @@ $\alpha$ and regParam corresponds to $\lambda$. 
val mlrModel = mlr.fit(training) // Print the coefficients and intercepts for logistic regression with multinomial family -println(sMultinomial coefficients: ${mlrModel.coefficientMatrix}) -println(sMultinomial intercepts: ${mlrModel.interceptVector}) +println(sMultinomial coefficients: ${mlrModel.coefficientMatrix}) +println(sMultinomial intercepts: ${mlrModel.interceptVector}) Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala" in the Spark repo. @@ -457,7 +457,7 @@ $\alpha$ and regParam corresponds to $\lambda$.
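[Editor's note] The binomial logistic regression example above drops its quotes and string interpolators. A cleaned Scala sketch of fitting an elastic-net regularized LogisticRegression on LIBSVM data, plus the multinomial-family variant, is:

import org.apache.spark.ml.classification.LogisticRegression

val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)        // corresponds to lambda
  .setElasticNetParam(0.8) // corresponds to alpha

val lrModel = lr.fit(training)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// The multinomial family can also be used for binary classification.
val mlr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  .setFamily("multinomial")

val mlrModel = mlr.fit(training)
println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}")
println(s"Multinomial intercepts: ${mlrModel.interceptVector}")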
[05/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/storage-openstack-swift.html -- diff --git a/site/docs/2.1.0/storage-openstack-swift.html b/site/docs/2.1.0/storage-openstack-swift.html index bbb3446..a20c67f 100644 --- a/site/docs/2.1.0/storage-openstack-swift.html +++ b/site/docs/2.1.0/storage-openstack-swift.html @@ -144,7 +144,7 @@ Current Swift driver requires Swift to use Keystone authentication method. The Spark application should include hadoop-openstack dependency. For example, for Maven support, add the following to the pom.xml file: -dependencyManagement +dependencyManagement ... dependency groupIdorg.apache.hadoop/groupId @@ -152,15 +152,15 @@ For example, for Maven support, add the following to the pom.xml fi version2.3.0/version /dependency ... -/dependencyManagement +/dependencyManagement Configuration Parameters Create core-site.xml and place it inside Sparks conf directory. There are two main categories of parameters that should to be configured: declaration of the -Swift driver and the parameters that are required by Keystone. +Swift driver and the parameters that are required by Keystone. -Configuration of Hadoop to use Swift File system achieved via +Configuration of Hadoop to use Swift File system achieved via Property NameValue @@ -221,7 +221,7 @@ contains a list of Keystone mandatory parameters. PROVIDER can be a For example, assume PROVIDER=SparkTest and Keystone contains user tester with password testing defined for tenant test. Then core-site.xml should include: -configuration +configuration property namefs.swift.impl/name valueorg.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem/value @@ -257,7 +257,7 @@ defined for tenant test. Then core-site.xml should inc namefs.swift.service.SparkTest.password/name valuetesting/value /property -/configuration +/configuration Notice that fs.swift.service.PROVIDER.tenant, http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/streaming-custom-receivers.html -- diff --git a/site/docs/2.1.0/streaming-custom-receivers.html b/site/docs/2.1.0/streaming-custom-receivers.html index d31647d..846c797 100644 --- a/site/docs/2.1.0/streaming-custom-receivers.html +++ b/site/docs/2.1.0/streaming-custom-receivers.html @@ -171,7 +171,7 @@ has any error connecting or receiving, the receiver is restarted to make another -class CustomReceiver(host: String, port: Int) +class CustomReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging { def onStart() { @@ -216,12 +216,12 @@ has any error connecting or receiving, the receiver is restarted to make another restart(Error receiving data, t) } } -} +} -public class JavaCustomReceiver extends ReceiverString { +public class JavaCustomReceiver extends ReceiverString { String host = null; int port = -1; @@ -234,7 +234,7 @@ has any error connecting or receiving, the receiver is restarted to make another public void onStart() { // Start the thread that receives data over a connection -new Thread() { +new Thread() { @Override public void run() { receive(); } @@ -253,10 +253,10 @@ has any error connecting or receiving, the receiver is restarted to make another try { // connect to the server - socket = new Socket(host, port); + socket = new Socket(host, port); - BufferedReader reader = new BufferedReader( -new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8)); + BufferedReader reader = new BufferedReader( +new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8)); // 
Until stopped or connection broken continue reading while (!isStopped() (userInput = reader.readLine()) != null) { @@ -276,7 +276,7 @@ has any error connecting or receiving, the receiver is restarted to make another restart(Error receiving data, t); } } -} +} @@ -290,20 +290,20 @@ an input DStream using data received by the instance of custom receiver, as show -// Assuming ssc is the StreamingContext +// Assuming ssc is the StreamingContext val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port)) val words = lines.flatMap(_.split( )) -... +... The full source code is in the example https://github.com/apache/spark/blob/v2.1.0/examples/src/main/scala/org/apache/spark/examples/streaming/CustomReceiver.scala;>CustomReceiver.scala. -// Assuming ssc is the JavaStreamingContext -JavaDStreamString customReceiverStream = ssc.receiverStream(new JavaCustomReceiver(host, port)); +// Assuming ssc is the JavaStreamingContext
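[Editor's note] The custom-receiver hunks above lose their generics and quotes. A minimal Scala sketch of a Receiver subclass and how it plugs into a StreamingContext is below; the host/port and the line-per-record socket protocol follow the guide's example, and the Logging mix-in from the original is omitted so the class compiles outside the Spark source tree.

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a thread that receives data over a connection.
    new Thread("Socket Receiver") {
      override def run() { receive() }
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do: receive() stops by itself once isStopped returns true.
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var userInput = reader.readLine()
      // Until stopped or the connection breaks, keep reading and storing lines.
      while (!isStopped && userInput != null) {
        store(userInput)
        userInput = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}

// Usage, assuming ssc is an existing StreamingContext:
// val lines = ssc.receiverStream(new CustomReceiver("localhost", 9999))
// val words = lines.flatMap(_.split(" "))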
[21/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-clustering.html -- diff --git a/site/docs/2.1.0/ml-clustering.html b/site/docs/2.1.0/ml-clustering.html index e225281..df38605 100644 --- a/site/docs/2.1.0/ml-clustering.html +++ b/site/docs/2.1.0/ml-clustering.html @@ -313,21 +313,21 @@ about these algorithms. Table of Contents - K-means - Input Columns - Output Columns - Example + K-means + Input Columns + Output Columns + Example - Latent Dirichlet allocation (LDA) - Bisecting k-means - Example + Latent Dirichlet allocation (LDA) + Bisecting k-means + Example - Gaussian Mixture Model (GMM) - Input Columns - Output Columns - Example + Gaussian Mixture Model (GMM) + Input Columns + Output Columns + Example @@ -391,7 +391,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea Refer to the Scala API docs for more details. -import org.apache.spark.ml.clustering.KMeans +import org.apache.spark.ml.clustering.KMeans // Loads data. val dataset = spark.read.format(libsvm).load(data/mllib/sample_kmeans_data.txt) @@ -402,7 +402,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea // Evaluate clustering by computing Within Set Sum of Squared Errors. val WSSSE = model.computeCost(dataset) -println(sWithin Set Sum of Squared Errors = $WSSSE) +println(sWithin Set Sum of Squared Errors = $WSSSE) // Shows the result. println(Cluster Centers: ) @@ -414,7 +414,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea Refer to the Java API docs for more details. -import org.apache.spark.ml.clustering.KMeansModel; +import org.apache.spark.ml.clustering.KMeansModel; import org.apache.spark.ml.clustering.KMeans; import org.apache.spark.ml.linalg.Vector; import org.apache.spark.sql.Dataset; @@ -424,7 +424,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea DatasetRow dataset = spark.read().format(libsvm).load(data/mllib/sample_kmeans_data.txt); // Trains a k-means model. -KMeans kmeans = new KMeans().setK(2).setSeed(1L); +KMeans kmeans = new KMeans().setK(2).setSeed(1L); KMeansModel model = kmeans.fit(dataset); // Evaluate clustering by computing Within Set Sum of Squared Errors. @@ -434,7 +434,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea // Shows the result. Vector[] centers = model.clusterCenters(); System.out.println(Cluster Centers: ); -for (Vector center: centers) { +for (Vector center: centers) { System.out.println(center); } @@ -444,22 +444,22 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea Refer to the Python API docs for more details. -from pyspark.ml.clustering import KMeans +from pyspark.ml.clustering import KMeans -# Loads data. -dataset = spark.read.format(libsvm).load(data/mllib/sample_kmeans_data.txt) +# Loads data. +dataset = spark.read.format(libsvm).load(data/mllib/sample_kmeans_data.txt) -# Trains a k-means model. +# Trains a k-means model. kmeans = KMeans().setK(2).setSeed(1) model = kmeans.fit(dataset) -# Evaluate clustering by computing Within Set Sum of Squared Errors. +# Evaluate clustering by computing Within Set Sum of Squared Errors. wssse = model.computeCost(dataset) -print(Within Set Sum of Squared Errors = + str(wssse)) +print(Within Set Sum of Squared Errors = + str(wssse)) -# Shows the result. +# Shows the result. 
centers = model.clusterCenters() -print(Cluster Centers: ) +print(Cluster Centers: ) for center in centers: print(center) @@ -470,7 +470,7 @@ called http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf;>kmea Refer to the R API docs for more details. -# Fit a k-means model with spark.kmeans +# Fit a k-means model with spark.kmeans irisDF - suppressWarnings(createDataFrame(iris)) kmeansDF - irisDF kmeansTestDF - irisDF @@ -504,7 +504,7 @@ and generates a LDAModel as the base model. Expert users may cast a Refer to the Scala API docs for more details. -import org.apache.spark.ml.clustering.LDA +import org.apache.spark.ml.clustering.LDA // Loads data. val dataset = spark.read.format(libsvm) @@ -516,8 +516,8 @@ and generates a LDAModel as the base model. Expert users may cast a val ll = model.logLikelihood(dataset) val lp = model.logPerplexity(dataset) -println(sThe lower bound on the log likelihood of the entire corpus: $ll) -println(sThe upper bound bound on perplexity: $lp) +println(sThe lower bound on the log likelihood of the entire corpus: $ll) +println(sThe upper bound bound on perplexity: $lp) // Describe topics. val topics = model.describeTopics(3) @@ -535,7 +535,7 @@ and generates a LDAModel as the base
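[Editor's note] The spark.ml k-means and LDA snippets above are similarly quote-stripped. A cleaned Scala sketch of the DataFrame-based KMeans as documented for 2.1 (where computeCost was the documented way to get WSSSE) is:

import org.apache.spark.ml.clustering.KMeans

// Load data stored in LIBSVM format.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Train a k-means model with two clusters.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)

// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $WSSSE")

// Show the resulting cluster centers.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)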
[02/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/structured-streaming-programming-guide.html -- diff --git a/site/docs/2.1.0/structured-streaming-programming-guide.html b/site/docs/2.1.0/structured-streaming-programming-guide.html index e54c101..3a1ac5f 100644 --- a/site/docs/2.1.0/structured-streaming-programming-guide.html +++ b/site/docs/2.1.0/structured-streaming-programming-guide.html @@ -127,45 +127,50 @@ - Overview - Quick Example - Programming Model - Basic Concepts - Handling Event-time and Late Data - Fault Tolerance Semantics + Overview + Quick Example + Programming Model + Basic Concepts + Handling Event-time and Late Data + Fault Tolerance Semantics - API using Datasets and DataFrames - Creating streaming DataFrames and streaming Datasets - Data Sources - Schema inference and partition of streaming DataFrames/Datasets + API using Datasets and DataFrames + Creating streaming DataFrames and streaming Datasets + Data Sources + Schema inference and partition of streaming DataFrames/Datasets - Operations on streaming DataFrames/Datasets - Basic Operations - Selection, Projection, Aggregation - Window Operations on Event Time - Join Operations - Unsupported Operations + Operations on streaming DataFrames/Datasets + Basic Operations - Selection, Projection, Aggregation + Window Operations on Event Time + Handling Late Data and Watermarking + Join Operations + Unsupported Operations - Starting Streaming Queries - Output Modes - Output Sinks - Using Foreach + Starting Streaming Queries + Output Modes + Output Sinks + Using Foreach - Managing Streaming Queries - Monitoring Streaming Queries - Recovering from Failures with Checkpointing + Managing Streaming Queries + Monitoring Streaming Queries + Interactive APIs + Asynchronous API + + + Recovering from Failures with Checkpointing - Where to go from here + Where to go from here Overview Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data.The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java or Python to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming. -Spark 2.0 is the ALPHA RELEASE of Structured Streaming and the APIs are still experimental. In this guide, we are going to walk you through the programming model and the APIs. First, lets start with a simple example - a streaming word count. +Structured Streaming is still ALPHA in Spark 2.1 and the APIs are still experimental. In this guide, we are going to walk you through the programming model and the APIs. First, lets start with a simple example - a streaming word count. Quick Example Letâs say you want to maintain a running word count of text data received from a data server listening on a TCP socket. Letâs see how you can express this using Structured Streaming. 
You can see the full code in @@ -175,7 +180,7 @@ And if you http://spark.apache.org/downloads.html;>download Spark, -import org.apache.spark.sql.functions._ +import org.apache.spark.sql.functions._ import org.apache.spark.sql.SparkSession val spark = SparkSession @@ -183,12 +188,12 @@ And if you http://spark.apache.org/downloads.html;>download Spark, .appName(StructuredNetworkWordCount) .getOrCreate() -import spark.implicits._ +import spark.implicits._ -import org.apache.spark.api.java.function.FlatMapFunction; +import org.apache.spark.api.java.function.FlatMapFunction; import org.apache.spark.sql.*; import org.apache.spark.sql.streaming.StreamingQuery; @@ -198,19 +203,19 @@ And if you http://spark.apache.org/downloads.html;>download Spark, SparkSession spark = SparkSession .builder() .appName(JavaStructuredNetworkWordCount) - .getOrCreate();
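For reference, the quick example sketched above amounts to the following minimal Scala program. This is only a sketch: the host and port values are placeholders, and it assumes the Spark 2.1 structured streaming API described in the guide.

```scala
import org.apache.spark.sql.SparkSession

object StructuredNetworkWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredNetworkWordCount")
      .getOrCreate()
    import spark.implicits._

    // Read lines streamed from a TCP socket (host/port are placeholders).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split the lines into words and keep a running count per word.
    val words = lines.as[String].flatMap(_.split(" "))
    val wordCounts = words.groupBy("value").count()

    // Print the complete set of running counts to the console after each trigger.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```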
[10/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-pmml-model-export.html -- diff --git a/site/docs/2.1.0/mllib-pmml-model-export.html b/site/docs/2.1.0/mllib-pmml-model-export.html index 30815e0..3f2fd91 100644 --- a/site/docs/2.1.0/mllib-pmml-model-export.html +++ b/site/docs/2.1.0/mllib-pmml-model-export.html @@ -307,8 +307,8 @@ - spark.mllib supported models - Examples + spark.mllib supported models + Examples spark.mllib supported models @@ -353,32 +353,31 @@ Refer to the KMeans Scala docs and Vectors Scala docs for details on the API. -Here a complete example of building a KMeansModel and print it out in PMML format: -import org.apache.spark.mllib.clustering.KMeans -import org.apache.spark.mllib.linalg.Vectors +Here a complete example of building a KMeansModel and print it out in PMML format: +div class="highlight"preimport org.apache.spark.mllib.clustering.KMeans +import org.apache.spark.mllib.linalg.Vectors -// Load and parse the data +// Load and parse the data val data = sc.textFile(data/mllib/kmeans_data.txt) -val parsedData = data.map(s = Vectors.dense(s.split( ).map(_.toDouble))).cache() +val parsedData = data.map(s = Vectors.dense(s.split( ).map(_.toDouble))).cache() -// Cluster the data into two classes using KMeans +// Cluster the data into two classes using KMeans val numClusters = 2 val numIterations = 20 -val clusters = KMeans.train(parsedData, numClusters, numIterations) +val clusters = KMeans.train(parsedData, numClusters, numIterations) -// Export to PMML to a String in PMML format -println(PMML Model:\n + clusters.toPMML) +// Export to PMML to a String in PMML format +println(PMML Model:\n + clusters.toPMML) -// Export the model to a local file in PMML format -clusters.toPMML(/tmp/kmeans.xml) +// Export the model to a local file in PMML format +clusters.toPMML(/tmp/kmeans.xml) -// Export the model to a directory on a distributed file system in PMML format -clusters.toPMML(sc, /tmp/kmeans) +// Export the model to a directory on a distributed file system in PMML format +clusters.toPMML(sc, /tmp/kmeans) -// Export the model to the OutputStream in PMML format +// Export the model to the OutputStream in PMML format clusters.toPMML(System.out) - -Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala" in the Spark repo. +/pre/divdivFind full example code at examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala in the Spark repo./div For unsupported models, either you will not find a .toPMML method or an IllegalArgumentException will be thrown. http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-statistics.html -- diff --git a/site/docs/2.1.0/mllib-statistics.html b/site/docs/2.1.0/mllib-statistics.html index 4485ecf..f04924c 100644 --- a/site/docs/2.1.0/mllib-statistics.html +++ b/site/docs/2.1.0/mllib-statistics.html @@ -358,15 +358,15 @@ - Summary statistics - Correlations - Stratified sampling - Hypothesis testing - Streaming Significance Testing + Summary statistics + Correlations + Stratified sampling + Hypothesis testing + Streaming Significance Testing - Random data generation - Kernel density estimation + Random data generation + Kernel density estimation \[ @@ -401,7 +401,7 @@ total count. Refer to the MultivariateStatisticalSummary Scala docs for details on the API. 
-import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics} val observations = sc.parallelize( @@ -430,7 +430,7 @@ total count. Refer to the MultivariateStatisticalSummary Java docs for details on the API. -import java.util.Arrays; +import java.util.Arrays; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.mllib.linalg.Vector; @@ -463,19 +463,19 @@ total count. Refer to the MultivariateStatisticalSummary Python docs for more details on the API. -import numpy as np +import numpy as np from pyspark.mllib.stat import Statistics mat = sc.parallelize( [np.array([1.0, 10.0, 100.0]), np.array([2.0, 20.0, 200.0]), np.array([3.0, 30.0, 300.0])] -) # an RDD of Vectors +) # an RDD of Vectors -# Compute column summary statistics. +# Compute column summary statistics. summary = Statistics.colStats(mat) -print(summary.mean()) # a dense vector containing the mean value for each column -print(summary.variance()) # column-wise variance -print(summary.numNonzeros()) # number of nonzeros in
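The KMeans-to-PMML example quoted in the diff above, restated as a clean Scala snippet. Paths are the ones used in the docs, and it assumes an existing SparkContext `sc`, as in spark-shell.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data (path is illustrative).
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans.
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Export the model as a PMML string.
println("PMML Model:\n" + clusters.toPMML)

// Export the model to a local file in PMML format.
clusters.toPMML("/tmp/kmeans.xml")

// Export the model to a directory on a distributed file system in PMML format.
clusters.toPMML(sc, "/tmp/kmeans")

// Export the model to an OutputStream in PMML format.
clusters.toPMML(System.out)
```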
[09/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/programming-guide.html -- diff --git a/site/docs/2.1.0/programming-guide.html b/site/docs/2.1.0/programming-guide.html index 12458af..0e06e86 100644 --- a/site/docs/2.1.0/programming-guide.html +++ b/site/docs/2.1.0/programming-guide.html @@ -129,50 +129,50 @@ - Overview - Linking with Spark - Initializing Spark - Using the Shell + Overview + Linking with Spark + Initializing Spark + Using the Shell - Resilient Distributed Datasets (RDDs) - Parallelized Collections - External Datasets - RDD Operations - Basics - Passing Functions to Spark - Understanding closures - Example - Local vs. cluster modes - Printing elements of an RDD + Resilient Distributed Datasets (RDDs) + Parallelized Collections + External Datasets + RDD Operations + Basics + Passing Functions to Spark + Understanding closures + Example + Local vs. cluster modes + Printing elements of an RDD - Working with Key-Value Pairs - Transformations - Actions - Shuffle operations - Background - Performance Impact + Working with Key-Value Pairs + Transformations + Actions + Shuffle operations + Background + Performance Impact - RDD Persistence - Which Storage Level to Choose? - Removing Data + RDD Persistence + Which Storage Level to Choose? + Removing Data - Shared Variables - Broadcast Variables - Accumulators + Shared Variables + Broadcast Variables + Accumulators - Deploying to a Cluster - Launching Spark jobs from Java / Scala - Unit Testing - Where to Go from Here + Deploying to a Cluster + Launching Spark jobs from Java / Scala + Unit Testing + Where to Go from Here Overview @@ -212,8 +212,8 @@ version = your-hdfs-version Finally, you need to import some Spark classes into your program. Add the following lines: -import org.apache.spark.SparkContext -import org.apache.spark.SparkConf +import org.apache.spark.SparkContext +import org.apache.spark.SparkConf (Before Spark 1.3.0, you need to explicitly import org.apache.spark.SparkContext._ to enable essential implicit conversions.) @@ -245,9 +245,9 @@ version = your-hdfs-version Finally, you need to import some Spark classes into your program. Add the following lines: -import org.apache.spark.api.java.JavaSparkContext +import org.apache.spark.api.java.JavaSparkContext import org.apache.spark.api.java.JavaRDD -import org.apache.spark.SparkConf +import org.apache.spark.SparkConf @@ -269,13 +269,13 @@ for common HDFS versions. Finally, you need to import some Spark classes into your program. Add the following line: -from pyspark import SparkContext, SparkConf +from pyspark import SparkContext, SparkConf PySpark requires the same minor version of Python in both driver and workers. It uses the default python version in PATH, you can specify which version of Python you want to use by PYSPARK_PYTHON, for example: -$ PYSPARK_PYTHON=python3.4 bin/pyspark -$ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/python/pi.py +$ PYSPARK_PYTHON=python3.4 bin/pyspark +$ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/python/pi.py @@ -293,8 +293,8 @@ that contains information about your application. Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one. -val conf = new SparkConf().setAppName(appName).setMaster(master) -new SparkContext(conf) +val conf = new SparkConf().setAppName(appName).setMaster(master) +new SparkContext(conf) @@ -304,8 +304,8 @@ that contains information about your application. 
how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application. -SparkConf conf = new SparkConf().setAppName(appName).setMaster(master); -JavaSparkContext sc = new JavaSparkContext(conf); +SparkConf conf = new SparkConf().setAppName(appName).setMaster(master); +JavaSparkContext sc = new JavaSparkContext(conf); @@ -315,8 +315,8 @@ that contains information about your application. how to access a cluster. To create a SparkContext you first need to build a SparkConf object that
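Putting the linking and initialization steps from this excerpt together, a minimal Scala sketch; the application name and master URL are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// appName shows up in the cluster UI; master is e.g. "local[2]" or a cluster URL.
val conf = new SparkConf()
  .setAppName("MyApp")      // placeholder application name
  .setMaster("local[2]")    // placeholder master; usually supplied by spark-submit instead
val sc = new SparkContext(conf)

// Parallelize a local collection into an RDD and run a simple action on it.
val distData = sc.parallelize(1 to 1000)
println(distData.reduce(_ + _))

sc.stop()
```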
[07/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/sparkr.html -- diff --git a/site/docs/2.1.0/sparkr.html b/site/docs/2.1.0/sparkr.html index 0a1a347..e861a01 100644 --- a/site/docs/2.1.0/sparkr.html +++ b/site/docs/2.1.0/sparkr.html @@ -127,53 +127,53 @@ - Overview - SparkDataFrame - Starting Up: SparkSession - Starting Up from RStudio - Creating SparkDataFrames - From local data frames - From Data Sources - From Hive tables + Overview + SparkDataFrame + Starting Up: SparkSession + Starting Up from RStudio + Creating SparkDataFrames + From local data frames + From Data Sources + From Hive tables - SparkDataFrame Operations - Selecting rows, columns - Grouping, Aggregation - Operating on Columns - Applying User-Defined Function - Run a given function on a large dataset using dapply or dapplyCollect - dapply - dapplyCollect + SparkDataFrame Operations + Selecting rows, columns + Grouping, Aggregation + Operating on Columns + Applying User-Defined Function + Run a given function on a large dataset using dapply or dapplyCollect + dapply + dapplyCollect - Run a given function on a large dataset grouping by input column(s) and using gapply or gapplyCollect - gapply - gapplyCollect + Run a given function on a large dataset grouping by input column(s) and using gapply or gapplyCollect + gapply + gapplyCollect - Data type mapping between R and Spark - Run local R functions distributed using spark.lapply - spark.lapply + Data type mapping between R and Spark + Run local R functions distributed using spark.lapply + spark.lapply - Running SQL Queries from SparkR + Running SQL Queries from SparkR - Machine Learning - Algorithms - Model persistence + Machine Learning + Algorithms + Model persistence - R Function Name Conflicts - Migration Guide - Upgrading From SparkR 1.5.x to 1.6.x - Upgrading From SparkR 1.6.x to 2.0 - Upgrading to SparkR 2.1.0 + R Function Name Conflicts + Migration Guide + Upgrading From SparkR 1.5.x to 1.6.x + Upgrading From SparkR 1.6.x to 2.0 + Upgrading to SparkR 2.1.0 @@ -202,7 +202,7 @@ You can create a SparkSession using sparkR.session and -sparkR.session() +sparkR.session() @@ -223,11 +223,11 @@ them, pass them as you would other configuration properties in the sparkCo -if (nchar(Sys.getenv(SPARK_HOME)) 1) { +if (nchar(Sys.getenv(SPARK_HOME)) 1) { Sys.setenv(SPARK_HOME = /home/spark) } library(SparkR, lib.loc = c(file.path(Sys.getenv(SPARK_HOME), R, lib))) -sparkR.session(master = local[*], sparkConfig = list(spark.driver.memory = 2g)) +sparkR.session(master = local[*], sparkConfig = list(spark.driver.memory = 2g)) @@ -282,14 +282,14 @@ sparkR.session(master = - df - as.DataFrame(faithful) + df - as.DataFrame(faithful) # Displays the first part of the SparkDataFrame head(df) ## eruptions waiting ##1 3.600 79 ##2 1.800 54 -##3 3.333 74 +##3 3.333 74 @@ -303,7 +303,7 @@ specifying --packages with spark-submit or spark - sparkR.session(sparkPackages = com.databricks:spark-avro_2.11:3.0.0) + sparkR.session(sparkPackages = com.databricks:spark-avro_2.11:3.0.0) @@ -311,7 +311,7 @@ specifying --packages with spark-submit or spark - people - read.df(./examples/src/main/resources/people.json, json) + people - read.df(./examples/src/main/resources/people.json, json) head(people) ## agename ##1 NA Michael @@ -325,7 +325,7 @@ printSchema(people) # |-- name: string (nullable = true) # Similarly, multiple files can be read with read.json -people - read.json(c(./examples/src/main/resources/people.json, ./examples/src/main/resources/people2.json)) +people - 
read.json(c(./examples/src/main/resources/people.json, ./examples/src/main/resources/people2.json)) @@ -333,7 +333,7 @@ people - read.json( - df - read.df(csvPath, csv, header = true, inferSchema = true, na.strings = NA) + df - read.df(csvPath, csv, header
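For comparison with the SparkR `read.df`/`read.json` calls quoted above, a rough Scala sketch of the same data-source reads; paths and options are illustrative, and it assumes an existing SparkSession `spark`.

```scala
// Read a JSON file into a DataFrame (path is illustrative).
val people = spark.read.json("examples/src/main/resources/people.json")
people.printSchema()

// Read a CSV file, treating the first line as a header, inferring column types,
// and interpreting the string "NA" as null (mirrors na.strings = "NA" in SparkR).
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("nullValue", "NA")
  .csv("path/to/file.csv")   // illustrative path
df.show(5)
```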
[18/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/ml-tuning.html -- diff --git a/site/docs/2.1.0/ml-tuning.html b/site/docs/2.1.0/ml-tuning.html index 0c36a98..2246cc2 100644 --- a/site/docs/2.1.0/ml-tuning.html +++ b/site/docs/2.1.0/ml-tuning.html @@ -329,13 +329,13 @@ Built-in Cross-Validation and other tooling allow users to optimize hyperparamet Table of contents - Model selection (a.k.a. hyperparameter tuning) - Cross-Validation - Example: model selection via cross-validation + Model selection (a.k.a. hyperparameter tuning) + Cross-Validation + Example: model selection via cross-validation - Train-Validation Split - Example: model selection via train validation split + Train-Validation Split + Example: model selection via train validation split @@ -396,7 +396,7 @@ However, it is also a well-established method for choosing parameters which is m Refer to the [`CrossValidator` Scala docs](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) for details on the API. -import org.apache.spark.ml.Pipeline +import org.apache.spark.ml.Pipeline import org.apache.spark.ml.classification.LogisticRegression import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator import org.apache.spark.ml.feature.{HashingTF, Tokenizer} @@ -467,7 +467,7 @@ Refer to the [`CrossValidator` Scala docs](api/scala/index.html#org.apache.spark .select(id, text, probability, prediction) .collect() .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) = -println(s($id, $text) -- prob=$prob, prediction=$prediction) +println(s($id, $text) -- prob=$prob, prediction=$prediction) } Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala" in the Spark repo. @@ -476,7 +476,7 @@ Refer to the [`CrossValidator` Scala docs](api/scala/index.html#org.apache.spark Refer to the [`CrossValidator` Java docs](api/java/org/apache/spark/ml/tuning/CrossValidator.html) for details on the API. -import java.util.Arrays; +import java.util.Arrays; import org.apache.spark.ml.Pipeline; import org.apache.spark.ml.PipelineStage; @@ -493,38 +493,38 @@ Refer to the [`CrossValidator` Java docs](api/java/org/apache/spark/ml/tuning/Cr // Prepare training documents, which are labeled. 
DatasetRow training = spark.createDataFrame(Arrays.asList( - new JavaLabeledDocument(0L, a b c d e spark, 1.0), - new JavaLabeledDocument(1L, b d, 0.0), - new JavaLabeledDocument(2L,spark f g h, 1.0), - new JavaLabeledDocument(3L, hadoop mapreduce, 0.0), - new JavaLabeledDocument(4L, b spark who, 1.0), - new JavaLabeledDocument(5L, g d a y, 0.0), - new JavaLabeledDocument(6L, spark fly, 1.0), - new JavaLabeledDocument(7L, was mapreduce, 0.0), - new JavaLabeledDocument(8L, e spark program, 1.0), - new JavaLabeledDocument(9L, a e c l, 0.0), - new JavaLabeledDocument(10L, spark compile, 1.0), - new JavaLabeledDocument(11L, hadoop software, 0.0) + new JavaLabeledDocument(0L, a b c d e spark, 1.0), + new JavaLabeledDocument(1L, b d, 0.0), + new JavaLabeledDocument(2L,spark f g h, 1.0), + new JavaLabeledDocument(3L, hadoop mapreduce, 0.0), + new JavaLabeledDocument(4L, b spark who, 1.0), + new JavaLabeledDocument(5L, g d a y, 0.0), + new JavaLabeledDocument(6L, spark fly, 1.0), + new JavaLabeledDocument(7L, was mapreduce, 0.0), + new JavaLabeledDocument(8L, e spark program, 1.0), + new JavaLabeledDocument(9L, a e c l, 0.0), + new JavaLabeledDocument(10L, spark compile, 1.0), + new JavaLabeledDocument(11L, hadoop software, 0.0) ), JavaLabeledDocument.class); // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr. -Tokenizer tokenizer = new Tokenizer() +Tokenizer tokenizer = new Tokenizer() .setInputCol(text) .setOutputCol(words); -HashingTF hashingTF = new HashingTF() +HashingTF hashingTF = new HashingTF() .setNumFeatures(1000) .setInputCol(tokenizer.getOutputCol()) .setOutputCol(features); -LogisticRegression lr = new LogisticRegression() +LogisticRegression lr = new LogisticRegression() .setMaxIter(10) .setRegParam(0.01); -Pipeline pipeline = new Pipeline() +Pipeline pipeline = new Pipeline() .setStages(new PipelineStage[] {tokenizer, hashingTF, lr}); // We use a ParamGridBuilder to construct a grid of parameters to search over. // With 3 values for hashingTF.numFeatures and 2 values for lr.regParam, // this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from. -ParamMap[] paramGrid = new ParamGridBuilder() +ParamMap[] paramGrid = new ParamGridBuilder() .addGrid(hashingTF.numFeatures(), new int[] {10, 100, 1000}) .addGrid(lr.regParam(), new double[] {0.1, 0.01}) .build(); @@ -534,9 +534,9 @@ Refer to the [`CrossValidator`
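A condensed Scala sketch of the cross-validation flow that this excerpt walks through. The toy documents and grid values are illustrative, and it assumes an existing SparkSession `spark`.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Toy labeled documents (id, text, label); real input would be much larger.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Pipeline: tokenizer -> hashingTF -> logistic regression.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// 3 values for numFeatures x 2 values for regParam = 6 settings to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // use 3+ folds in practice; 2 keeps this toy run cheap

val cvModel = cv.fit(training)
```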
[04/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/streaming-programming-guide.html -- diff --git a/site/docs/2.1.0/streaming-programming-guide.html b/site/docs/2.1.0/streaming-programming-guide.html index 9a87d23..b1ce1e1 100644 --- a/site/docs/2.1.0/streaming-programming-guide.html +++ b/site/docs/2.1.0/streaming-programming-guide.html @@ -129,32 +129,32 @@ - Overview - A Quick Example - Basic Concepts - Linking - Initializing StreamingContext - Discretized Streams (DStreams) - Input DStreams and Receivers - Transformations on DStreams - Output Operations on DStreams - DataFrame and SQL Operations - MLlib Operations - Caching / Persistence - Checkpointing - Accumulators, Broadcast Variables, and Checkpoints - Deploying Applications - Monitoring Applications + Overview + A Quick Example + Basic Concepts + Linking + Initializing StreamingContext + Discretized Streams (DStreams) + Input DStreams and Receivers + Transformations on DStreams + Output Operations on DStreams + DataFrame and SQL Operations + MLlib Operations + Caching / Persistence + Checkpointing + Accumulators, Broadcast Variables, and Checkpoints + Deploying Applications + Monitoring Applications - Performance Tuning - Reducing the Batch Processing Times - Setting the Right Batch Interval - Memory Tuning + Performance Tuning + Reducing the Batch Processing Times + Setting the Right Batch Interval + Memory Tuning - Fault-tolerance Semantics - Where to Go from Here + Fault-tolerance Semantics + Where to Go from Here Overview @@ -209,7 +209,7 @@ conversions from StreamingContext into our environment in order to add useful me other classes we need (like DStream). StreamingContext is the main entry point for all streaming functionality. We create a local StreamingContext with two execution threads, and a batch interval of 1 second. -import org.apache.spark._ +import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3 @@ -217,33 +217,33 @@ main entry point for all streaming functionality. We create a local StreamingCon // The master requires 2 cores to prevent from a starvation scenario. val conf = new SparkConf().setMaster(local[2]).setAppName(NetworkWordCount) -val ssc = new StreamingContext(conf, Seconds(1)) +val ssc = new StreamingContext(conf, Seconds(1)) Using this context, we can create a DStream that represents streaming data from a TCP source, specified as hostname (e.g. localhost) and port (e.g. ). -// Create a DStream that will connect to hostname:port, like localhost: -val lines = ssc.socketTextStream(localhost, ) +// Create a DStream that will connect to hostname:port, like localhost: +val lines = ssc.socketTextStream(localhost, ) This lines DStream represents the stream of data that will be received from the data server. Each record in this DStream is a line of text. Next, we want to split the lines by space characters into words. -// Split each line into words -val words = lines.flatMap(_.split( )) +// Split each line into words +val words = lines.flatMap(_.split( )) flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new records from each record in the source DStream. In this case, each line will be split into multiple words and the stream of words is represented as the words DStream. Next, we want to count these words. 
-import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3 +import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3 // Count each word in each batch val pairs = words.map(word = (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) // Print the first ten elements of each RDD generated in this DStream to the console -wordCounts.print() +wordCounts.print() The words DStream is further mapped (one-to-one transformation) to a DStream of (word, 1) pairs, which is then reduced to get the frequency of words in each batch of data. @@ -253,8 +253,8 @@ Finally, wordCounts.print() will print a few of the counts generate will perform when it is started, and no real processing has started yet. To start the processing after all the transformations have been setup, we finally call -ssc.start() // Start the computation -ssc.awaitTermination() // Wait for the computation to terminate +ssc.start() // Start the computation +ssc.awaitTermination() // Wait for the computation to terminate The
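Assembled from the fragments above, a minimal sketch of the DStream word count; the port number is a placeholder, since it is elided in the excerpt.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Local StreamingContext with two worker threads and a batch interval of 1 second.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

// DStream of text lines received from a TCP source (host and port are placeholders).
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words, then count each word in each batch.
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream to the console.
wordCounts.print()

ssc.start()             // start the computation
ssc.awaitTermination()  // wait for the computation to terminate
```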
[15/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-data-types.html -- diff --git a/site/docs/2.1.0/mllib-data-types.html b/site/docs/2.1.0/mllib-data-types.html index 546d921..f7b5358 100644 --- a/site/docs/2.1.0/mllib-data-types.html +++ b/site/docs/2.1.0/mllib-data-types.html @@ -307,14 +307,14 @@ - Local vector - Labeled point - Local matrix - Distributed matrix - RowMatrix - IndexedRowMatrix - CoordinateMatrix - BlockMatrix + Local vector + Labeled point + Local matrix + Distributed matrix + RowMatrix + IndexedRowMatrix + CoordinateMatrix + BlockMatrix @@ -347,14 +347,14 @@ using the factory methods implemented in Refer to the Vector Scala docs and Vectors Scala docs for details on the API. -import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.mllib.linalg.{Vector, Vectors} // Create a dense vector (1.0, 0.0, 3.0). val dv: Vector = Vectors.dense(1.0, 0.0, 3.0) // Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries. val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)) // Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries. -val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0))) +val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0))) Note: Scala imports scala.collection.immutable.Vector by default, so you have to import @@ -373,13 +373,13 @@ using the factory methods implemented in Refer to the Vector Java docs and Vectors Java docs for details on the API. -import org.apache.spark.mllib.linalg.Vector; +import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.linalg.Vectors; // Create a dense vector (1.0, 0.0, 3.0). Vector dv = Vectors.dense(1.0, 0.0, 3.0); // Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries. -Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}); +Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}); @@ -405,18 +405,18 @@ in Ve Refer to the Vectors Python docs for more details on the API. -import numpy as np +import numpy as np import scipy.sparse as sps from pyspark.mllib.linalg import Vectors -# Use a NumPy array as a dense vector. +# Use a NumPy array as a dense vector. dv1 = np.array([1.0, 0.0, 3.0]) -# Use a Python list as a dense vector. +# Use a Python list as a dense vector. dv2 = [1.0, 0.0, 3.0] -# Create a SparseVector. +# Create a SparseVector. sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0]) -# Use a single-column SciPy csc_matrix as a sparse vector. -sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1)) +# Use a single-column SciPy csc_matrix as a sparse vector. +sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1)) @@ -438,14 +438,14 @@ For multiclass classification, labels should be class indices starting from zero Refer to the LabeledPoint Scala docs for details on the API. -import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint // Create a labeled point with a positive label and a dense feature vector. val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)) // Create a labeled point with a negative label and a sparse feature vector. 
-val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))) +val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))) @@ -456,14 +456,14 @@ For multiclass classification, labels should be class indices starting from zero Refer to the LabeledPoint Java docs for details on the API. -import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.mllib.linalg.Vectors; import org.apache.spark.mllib.regression.LabeledPoint; // Create a labeled point with a positive label and a dense feature vector. -LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)); +LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)); // Create a labeled point with a negative label and a sparse feature vector. -LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0})); +LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0})); @@ -474,14 +474,14 @@ For multiclass classification, labels should be class indices starting from zero Refer to the LabeledPoint Python docs for more details on the API. -from pyspark.mllib.linalg import SparseVector +from pyspark.mllib.linalg
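The local vector and labeled point snippets from this excerpt, gathered into one self-contained Scala sketch.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

// Dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)

// Sparse vector (1.0, 0.0, 3.0): indices and values of the nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// Sparse vector (1.0, 0.0, 3.0): sequence of (index, value) pairs.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))

// Labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// Labeled point with a negative label and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
```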
[03/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/structured-streaming-kafka-integration.html -- diff --git a/site/docs/2.1.0/structured-streaming-kafka-integration.html b/site/docs/2.1.0/structured-streaming-kafka-integration.html index 5ca9259..7d2254f 100644 --- a/site/docs/2.1.0/structured-streaming-kafka-integration.html +++ b/site/docs/2.1.0/structured-streaming-kafka-integration.html @@ -144,7 +144,7 @@ application. See the Deploying subsection below. -// Subscribe to 1 topic +// Subscribe to 1 topic val ds1 = spark .readStream .format(kafka) @@ -172,12 +172,12 @@ application. See the Deploying subsection below. .option(subscribePattern, topic.*) .load() ds3.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) - .as[(String, String)] + .as[(String, String)] -// Subscribe to 1 topic +// Subscribe to 1 topic DatasetRow ds1 = spark .readStream() .format(kafka) @@ -202,43 +202,43 @@ application. See the Deploying subsection below. .option(kafka.bootstrap.servers, host1:port1,host2:port2) .option(subscribePattern, topic.*) .load() -ds3.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) +ds3.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) -# Subscribe to 1 topic +# Subscribe to 1 topic ds1 = spark .readStream() - .format(kafka) - .option(kafka.bootstrap.servers, host1:port1,host2:port2) - .option(subscribe, topic1) + .format(kafka) + .option(kafka.bootstrap.servers, host1:port1,host2:port2) + .option(subscribe, topic1) .load() -ds1.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) +ds1.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) -# Subscribe to multiple topics +# Subscribe to multiple topics ds2 = spark .readStream - .format(kafka) - .option(kafka.bootstrap.servers, host1:port1,host2:port2) - .option(subscribe, topic1,topic2) + .format(kafka) + .option(kafka.bootstrap.servers, host1:port1,host2:port2) + .option(subscribe, topic1,topic2) .load() -ds2.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) +ds2.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) -# Subscribe to a pattern +# Subscribe to a pattern ds3 = spark .readStream() - .format(kafka) - .option(kafka.bootstrap.servers, host1:port1,host2:port2) - .option(subscribePattern, topic.*) + .format(kafka) + .option(kafka.bootstrap.servers, host1:port1,host2:port2) + .option(subscribePattern, topic.*) .load() -ds3.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) +ds3.selectExpr(CAST(key AS STRING), CAST(value AS STRING)) -Each row in the source has the following schema: - +Each row in the source has the following schema: +table class="table" ColumnType key @@ -268,7 +268,7 @@ application. See the Deploying subsection below. timestampType int - +/table The following options must be set for the Kafka source. - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
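A readable Scala sketch of the "subscribe to one topic" pattern shown in the Kafka integration excerpt; the bootstrap servers and topic names are placeholders, and it assumes an existing SparkSession `spark`.

```scala
import spark.implicits._

// Subscribe to one topic; servers and topic name are placeholders.
val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()

// Each row carries key/value as binary plus topic, partition, offset and timestamp columns;
// cast key and value to strings for downstream processing.
val kv = ds1
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

// Multiple topics can be matched with a pattern instead,
// e.g. .option("subscribePattern", "topic.*").
```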
[16/25] spark-website git commit: Update 2.1.0 docs to include https://github.com/apache/spark/pull/16294
http://git-wip-us.apache.org/repos/asf/spark-website/blob/d2bcf185/site/docs/2.1.0/mllib-collaborative-filtering.html -- diff --git a/site/docs/2.1.0/mllib-collaborative-filtering.html b/site/docs/2.1.0/mllib-collaborative-filtering.html index e453032..b3f9e08 100644 --- a/site/docs/2.1.0/mllib-collaborative-filtering.html +++ b/site/docs/2.1.0/mllib-collaborative-filtering.html @@ -322,13 +322,13 @@ - Collaborative filtering - Explicit vs. implicit feedback - Scaling of the regularization parameter + Collaborative filtering + Explicit vs. implicit feedback + Scaling of the regularization parameter - Examples - Tutorial + Examples + Tutorial Collaborative filtering @@ -393,7 +393,7 @@ recommendation model by measuring the Mean Squared Error of rating prediction.Refer to the ALS Scala docs for more details on the API. -import org.apache.spark.mllib.recommendation.ALS +import org.apache.spark.mllib.recommendation.ALS import org.apache.spark.mllib.recommendation.MatrixFactorizationModel import org.apache.spark.mllib.recommendation.Rating @@ -434,9 +434,9 @@ recommendation model by measuring the Mean Squared Error of rating prediction.If the rating matrix is derived from another source of information (i.e. it is inferred from other signals), you can use the trainImplicit method to get better results. -val alpha = 0.01 +val alpha = 0.01 val lambda = 0.01 -val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha) +val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha) @@ -449,7 +449,7 @@ that is equivalent to the provided example in Scala is given below: Refer to the ALS Java docs for more details on the API. -import scala.Tuple2; +import scala.Tuple2; import org.apache.spark.api.java.*; import org.apache.spark.api.java.function.Function; @@ -458,8 +458,8 @@ that is equivalent to the provided example in Scala is given below: import org.apache.spark.mllib.recommendation.Rating; import org.apache.spark.SparkConf; -SparkConf conf = new SparkConf().setAppName(Java Collaborative Filtering Example); -JavaSparkContext jsc = new JavaSparkContext(conf); +SparkConf conf = new SparkConf().setAppName(Java Collaborative Filtering Example); +JavaSparkContext jsc = new JavaSparkContext(conf); // Load and parse the data String path = data/mllib/als/test.data; @@ -468,7 +468,7 @@ that is equivalent to the provided example in Scala is given below: new FunctionString, Rating() { public Rating call(String s) { String[] sarray = s.split(,); - return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]), + return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]), Double.parseDouble(sarray[2])); } } @@ -528,36 +528,36 @@ recommendation by measuring the Mean Squared Error of rating prediction. Refer to the ALS Python docs for more details on the API. 
-from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating +from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating -# Load and parse the data -data = sc.textFile(data/mllib/als/test.data) -ratings = data.map(lambda l: l.split(,))\ +# Load and parse the data +data = sc.textFile(data/mllib/als/test.data) +ratings = data.map(lambda l: l.split(,))\ .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2]))) -# Build the recommendation model using Alternating Least Squares +# Build the recommendation model using Alternating Least Squares rank = 10 numIterations = 10 model = ALS.train(ratings, rank, numIterations) -# Evaluate the model on training data +# Evaluate the model on training data testdata = ratings.map(lambda p: (p[0], p[1])) predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2])) ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions) MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean() -print(Mean Squared Error = + str(MSE)) +print(Mean Squared Error = + str(MSE)) -# Save and load model -model.save(sc, target/tmp/myCollaborativeFilter) -sameModel = MatrixFactorizationModel.load(sc, target/tmp/myCollaborativeFilter) +# Save and load model +model.save(sc, target/tmp/myCollaborativeFilter) +sameModel = MatrixFactorizationModel.load(sc, target/tmp/myCollaborativeFilter) Find full example code at "examples/src/main/python/mllib/recommendation_example.py" in the Spark repo. If the rating matrix is derived from other source of information (i.e. it is inferred from other signals), you can use the trainImplicit method to get better results. -# Build the recommendation model using Alternating Least Squares based on implicit ratings -model = ALS.trainImplicit(ratings, rank,
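A Scala sketch of the ALS train-and-evaluate flow that this excerpt describes; the data path is the one from the docs, and it assumes an existing SparkContext `sc`.

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Load and parse ratings of the form "user,product,rating" (path is illustrative).
val data = sc.textFile("data/mllib/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using alternating least squares.
val rank = 10
val numIterations = 10
val lambda = 0.01
val model = ALS.train(ratings, rank, numIterations, lambda)

// Evaluate the model by the mean squared error of the predicted ratings.
val usersProducts = ratings.map { case Rating(user, product, _) => (user, product) }
val predictions = model.predict(usersProducts).map {
  case Rating(user, product, rate) => ((user, product), rate)
}
val ratesAndPreds = ratings.map {
  case Rating(user, product, rate) => ((user, product), rate)
}.join(predictions)
val mse = ratesAndPreds.map { case (_, (r, p)) => math.pow(r - p, 2) }.mean()
println(s"Mean Squared Error = $mse")

// For implicit-feedback data, ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)
// generally gives better results than ALS.train.
```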
[GitHub] spark issue #16424: [SPARK-19016][SQL][DOC] Document scalable partition hand...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16424 @ericl @mallman and @cloud-fan want to take a look?
[spark] Git Push Summary
Repository: spark Updated Tags: refs/tags/v2.1.0 [created] cd0a08361
[GitHub] spark pull request #16423: Update known_translations for contributor names a...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/16423 Update known_translations for contributor names and also fix a small issue in translate-contributors.py ## What changes were proposed in this pull request? This PR updates dev/create-release/known_translations to add more contributor name mapping. It also fixes a small issue in translate-contributors.py ## How was this patch tested? n/a You can merge this pull request into a Git repository by running: $ git pull https://github.com/yhuai/spark contributors Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16423.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16423 commit 9ce7ef88d98e3aa5d10d761af1967912517a97e1 Author: Yin Huai <yh...@databricks.com> Date: 2016-12-28T18:38:04Z Update contributor list and also fix a small issue in translate-contributors.py
[GitHub] spark issue #16391: [SPARK-18990][SQL] make DatasetBenchmark fairer for Data...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16391 Just reverted from master. btw, I think we can keep existing cases and then add new cases.
spark git commit: Revert "[SPARK-18990][SQL] make DatasetBenchmark fairer for Dataset"
Repository: spark Updated Branches: refs/heads/master a05cc425a -> 2404d8e54 Revert "[SPARK-18990][SQL] make DatasetBenchmark fairer for Dataset" This reverts commit a05cc425a0a7d18570b99883993a04ad175aa071. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2404d8e5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2404d8e5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2404d8e5 Branch: refs/heads/master Commit: 2404d8e54b6b2cfc78d892e7ebb31578457518a3 Parents: a05cc42 Author: Yin HuaiAuthored: Tue Dec 27 10:03:52 2016 -0800 Committer: Yin Huai Committed: Tue Dec 27 10:03:52 2016 -0800 -- .../org/apache/spark/sql/DatasetBenchmark.scala | 75 +--- 1 file changed, 33 insertions(+), 42 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2404d8e5/sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala index cd925e6..66d94d6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala @@ -17,6 +17,7 @@ package org.apache.spark.sql +import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.expressions.Aggregator import org.apache.spark.sql.expressions.scalalang.typed import org.apache.spark.sql.functions._ @@ -33,13 +34,11 @@ object DatasetBenchmark { def backToBackMap(spark: SparkSession, numRows: Long, numChains: Int): Benchmark = { import spark.implicits._ -val rdd = spark.sparkContext.range(0, numRows) -val ds = spark.range(0, numRows) -val df = ds.toDF("l") -val func = (l: Long) => l + 1 - +val df = spark.range(1, numRows).select($"id".as("l"), $"id".cast(StringType).as("s")) val benchmark = new Benchmark("back-to-back map", numRows) +val func = (d: Data) => Data(d.l + 1, d.s) +val rdd = spark.sparkContext.range(1, numRows).map(l => Data(l, l.toString)) benchmark.addCase("RDD") { iter => var res = rdd var i = 0 @@ -54,14 +53,14 @@ object DatasetBenchmark { var res = df var i = 0 while (i < numChains) { -res = res.select($"l" + 1 as "l") +res = res.select($"l" + 1 as "l", $"s") i += 1 } res.queryExecution.toRdd.foreach(_ => Unit) } benchmark.addCase("Dataset") { iter => - var res = ds.as[Long] + var res = df.as[Data] var i = 0 while (i < numChains) { res = res.map(func) @@ -76,14 +75,14 @@ object DatasetBenchmark { def backToBackFilter(spark: SparkSession, numRows: Long, numChains: Int): Benchmark = { import spark.implicits._ -val rdd = spark.sparkContext.range(0, numRows) -val ds = spark.range(0, numRows) -val df = ds.toDF("l") -val func = (l: Long, i: Int) => l % (100L + i) == 0L -val funcs = 0.until(numChains).map { i => (l: Long) => func(l, i) } - +val df = spark.range(1, numRows).select($"id".as("l"), $"id".cast(StringType).as("s")) val benchmark = new Benchmark("back-to-back filter", numRows) +val func = (d: Data, i: Int) => d.l % (100L + i) == 0L +val funcs = 0.until(numChains).map { i => + (d: Data) => func(d, i) +} +val rdd = spark.sparkContext.range(1, numRows).map(l => Data(l, l.toString)) benchmark.addCase("RDD") { iter => var res = rdd var i = 0 @@ -105,7 +104,7 @@ object DatasetBenchmark { } benchmark.addCase("Dataset") { iter => - var res = ds.as[Long] + var res = df.as[Data] var i = 0 while (i < numChains) { res = res.filter(funcs(i)) @@ -134,29 +133,24 @@ object DatasetBenchmark { def aggregate(spark: 
SparkSession, numRows: Long): Benchmark = { import spark.implicits._ -val rdd = spark.sparkContext.range(0, numRows) -val ds = spark.range(0, numRows) -val df = ds.toDF("l") - +val df = spark.range(1, numRows).select($"id".as("l"), $"id".cast(StringType).as("s")) val benchmark = new Benchmark("aggregate", numRows) +val rdd = spark.sparkContext.range(1, numRows).map(l => Data(l, l.toString)) benchmark.addCase("RDD sum") { iter => - rdd.map(l => (l % 10, l)).reduceByKey(_ + _).foreach(_ => Unit) + rdd.aggregate(0L)(_ + _.l, _ + _) } benchmark.addCase("DataFrame sum") { iter => - df.groupBy($"l" % 10).agg(sum($"l")).queryExecution.toRdd.foreach(_ => Unit) + df.select(sum($"l")).queryExecution.toRdd.foreach(_ => Unit) } benchmark.addCase("Dataset sum using
[GitHub] spark issue #16391: [SPARK-18990][SQL] make DatasetBenchmark fairer for Data...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16391 Seems there is no explicit LGTM. I am reverting this change from master.
[GitHub] spark pull request #16391: [SPARK-18990][SQL] make DatasetBenchmark fairer f...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16391#discussion_r93955613 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetBenchmark.scala --- @@ -133,24 +134,29 @@ object DatasetBenchmark { def aggregate(spark: SparkSession, numRows: Long): Benchmark = { import spark.implicits._ -val df = spark.range(1, numRows).select($"id".as("l"), $"id".cast(StringType).as("s")) +val rdd = spark.sparkContext.range(0, numRows) +val ds = spark.range(0, numRows) +val df = ds.toDF("l") + val benchmark = new Benchmark("aggregate", numRows) -val rdd = spark.sparkContext.range(1, numRows).map(l => Data(l, l.toString)) benchmark.addCase("RDD sum") { iter => - rdd.aggregate(0L)(_ + _.l, _ + _) + rdd.map(l => (l % 10, l)).reduceByKey(_ + _).foreach(_ => Unit) --- End diff -- Actually, I think we should also have a test case for aggregation without group by. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
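A rough sketch of what such an "aggregation without group by" case could look like, following the style of the surrounding DatasetBenchmark code. The names `rdd`, `ds`, `df` and `benchmark` are the values already defined in the `aggregate` method shown in the diff; this is a hypothetical illustration, not the change that was actually made.

```scala
// Hypothetical extra cases for DatasetBenchmark#aggregate, assuming the imports
// (spark.implicits._, functions._, scalalang.typed) and the rdd/ds/df/benchmark
// values already present in that method.
benchmark.addCase("RDD sum w/o group by") { iter =>
  rdd.fold(0L)(_ + _)
}

benchmark.addCase("DataFrame sum w/o group by") { iter =>
  df.select(sum($"l")).queryExecution.toRdd.foreach(_ => Unit)
}

benchmark.addCase("Dataset sum w/o group by") { iter =>
  ds.as[Long].select(typed.sumLong[Long]((l: Long) => l))
    .queryExecution.toRdd.foreach(_ => Unit)
}
```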