[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79166839
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -401,7 +401,10 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
   var statsProperties: Map[String, String] =
     Map(STATISTICS_TOTAL_SIZE -> stats.sizeInBytes.toString())
   if (stats.rowCount.isDefined) {
-    statsProperties += (STATISTICS_NUM_ROWS -> stats.rowCount.get.toString())
+    statsProperties += STATISTICS_NUM_ROWS -> stats.rowCount.get.toString()
--- End diff --

General question: Is this how Hive stores this?





[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79161115
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import scala.collection.mutable
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier}
+import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
CatalogTable}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, 
BasicColStats, Statistics}
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.types._
+
+
+/**
+ * Analyzes the given columns of the given table in the current database 
to generate statistics,
+ * which will be used in query optimizations.
+ */
+case class AnalyzeColumnCommand(
+tableIdent: TableIdentifier,
+columnNames: Seq[String]) extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val sessionState = sparkSession.sessionState
+val relation = 
EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
+
+// check correctness of column names
+val validColumns = mutable.MutableList[NamedExpression]()
+val resolver = sessionState.conf.resolver
+columnNames.foreach { col =>
--- End diff --

Why not use a more functional approach and use `map` here? That saves adding elements by hand, and creating the list in the first place.

You can also deduplicate the columns here.
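A rough sketch of what that could look like (my own illustration, not code from the PR), assuming the same `relation`, `resolver`, and `columnNames` as in the diff above:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.expressions.NamedExpression

// Resolve each requested column with `map` instead of appending to a mutable list.
val resolved: Seq[NamedExpression] = columnNames.map { col =>
  relation.resolve(col.split("\\."), resolver)
    .getOrElse(throw new AnalysisException(s"Invalid column name: $col"))
}

// Deduplicate columns that resolve to the same attribute, keeping the first occurrence.
val validColumns: Seq[NamedExpression] =
  resolved.foldLeft(Seq.empty[NamedExpression]) { (acc, expr) =>
    if (acc.exists(_.exprId == expr.exprId)) acc else acc :+ expr
  }
```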





[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79166502
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/StatisticsSuite.scala ---
@@ -101,4 +101,47 @@ class StatisticsSuite extends QueryTest with 
SharedSQLContext {
   checkTableStats(tableName, expectedRowCount = Some(2))
 }
   }
+
+  test("test column-level statistics for data source table created in 
InMemoryCatalog") {
+def checkColStats(colStats: BasicColStats, expectedColStats: 
BasicColStats): Unit = {
+  assert(colStats.dataType == expectedColStats.dataType)
+  assert(colStats.numNulls == expectedColStats.numNulls)
+  assert(colStats.max == expectedColStats.max)
+  assert(colStats.min == expectedColStats.min)
+  if (expectedColStats.ndv.isDefined) {
+// ndv is an approximate value, so we just make sure we have the 
value
+assert(colStats.ndv.get >= 0)
--- End diff --

You can also check whether the estimate is within 3 standard deviations of the expected value.
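For example, something along these lines (a sketch only; it assumes the configured HLL++ relative error, here called `relativeSD`, approximates one standard deviation of the estimate, and the 0.05 default is my guess, not a value from the PR):

```scala
// Hypothetical tightened assertion for the ndv estimate.
val relativeSD = 0.05
val expectedNdv = expectedColStats.ndv.get.toDouble
val actualNdv = colStats.ndv.get.toDouble
assert(math.abs(actualNdv - expectedNdv) <= 3 * relativeSD * expectedNdv)
```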





[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79163397
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import scala.collection.mutable
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier}
+import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
CatalogTable}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, 
BasicColStats, Statistics}
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.types._
+
+
+/**
+ * Analyzes the given columns of the given table in the current database 
to generate statistics,
+ * which will be used in query optimizations.
+ */
+case class AnalyzeColumnCommand(
+tableIdent: TableIdentifier,
+columnNames: Seq[String]) extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val sessionState = sparkSession.sessionState
+val relation = 
EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
+
+// check correctness of column names
+val validColumns = mutable.MutableList[NamedExpression]()
+val resolver = sessionState.conf.resolver
+columnNames.foreach { col =>
+  val exprOption = relation.resolve(col.split("\\."), resolver)
+  if (exprOption.isEmpty) {
+throw new AnalysisException(s"Invalid column name: $col")
+  }
+  if (validColumns.map(_.exprId).contains(exprOption.get.exprId)) {
+throw new AnalysisException(s"Duplicate column name: $col")
+  }
+  validColumns += exprOption.get
+}
+
+relation match {
+  case catalogRel: CatalogRelation =>
+updateStats(catalogRel.catalogTable,
+  AnalyzeTableCommand.calculateTotalSize(sparkSession, 
catalogRel.catalogTable))
+
+  case logicalRel: LogicalRelation if 
logicalRel.catalogTable.isDefined =>
+updateStats(logicalRel.catalogTable.get, 
logicalRel.relation.sizeInBytes)
+
+  case otherRelation =>
+throw new AnalysisException("ANALYZE TABLE is not supported for " +
+  s"${otherRelation.nodeName}.")
+}
+
+def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit 
= {
+  // Collect statistics per column.
+  // The first element in the result will be the overall row count, 
the following elements
+  // will be structs containing all column stats.
+  // The layout of each struct follows the layout of the BasicColStats.
+  val ndvMaxErr = sessionState.conf.ndvMaxError
+  val expressions = Count(Literal(1)).toAggregateExpression() +:
+validColumns.map(ColumnStatsStruct(_, ndvMaxErr))
+  val namedExpressions = expressions.map(e => Alias(e, e.toString)())
+  val statsRow = Dataset.ofRows(sparkSession, Aggregate(Nil, 
namedExpressions, relation))
+.queryExecution.toRdd.collect().head
+
+  // unwrap the result
+  val rowCount = statsRow.getLong(0)
+  val colStats = validColumns.zipWithIndex.map { case (expr, i) =>
+val colInfo = statsRow.getStruct(i + 1, 
ColumnStatsStruct.statsNumber)
+val colStats = ColumnStatsStruct.unwrapRow(expr, colInfo)
+(expr.name, colStats)
+  }.toMap
+
+  val statistics =
+Statistics(sizeInBytes = newTotalSize, rowCount = Some(rowCount), 
basicColStats = colStats)
--- End diff --

I think that we should document the policy for (partial) statistics

[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79161012
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import scala.collection.mutable
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier}
+import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
CatalogTable}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, 
BasicColStats, Statistics}
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.types._
+
+
+/**
+ * Analyzes the given columns of the given table in the current database 
to generate statistics,
+ * which will be used in query optimizations.
+ */
+case class AnalyzeColumnCommand(
+tableIdent: TableIdentifier,
+columnNames: Seq[String]) extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val sessionState = sparkSession.sessionState
+val relation = 
EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
+
+// check correctness of column names
+val validColumns = mutable.MutableList[NamedExpression]()
+val resolver = sessionState.conf.resolver
+columnNames.foreach { col =>
+  val exprOption = relation.resolve(col.split("\\."), resolver)
+  if (exprOption.isEmpty) {
+throw new AnalysisException(s"Invalid column name: $col")
+  }
+  if (validColumns.map(_.exprId).contains(exprOption.get.exprId)) {
+throw new AnalysisException(s"Duplicate column name: $col")
--- End diff --

To me it is not a problem when a user specifies the same column twice. Let's just deduplicate this.





[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79161613
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import scala.collection.mutable
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier}
+import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
CatalogTable}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, 
BasicColStats, Statistics}
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.types._
+
+
+/**
+ * Analyzes the given columns of the given table in the current database 
to generate statistics,
+ * which will be used in query optimizations.
+ */
+case class AnalyzeColumnCommand(
+tableIdent: TableIdentifier,
+columnNames: Seq[String]) extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val sessionState = sparkSession.sessionState
+val relation = 
EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
+
+// check correctness of column names
+val validColumns = mutable.MutableList[NamedExpression]()
+val resolver = sessionState.conf.resolver
+columnNames.foreach { col =>
+  val exprOption = relation.resolve(col.split("\\."), resolver)
+  if (exprOption.isEmpty) {
+throw new AnalysisException(s"Invalid column name: $col")
+  }
+  if (validColumns.map(_.exprId).contains(exprOption.get.exprId)) {
+throw new AnalysisException(s"Duplicate column name: $col")
+  }
+  validColumns += exprOption.get
+}
+
+relation match {
+  case catalogRel: CatalogRelation =>
+updateStats(catalogRel.catalogTable,
+  AnalyzeTableCommand.calculateTotalSize(sparkSession, 
catalogRel.catalogTable))
+
+  case logicalRel: LogicalRelation if 
logicalRel.catalogTable.isDefined =>
+updateStats(logicalRel.catalogTable.get, 
logicalRel.relation.sizeInBytes)
+
+  case otherRelation =>
+throw new AnalysisException("ANALYZE TABLE is not supported for " +
+  s"${otherRelation.nodeName}.")
+}
+
+def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit 
= {
+  // Collect statistics per column.
+  // The first element in the result will be the overall row count, 
the following elements
+  // will be structs containing all column stats.
+  // The layout of each struct follows the layout of the BasicColStats.
+  val ndvMaxErr = sessionState.conf.ndvMaxError
+  val expressions = Count(Literal(1)).toAggregateExpression() +:
+validColumns.map(ColumnStatsStruct(_, ndvMaxErr))
+  val namedExpressions = expressions.map(e => Alias(e, e.toString)())
+  val statsRow = Dataset.ofRows(sparkSession, Aggregate(Nil, 
namedExpressions, relation))
+.queryExecution.toRdd.collect().head
+
+  // unwrap the result
+  val rowCount = statsRow.getLong(0)
+  val colStats = validColumns.zipWithIndex.map { case (expr, i) =>
+val colInfo = statsRow.getStruct(i + 1, 
ColumnStatsStruct.statsNumber)
--- End diff --

Let's move these two lines into the `ColumnStatsStruct` object.



[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79161379
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import scala.collection.mutable
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier}
+import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
CatalogTable}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, 
BasicColStats, Statistics}
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.types._
+
+
+/**
+ * Analyzes the given columns of the given table in the current database 
to generate statistics,
+ * which will be used in query optimizations.
+ */
+case class AnalyzeColumnCommand(
+tableIdent: TableIdentifier,
+columnNames: Seq[String]) extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val sessionState = sparkSession.sessionState
+val relation = 
EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
+
+// check correctness of column names
+val validColumns = mutable.MutableList[NamedExpression]()
+val resolver = sessionState.conf.resolver
+columnNames.foreach { col =>
+  val exprOption = relation.resolve(col.split("\\."), resolver)
+  if (exprOption.isEmpty) {
+throw new AnalysisException(s"Invalid column name: $col")
--- End diff --

`exprOption.getOrElse(throw new AnalysisException(s"Invalid column name: $col"))` is shorter.





[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79158813
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import scala.collection.mutable
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier}
+import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
CatalogTable}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, 
BasicColStats, Statistics}
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.types._
+
+
+/**
+ * Analyzes the given columns of the given table in the current database 
to generate statistics,
+ * which will be used in query optimizations.
+ */
+case class AnalyzeColumnCommand(
+tableIdent: TableIdentifier,
+columnNames: Seq[String]) extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val sessionState = sparkSession.sessionState
+val relation = 
EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
+
+// check correctness of column names
+val validColumns = mutable.MutableList[NamedExpression]()
+val resolver = sessionState.conf.resolver
+columnNames.foreach { col =>
+  val exprOption = relation.resolve(col.split("\\."), resolver)
+  if (exprOption.isEmpty) {
+throw new AnalysisException(s"Invalid column name: $col")
+  }
+  if (validColumns.map(_.exprId).contains(exprOption.get.exprId)) {
+throw new AnalysisException(s"Duplicate column name: $col")
+  }
+  validColumns += exprOption.get
+}
+
+relation match {
+  case catalogRel: CatalogRelation =>
+updateStats(catalogRel.catalogTable,
+  AnalyzeTableCommand.calculateTotalSize(sparkSession, 
catalogRel.catalogTable))
+
+  case logicalRel: LogicalRelation if 
logicalRel.catalogTable.isDefined =>
+updateStats(logicalRel.catalogTable.get, 
logicalRel.relation.sizeInBytes)
+
+  case otherRelation =>
+throw new AnalysisException("ANALYZE TABLE is not supported for " +
+  s"${otherRelation.nodeName}.")
+}
+
+def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit 
= {
+  // Collect statistics per column.
+  // The first element in the result will be the overall row count, 
the following elements
+  // will be structs containing all column stats.
+  // The layout of each struct follows the layout of the BasicColStats.
+  val ndvMaxErr = sessionState.conf.ndvMaxError
+  val expressions = Count(Literal(1)).toAggregateExpression() +:
+validColumns.map(ColumnStatsStruct(_, ndvMaxErr))
+  val namedExpressions = expressions.map(e => Alias(e, e.toString)())
+  val statsRow = Dataset.ofRows(sparkSession, Aggregate(Nil, 
namedExpressions, relation))
+.queryExecution.toRdd.collect().head
+
+  // unwrap the result
+  val rowCount = statsRow.getLong(0)
+  val colStats = validColumns.zipWithIndex.map { case (expr, i) =>
+val colInfo = statsRow.getStruct(i + 1, 
ColumnStatsStruct.statsNumber)
+val colStats = ColumnStatsStruct.unwrapRow(expr, colInfo)
+(expr.name, colStats)
+  }.toMap
+
+  val statistics =
+Statistics(sizeInBytes = newTotalSize, rowCount = Some(rowCount), 
basicColStats = colStats)
+  sessionState.catalog.alterTable(catalogTable.copy(stats = 
Some(statistics)))

[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79158612
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import scala.collection.mutable
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier}
+import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
CatalogTable}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, 
BasicColStats, Statistics}
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.types._
+
+
+/**
+ * Analyzes the given columns of the given table in the current database 
to generate statistics,
+ * which will be used in query optimizations.
+ */
+case class AnalyzeColumnCommand(
+tableIdent: TableIdentifier,
+columnNames: Seq[String]) extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val sessionState = sparkSession.sessionState
+val relation = 
EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
+
+// check correctness of column names
+val validColumns = mutable.MutableList[NamedExpression]()
+val resolver = sessionState.conf.resolver
+columnNames.foreach { col =>
+  val exprOption = relation.resolve(col.split("\\."), resolver)
+  if (exprOption.isEmpty) {
+throw new AnalysisException(s"Invalid column name: $col")
+  }
+  if (validColumns.map(_.exprId).contains(exprOption.get.exprId)) {
+throw new AnalysisException(s"Duplicate column name: $col")
+  }
+  validColumns += exprOption.get
+}
+
+relation match {
+  case catalogRel: CatalogRelation =>
+updateStats(catalogRel.catalogTable,
+  AnalyzeTableCommand.calculateTotalSize(sparkSession, 
catalogRel.catalogTable))
+
+  case logicalRel: LogicalRelation if 
logicalRel.catalogTable.isDefined =>
+updateStats(logicalRel.catalogTable.get, 
logicalRel.relation.sizeInBytes)
+
+  case otherRelation =>
+throw new AnalysisException("ANALYZE TABLE is not supported for " +
+  s"${otherRelation.nodeName}.")
+}
+
+def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit 
= {
+  // Collect statistics per column.
+  // The first element in the result will be the overall row count, 
the following elements
+  // will be structs containing all column stats.
+  // The layout of each struct follows the layout of the BasicColStats.
+  val ndvMaxErr = sessionState.conf.ndvMaxError
+  val expressions = Count(Literal(1)).toAggregateExpression() +:
+validColumns.map(ColumnStatsStruct(_, ndvMaxErr))
+  val namedExpressions = expressions.map(e => Alias(e, e.toString)())
+  val statsRow = Dataset.ofRows(sparkSession, Aggregate(Nil, 
namedExpressions, relation))
+.queryExecution.toRdd.collect().head
+
+  // unwrap the result
+  val rowCount = statsRow.getLong(0)
+  val colStats = validColumns.zipWithIndex.map { case (expr, i) =>
+val colInfo = statsRow.getStruct(i + 1, 
ColumnStatsStruct.statsNumber)
+val colStats = ColumnStatsStruct.unwrapRow(expr, colInfo)
+(expr.name, colStats)
+  }.toMap
+
+  val statistics =
+Statistics(sizeInBytes = newTotalSize, rowCount = Some(rowCount), 
basicColStats = colStats)
+  sessionState.catalog.alterTable(catalogTable.copy(stats = 
Some(statistics)))

[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79159036
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import scala.collection.mutable
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier}
+import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
CatalogTable}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, 
BasicColStats, Statistics}
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.types._
+
+
+/**
+ * Analyzes the given columns of the given table in the current database 
to generate statistics,
+ * which will be used in query optimizations.
+ */
+case class AnalyzeColumnCommand(
+tableIdent: TableIdentifier,
+columnNames: Seq[String]) extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val sessionState = sparkSession.sessionState
+val relation = 
EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
+
+// check correctness of column names
+val validColumns = mutable.MutableList[NamedExpression]()
+val resolver = sessionState.conf.resolver
+columnNames.foreach { col =>
+  val exprOption = relation.resolve(col.split("\\."), resolver)
+  if (exprOption.isEmpty) {
+throw new AnalysisException(s"Invalid column name: $col")
+  }
+  if (validColumns.map(_.exprId).contains(exprOption.get.exprId)) {
+throw new AnalysisException(s"Duplicate column name: $col")
+  }
+  validColumns += exprOption.get
+}
+
+relation match {
+  case catalogRel: CatalogRelation =>
+updateStats(catalogRel.catalogTable,
+  AnalyzeTableCommand.calculateTotalSize(sparkSession, 
catalogRel.catalogTable))
+
+  case logicalRel: LogicalRelation if 
logicalRel.catalogTable.isDefined =>
+updateStats(logicalRel.catalogTable.get, 
logicalRel.relation.sizeInBytes)
+
+  case otherRelation =>
+throw new AnalysisException("ANALYZE TABLE is not supported for " +
+  s"${otherRelation.nodeName}.")
+}
+
+def updateStats(catalogTable: CatalogTable, newTotalSize: Long): Unit 
= {
+  // Collect statistics per column.
+  // The first element in the result will be the overall row count, 
the following elements
+  // will be structs containing all column stats.
+  // The layout of each struct follows the layout of the BasicColStats.
+  val ndvMaxErr = sessionState.conf.ndvMaxError
+  val expressions = Count(Literal(1)).toAggregateExpression() +:
+validColumns.map(ColumnStatsStruct(_, ndvMaxErr))
+  val namedExpressions = expressions.map(e => Alias(e, e.toString)())
+  val statsRow = Dataset.ofRows(sparkSession, Aggregate(Nil, 
namedExpressions, relation))
+.queryExecution.toRdd.collect().head
+
+  // unwrap the result
+  val rowCount = statsRow.getLong(0)
+  val colStats = validColumns.zipWithIndex.map { case (expr, i) =>
+val colInfo = statsRow.getStruct(i + 1, 
ColumnStatsStruct.statsNumber)
+val colStats = ColumnStatsStruct.unwrapRow(expr, colInfo)
+(expr.name, colStats)
+  }.toMap
+
+  val statistics =
+Statistics(sizeInBytes = newTotalSize, rowCount = Some(rowCount), 
basicColStats = colStats)
+  sessionState.catalog.alterTable(catalogTable.copy(stats = 
Some(statistics)))

[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79160641
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeColumnCommand.scala
 ---
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import scala.collection.mutable
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier}
+import org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
CatalogTable}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, 
BasicColStats, Statistics}
+import org.apache.spark.sql.execution.datasources.LogicalRelation
+import org.apache.spark.sql.types._
+
+
+/**
+ * Analyzes the given columns of the given table in the current database 
to generate statistics,
+ * which will be used in query optimizations.
+ */
+case class AnalyzeColumnCommand(
+tableIdent: TableIdentifier,
+columnNames: Seq[String]) extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+val sessionState = sparkSession.sessionState
+val relation = 
EliminateSubqueryAliases(sessionState.catalog.lookupRelation(tableIdent))
+
+// check correctness of column names
+val validColumns = mutable.MutableList[NamedExpression]()
+val resolver = sessionState.conf.resolver
+columnNames.foreach { col =>
+  val exprOption = relation.resolve(col.split("\\."), resolver)
--- End diff --

We need to be very careful here: this will go wrong as soon as the name is quoted and contains dots, e.g. `` `a.d` ``, which should be treated as a single name. Please make this command take `UnresolvedAttribute`s instead of strings.
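A rough sketch of the intent (my own illustration with a hypothetical helper, not the PR's code): if the command receives `UnresolvedAttribute`s, a quoted name such as `` `a.d` `` arrives as a single name part and no re-splitting on dots is needed:

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.analysis.{Resolver, UnresolvedAttribute}
import org.apache.spark.sql.catalyst.expressions.NamedExpression
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical helper: resolve already-parsed attributes against the relation.
def resolveColumns(
    columns: Seq[UnresolvedAttribute],
    relation: LogicalPlan,
    resolver: Resolver): Seq[NamedExpression] = {
  columns.map { attr =>
    // attr.nameParts keeps `a.d` as one part when the name was quoted in SQL
    relation.resolve(attr.nameParts, resolver)
      .getOrElse(throw new AnalysisException(s"Invalid column name: ${attr.name}"))
  }
}
```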





[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79168577
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsColumnSuite.scala 
---
@@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.sql.{Date, Timestamp}
+
+import org.apache.spark.sql.{AnalysisException, Row}
+import org.apache.spark.sql.catalyst.plans.logical.BasicColStats
+import org.apache.spark.sql.execution.command.AnalyzeColumnCommand
+import org.apache.spark.sql.types._
+
+class StatisticsColumnSuite extends StatisticsTest {
+
+  test("parse analyze column commands") {
+val table = "table"
+assertAnalyzeCommand(
+  s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS key, value",
+  classOf[AnalyzeColumnCommand])
+
+val noColumnError = intercept[AnalysisException] {
+  sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS")
+}
+assert(noColumnError.message == "Need to specify the columns to 
analyze. Usage: " +
+  "ANALYZE TABLE tbl COMPUTE STATISTICS FOR COLUMNS key, value")
+
+withTable(table) {
+  sql(s"CREATE TABLE $table (key INT, value STRING)")
+  val invalidColError = intercept[AnalysisException] {
+sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS k")
+  }
+  assert(invalidColError.message == s"Invalid column name: k")
+
+  val duplicateColError = intercept[AnalysisException] {
+sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS key, 
value, key")
+  }
+  assert(duplicateColError.message == s"Duplicate column name: key")
+
+  withSQLConf("spark.sql.caseSensitive" -> "true") {
+val invalidErr = intercept[AnalysisException] {
+  sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS keY")
+}
+assert(invalidErr.message == s"Invalid column name: keY")
+  }
+
+  withSQLConf("spark.sql.caseSensitive" -> "false") {
+val duplicateErr = intercept[AnalysisException] {
+  sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS key, 
value, vaLue")
+}
+assert(duplicateErr.message == s"Duplicate column name: vaLue")
+  }
+}
+  }
+
+  test("basic statistics for integral type columns") {
+val rdd = sparkContext.parallelize(Seq("1", null, "2", "3", null)).map 
{ i =>
+  if (i != null) Row(i.toByte, i.toShort, i.toInt, i.toLong) else 
Row(i, i, i, i)
+}
+val schema = StructType(
+  StructField(name = "c1", dataType = ByteType, nullable = true) ::
+StructField(name = "c2", dataType = ShortType, nullable = true) ::
+StructField(name = "c3", dataType = IntegerType, nullable = true) 
::
+StructField(name = "c4", dataType = LongType, nullable = true) :: 
Nil)
+val expectedBasicStats = BasicColStats(
+  dataType = ByteType, numNulls = 2, max = Some(3), min = Some(1), ndv 
= Some(3))
--- End diff --

It is weird that this works. The min and max values should not compare equal for the Short and Long typed columns.
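My guess at why it passes (an assumption about the cause, not something stated in the PR): Scala's `==` on boxed values uses cooperative numeric equality, so equal numeric values of different widths compare equal even when typed as `Any` or wrapped in `Option`:

```scala
// Scala's == on boxed numeric values compares them numerically across widths.
val asAny: Any = 3.toShort
println(asAny == (3: Any))    // true
println(Some(3L) == Some(3))  // also true, so the Option-wrapped max/min compare equal
```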





[GitHub] spark pull request #15090: [SPARK-17073] [SQL] generate column-level statist...

2016-09-16 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15090#discussion_r79166182
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -563,6 +563,13 @@ object SQLConf {
   .timeConf(TimeUnit.MILLISECONDS)
   .createWithDefault(10L)
 
+  val NDV_MAX_ERROR =
+SQLConfigBuilder("spark.sql.ndv.maxError")
--- End diff --

Could you place this under `spark.sql.statistics`?
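Something like this, assuming the same builder API as in the diff (the doc string and default are my guesses, not taken from the PR):

```scala
val NDV_MAX_ERROR =
  SQLConfigBuilder("spark.sql.statistics.ndv.maxError")
    .doc("The maximum estimation error allowed in the HyperLogLog++ algorithm " +
      "when generating column level statistics.")
    .doubleConf
    .createWithDefault(0.05)
```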





[GitHub] spark issue #15117: [SPARK-17561] [DOCS] DataFrameWriter documentation forma...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15117
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15117: [SPARK-17561] [DOCS] DataFrameWriter documentation forma...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15117
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65484/
Test PASSed.





[GitHub] spark issue #15117: [SPARK-17561] [DOCS] DataFrameWriter documentation forma...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15117
  
**[Test build #65484 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65484/consoleFull)**
 for PR 15117 at commit 
[`5c9e486`](https://github.com/apache/spark/commit/5c9e486285d8b9eaa30dd52ac9c96398393f5e0b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-09-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r79183882
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala
 ---
@@ -204,13 +194,21 @@ case class CreateDataSourceTableAsSelectCommand(
   case None => data
 }
 
+val tableWithPath = if (table.tableType == CatalogTableType.MANAGED) {
+  table.withNewStorage(
--- End diff --

see 
https://github.com/apache/spark/pull/15024/files#diff-a4847709cb46baee83fc4d3e8cc5c998R200





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-09-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r79184683
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -665,15 +665,7 @@ case class AlterTableSetLocationCommand(
 catalog.alterPartitions(tableName, Seq(newPart))
   case None =>
 // No partition spec is specified, so we set the location for the 
table itself
-val newTable =
-  if (DDLUtils.isDatasourceTable(table)) {
-table.withNewStorage(
-  locationUri = Some(location),
-  properties = table.storage.properties ++ Map("path" -> 
location))
-  } else {
-table.withNewStorage(locationUri = Some(location))
-  }
-catalog.alterTable(newTable)
+catalog.alterTable(table.withNewStorage(locationUri = 
Some(location)))
--- End diff --

cc @yhuai @liancheng 





[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15024
  
**[Test build #65485 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65485/consoleFull)**
 for PR 15024 at commit 
[`bccd283`](https://github.com/apache/spark/commit/bccd283f39f48712a2ce197601e1b7b46dff4a70).





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-09-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r79184660
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -665,15 +665,7 @@ case class AlterTableSetLocationCommand(
 catalog.alterPartitions(tableName, Seq(newPart))
   case None =>
 // No partition spec is specified, so we set the location for the 
table itself
-val newTable =
-  if (DDLUtils.isDatasourceTable(table)) {
-table.withNewStorage(
-  locationUri = Some(location),
-  properties = table.storage.properties ++ Map("path" -> 
location))
-  } else {
-table.withNewStorage(locationUri = Some(location))
-  }
-catalog.alterTable(newTable)
+catalog.alterTable(table.withNewStorage(locationUri = 
Some(location)))
--- End diff --

FYI we have a bug here: currently we allow users to `SET LOCATION` for a managed data source table, but `SHOW CREATE TABLE` cannot generate correct SQL to recreate a managed data source table whose location has been set, because a data source table with a path is always treated as an external table. We should either forbid `SET LOCATION` for managed data source tables, or improve the CREATE TABLE syntax to support managed tables with a path.
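A hedged illustration of the problem (hypothetical table and path names, run from a `spark` session; not code from the PR):

```scala
// A managed data source table whose location is later changed.
spark.sql("CREATE TABLE t (i INT) USING parquet")
spark.sql("ALTER TABLE t SET LOCATION '/tmp/some_custom_path'")   // currently allowed

// SHOW CREATE TABLE now emits a statement that carries the path, and replaying
// that statement would create an external table instead of the original managed one.
spark.sql("SHOW CREATE TABLE t").show(truncate = false)
```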





[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...

2016-09-16 Thread lresende
Github user lresende commented on the issue:

https://github.com/apache/spark/pull/14981
  
OK, reverting the commit that removes the Kinesis assembly, as the Python tests rely on it for the transitive dependencies. Note that I also tried to overcome this requirement by appending all the required dependencies as jars, but there seem to be other issues; in any case, these experiments are available in the following branch: https://github.com/lresende/spark/tree/remove-kinesis-assembly.





[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...

2016-09-16 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14981
  
Yeah, it's either going to be turning off the profile entirely (if we can't 
distribute the non-assembly artifact), or leaving it on but manually excluding 
the assembly artifact.





[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14981
  
**[Test build #65486 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65486/consoleFull)**
 for PR 14981 at commit 
[`6e3fec4`](https://github.com/apache/spark/commit/6e3fec4f6e04fc6beec195125031b0af8d21e809).





[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15024
  
**[Test build #65487 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65487/consoleFull)**
 for PR 15024 at commit 
[`82c67cf`](https://github.com/apache/spark/commit/82c67cf10f898e13be33dd32b5568ba0e6b9b338).





[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14981
  
**[Test build #65488 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65488/consoleFull)**
 for PR 14981 at commit 
[`e892004`](https://github.com/apache/spark/commit/e8920040e98b04ef24b5663613e7127cb447ae49).





[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-09-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r79189646
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -409,16 +416,31 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
 }
 
 if (DDLUtils.isDatasourceTable(withStatsProps)) {
-  val oldDef = client.getTable(db, withStatsProps.identifier.table)
-  // Sets the `schema`, `partitionColumnNames` and `bucketSpec` from 
the old table definition,
-  // to retain the spark specific format if it is. Also add old data 
source properties to table
-  // properties, to retain the data source table format.
-  val oldDataSourceProps = 
oldDef.properties.filter(_._1.startsWith(DATASOURCE_PREFIX))
+  val oldTableDef = client.getTable(db, 
withStatsProps.identifier.table)
+
+  val oldLocation = 
oldTableDef.properties.getOrElse(DATASOURCE_LOCATION,
+oldTableDef.storage.locationUri.get)
+  // Only update the `locationUri` field if the location is really 
changed.
+  val newLocation = if (oldLocation == 
tableDefinition.storage.locationUri.get) {
--- End diff --

This is to keep the previous behaviour.

Previously, if we altered table fields other than the location, we kept both the 
`locationUri` field and the `path` option unchanged. If we altered the table location, 
we updated both the `locationUri` field and the `path` option, although setting 
`locationUri` is dangerous because the path may be a file path.

Now, if we alter table fields other than the location, we keep the `locationUri` field 
unchanged, although it may differ from the one in the old raw table (the `oldTableDef` 
fetched directly from the Hive client) because of the hacks in `saveTableIntoHive`. If 
we alter the table location, we update the `locationUri` field even if it is a file 
path, the same as before.
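To make the rule concrete, here is a minimal sketch of the decision described above; it 
is not the actual `HiveExternalCatalog` code, and the function and parameter names are 
made up for illustration:

```scala
// Sketch only: `oldLocation` stands for the location read back from the raw Hive
// table, `alteredLocation` for the location in the ALTER TABLE definition.
def chooseLocation(
    oldLocation: Option[String],
    alteredLocation: Option[String]): Option[String] = {
  if (oldLocation == alteredLocation) {
    oldLocation      // location untouched: keep whatever Hive already stores
  } else {
    alteredLocation  // location really changed: take the new one, even a file path
  }
}
```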



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15024
  
**[Test build #65489 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65489/consoleFull)**
 for PR 15024 at commit 
[`b09d8bb`](https://github.com/apache/spark/commit/b09d8bbe609eb8232b7aacabbfebb7d80c04f310).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15105
  
**[Test build #65490 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65490/consoleFull)**
 for PR 15105 at commit 
[`18e6bfe`](https://github.com/apache/spark/commit/18e6bfeb3736b62802c78bebe0296b77f2e5865c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15099: [SPARK-17541][SQL] fix some DDL bugs about table managem...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15099
  
**[Test build #65491 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65491/consoleFull)**
 for PR 15099 at commit 
[`c5513d5`](https://github.com/apache/spark/commit/c5513d591d5e4802a6b52ddd8034106a1f1cc739).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-16 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/15105
  
Thanks for the feedback @srowen! I think 18e6bfe addresses everything from 
a code perspective, but it missed removing the comment about assuming that 
distinct words have distinct vector representations, so I've just pushed 
another commit that removes that comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15105
  
**[Test build #65492 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65492/consoleFull)**
 for PR 15105 at commit 
[`624c0f8`](https://github.com/apache/spark/commit/624c0f8b1ad78f98cabcc51a1ed7a94f5f20a25a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15064: [SPARK-17509][SQL]When wrapping catalyst datatype...

2016-09-16 Thread sitalkedia
Github user sitalkedia commented on a diff in the pull request:

https://github.com/apache/spark/pull/15064#discussion_r79201464
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala ---
@@ -240,74 +240,173 @@ private[hive] trait HiveInspectors {
 
   /**
* Wraps with Hive types based on object inspector.
-   * TODO: Consolidate all hive OI/data interface code.
*/
   protected def wrapperFor(oi: ObjectInspector, dataType: DataType): Any 
=> Any = oi match {
-case _: JavaHiveVarcharObjectInspector =>
+case x: ConstantObjectInspector =>
   (o: Any) =>
-if (o != null) {
-  val s = o.asInstanceOf[UTF8String].toString
-  new HiveVarchar(s, s.length)
-} else {
-  null
+x.getWritableConstantValue
+case x: PrimitiveObjectInspector => x match {
+  // TODO we don't support the HiveVarcharObjectInspector yet.
+  case _: StringObjectInspector if x.preferWritable() =>
+(o: Any) => getStringWritable(o)
+  case _: StringObjectInspector =>
+(o: Any) => if (o != null) o.asInstanceOf[UTF8String].toString() 
else null
+  case _: IntObjectInspector if x.preferWritable() =>
+(o: Any) => getIntWritable(o)
+  case _: IntObjectInspector =>
+(o: Any) => if (o != null) o.asInstanceOf[java.lang.Integer] else 
null
--- End diff --

Good point, done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12004
  
**[Test build #65494 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65494/consoleFull)**
 for PR 12004 at commit 
[`b04c037`](https://github.com/apache/spark/commit/b04c037f2925d9b698e541493fc936627ddcf9ba).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15064: [SPARK-17509][SQL]When wrapping catalyst datatype to Hiv...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15064
  
**[Test build #65493 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65493/consoleFull)**
 for PR 15064 at commit 
[`cba0b82`](https://github.com/apache/spark/commit/cba0b8245f0328c6f5647cd6593346b560cf1952).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15064: [SPARK-17509][SQL]When wrapping catalyst datatype to Hiv...

2016-09-16 Thread sitalkedia
Github user sitalkedia commented on the issue:

https://github.com/apache/spark/pull/15064
  
>> the title is not fixed yet

@cloud-fan - I updated the PR title, what am I missing?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15064: [SPARK-17509][SQL]When wrapping catalyst datatype to Hiv...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15064
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15064: [SPARK-17509][SQL]When wrapping catalyst datatype to Hiv...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15064
  
**[Test build #65493 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65493/consoleFull)**
 for PR 15064 at commit 
[`cba0b82`](https://github.com/apache/spark/commit/cba0b8245f0328c6f5647cd6593346b560cf1952).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15064: [SPARK-17509][SQL]When wrapping catalyst datatype to Hiv...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15064
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65493/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12004
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12004
  
**[Test build #65494 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65494/consoleFull)**
 for PR 12004 at commit 
[`b04c037`](https://github.com/apache/spark/commit/b04c037f2925d9b698e541493fc936627ddcf9ba).
 * This patch **fails build dependency tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #12004: [SPARK-7481][build] [WIP] Add Hadoop 2.7+ spark-cloud mo...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12004
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65494/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15105
  
**[Test build #65490 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65490/consoleFull)**
 for PR 15105 at commit 
[`18e6bfe`](https://github.com/apache/spark/commit/18e6bfeb3736b62802c78bebe0296b77f2e5865c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15105
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65490/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15105
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15024
  
**[Test build #65485 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65485/consoleFull)**
 for PR 15024 at commit 
[`bccd283`](https://github.com/apache/spark/commit/bccd283f39f48712a2ce197601e1b7b46dff4a70).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15024
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15024
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65485/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15064: [SPARK-17509][SQL]When wrapping catalyst datatype to Hiv...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15064
  
**[Test build #65495 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65495/consoleFull)**
 for PR 15064 at commit 
[`77485cd`](https://github.com/apache/spark/commit/77485cdfa5b1371f07a08882ec806e5663d89cc3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15064: [SPARK-17509][SQL]When wrapping catalyst datatype to Hiv...

2016-09-16 Thread sitalkedia
Github user sitalkedia commented on the issue:

https://github.com/apache/spark/pull/15064
  
@gatorsmile - We do have Linux perf/flamegraph support for gathering CPU 
stats for our Spark jobs. On a per-job basis, we run perf/jstack periodically 
and aggregate the results, which are displayed as a flamegraph. Please read 
http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html to learn more 
about Linux perf and flamegraph integration. This gives us an aggregated view of 
the time spent in a particular function across executors running on hundreds of 
nodes, which lets us find hot spots in the code, something that is critical 
for perf improvements. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15024
  
**[Test build #65487 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65487/consoleFull)**
 for PR 15024 at commit 
[`82c67cf`](https://github.com/apache/spark/commit/82c67cf10f898e13be33dd32b5568ba0e6b9b338).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15024
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15024
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65487/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15112: [RFC][SPARK-17549][sql] Only collect table size stat in ...

2016-09-16 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/15112
  
@yhuai any more comments?

I really want to keep the codegen metrics change because otherwise Spark 
just fails on the large table I tested on. We can file a separate bug to look 
at the Janino issue and point at this one (and the data attached to the bug) as 
the source of the issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15024
  
**[Test build #65489 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65489/consoleFull)**
 for PR 15024 at commit 
[`b09d8bb`](https://github.com/apache/spark/commit/b09d8bbe609eb8232b7aacabbfebb7d80c04f310).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15024
  
**[Test build #65496 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65496/consoleFull)**
 for PR 15024 at commit 
[`1953b61`](https://github.com/apache/spark/commit/1953b61ababb7113c6b22b45f0a8633eaaa40217).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15024
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15112: [RFC][SPARK-17549][sql] Only collect table size s...

2016-09-16 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15112#discussion_r79208645
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -910,14 +910,19 @@ object CodeGenerator extends Logging {
 codeAttrField.setAccessible(true)
 classes.foreach { case (_, classBytes) =>
   
CodegenMetrics.METRIC_GENERATED_CLASS_BYTECODE_SIZE.update(classBytes.length)
-  val cf = new ClassFile(new ByteArrayInputStream(classBytes))
-  cf.methodInfos.asScala.foreach { method =>
-method.getAttributes().foreach { a =>
-  if (a.getClass.getName == codeAttr.getName) {
-CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.update(
-  codeAttrField.get(a).asInstanceOf[Array[Byte]].length)
+  try {
+val cf = new ClassFile(new ByteArrayInputStream(classBytes))
+cf.methodInfos.asScala.foreach { method =>
+  method.getAttributes().foreach { a =>
+if (a.getClass.getName == codeAttr.getName) {
+  CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.update(
+codeAttrField.get(a).asInstanceOf[Array[Byte]].length)
+}
   }
 }
--- End diff --

My worry is that we will simply forget about this issue if we make it 
log a warning. Removing this try/catch will not fail any existing tests, right? 
We can create a new JIRA to fix this issue for Spark 2.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15024
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65489/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-09-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r79208945
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -409,16 +416,34 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
 }
 
 if (DDLUtils.isDatasourceTable(withStatsProps)) {
-  val oldDef = client.getTable(db, withStatsProps.identifier.table)
-  // Sets the `schema`, `partitionColumnNames` and `bucketSpec` from 
the old table definition,
-  // to retain the spark specific format if it is. Also add old data 
source properties to table
-  // properties, to retain the data source table format.
-  val oldDataSourceProps = 
oldDef.properties.filter(_._1.startsWith(DATASOURCE_PREFIX))
+  val oldTableDef = client.getTable(db, 
withStatsProps.identifier.table)
+
+  val oldLocation = if (tableDefinition.tableType == EXTERNAL) {
+oldTableDef.properties.get(DATASOURCE_LOCATION)
+  } else {
+tableDefinition.storage.locationUri
+  }
+  // Only update the `locationUri` field if the location is really 
changed.
+  val newLocation = if (oldLocation == 
tableDefinition.storage.locationUri) {
--- End diff --

This is to keep the previous behaviour.

Previously, if we altered table fields other than the location, we kept both the 
`locationUri` field and the `path` option unchanged. If we altered the table location, 
we updated both the `locationUri` field and the `path` option, although setting 
`locationUri` is dangerous because the path may be a file path.

Now, if we alter table fields other than the location, we keep the `locationUri` field 
unchanged, although it may differ from the one in the old raw table (the `oldTableDef` 
fetched directly from the Hive client) because of the hacks in `saveTableIntoHive`. If 
we alter the table location, we update the `locationUri` field even if it is a file 
path, which is the same behaviour as before.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15105
  
**[Test build #65492 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65492/consoleFull)**
 for PR 15105 at commit 
[`624c0f8`](https://github.com/apache/spark/commit/624c0f8b1ad78f98cabcc51a1ed7a94f5f20a25a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14731: [SPARK-17159] [streaming]: optimise check for new...

2016-09-16 Thread steveloughran
Github user steveloughran commented on a diff in the pull request:

https://github.com/apache/spark/pull/14731#discussion_r79209668
  
--- Diff: docs/streaming-programming-guide.md ---
@@ -644,13 +644,44 @@ methods for creating DStreams from files as input 
sources.
 
 
 
-   Spark Streaming will monitor the directory `dataDirectory` and process 
any files created in that directory (files written in nested directories not 
supported). Note that
-
- + The files must have the same data format.
- + The files must be created in the `dataDirectory` by atomically 
*moving* or *renaming* them into
- the data directory.
- + Once moved, the files must not be changed. So if the files are 
being continuously appended, the new data will not be read.
-
+   Spark Streaming will monitor the directory `dataDirectory` and process 
any files created in that directory.
+
+ + A simple directory can be supplied, such as 
`hdfs://namenode:8040/logs/`.
+   All files directly such a path will be processed as they are 
discovered.
+ + A regular expression can be supplied instead, such as
+   `hdfs://namenode:8040/logs/2016-*-31`.
+   Here, the DStream will consist of all files directly under those 
directories
+   matching the regular expression.
+   That is: it is a pattern of directories, not of files in 
directories.
+ + All files must be in the same data format.
+ * A file is considered part of a time period based on its 
modification time
+   —not its creation time.
+ + Files must be created in/moved under the `dataDirectory` 
directory/directories by
+   an atomic operation. In HDFS and similar filesystems, this can be 
done *renaming* them
+   into the data directory from another part of the same filesystem.
+ * If a wildcard is used to identify directories, such as 
`hdfs://namenode:8040/logs/2016*`,
+   renaming an entire directory to match the path will add the 
directory to the list of
+   monitored directories. However, unless the modification time of the 
directory's files
+   are within that of the current window, they will not be recognized 
as new files.
+ + Once processed, changes to a file will not cause the file to be 
reread.
+   That is: Updates are ignored.
+ + The more files under a directory/wildcard pattern, the longer it 
will take to
+   scan for changes —even if no files have actually changed.
+
+Special points for object stores
--- End diff --

For object stores, direct writes to the directory, resulting in a PUT on 
close(), will guarantee that a file is picked up immediately. Things are 
actually a bit quirky for HDFS; even the file length doesn't get updated reliably 
while a write is in progress. I'll add a section there and then ask people who 
understand HDFS what is really happening.
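For reference, a minimal sketch of the glob-based directory monitoring the doc text 
above describes; the namenode address and path pattern are placeholders taken from 
the documentation, not a real cluster:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GlobFileStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("glob-file-stream-sketch")
    val ssc = new StreamingContext(conf, Seconds(60))
    // The pattern matches directories; files directly under the matched
    // directories are picked up based on their modification time.
    val logs = ssc.textFileStream("hdfs://namenode:8040/logs/2016-*-31")
    logs.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```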


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15105
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65492/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15105
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15024
  
**[Test build #65497 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65497/consoleFull)**
 for PR 15024 at commit 
[`5dfdc35`](https://github.com/apache/spark/commit/5dfdc357ea2f1a73e31a060ff027872e62e99c58).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #65498 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65498/consoleFull)**
 for PR 14731 at commit 
[`735fc7c`](https://github.com/apache/spark/commit/735fc7c2343c08a323e3d213e611830e3b41ef04).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14981
  
**[Test build #65486 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65486/consoleFull)**
 for PR 14981 at commit 
[`6e3fec4`](https://github.com/apache/spark/commit/6e3fec4f6e04fc6beec195125031b0af8d21e809).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14981
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65486/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14981
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15112: [RFC][SPARK-17549][sql] Only collect table size s...

2016-09-16 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15112#discussion_r79210556
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -910,14 +910,19 @@ object CodeGenerator extends Logging {
 codeAttrField.setAccessible(true)
 classes.foreach { case (_, classBytes) =>
   
CodegenMetrics.METRIC_GENERATED_CLASS_BYTECODE_SIZE.update(classBytes.length)
-  val cf = new ClassFile(new ByteArrayInputStream(classBytes))
-  cf.methodInfos.asScala.foreach { method =>
-method.getAttributes().foreach { a =>
-  if (a.getClass.getName == codeAttr.getName) {
-CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.update(
-  codeAttrField.get(a).asInstanceOf[Array[Byte]].length)
+  try {
+val cf = new ClassFile(new ByteArrayInputStream(classBytes))
+cf.methodInfos.asScala.foreach { method =>
+  method.getAttributes().foreach { a =>
+if (a.getClass.getName == codeAttr.getName) {
+  CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.update(
+codeAttrField.get(a).asInstanceOf[Array[Byte]].length)
+}
   }
 }
--- End diff --

How about my suggestion of adding the workaround and filing a bug? Then 
there's no worry about forgetting anything.

Because it's most probably a Janino bug, fixing it might not be as simple 
as just making some change in Spark.
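For what it's worth, a minimal sketch of what the logged-warning workaround could look 
like as it would sit inside the existing `classes.foreach` loop in `CodeGenerator`, 
where `classBytes` and `logWarning` are already in scope; the catch clause and message 
are assumptions, not the final patch:

```scala
import java.io.ByteArrayInputStream
import scala.util.control.NonFatal
import org.codehaus.janino.util.ClassFile

try {
  val cf = new ClassFile(new ByteArrayInputStream(classBytes))
  // ... record per-method bytecode size metrics as before ...
} catch {
  case NonFatal(e) =>
    // Don't fail codegen just because the metrics pass choked on the class file;
    // log a warning and track the underlying Janino issue in a follow-up JIRA.
    logWarning("Error calculating stats of compiled class.", e)
}
```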


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14981
  
**[Test build #65488 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65488/consoleFull)**
 for PR 14981 at commit 
[`e892004`](https://github.com/apache/spark/commit/e8920040e98b04ef24b5663613e7127cb447ae49).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14981
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14981: [SPARK-17418] Remove Kinesis artifacts from Spark releas...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14981
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65488/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15024
  
**[Test build #65499 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65499/consoleFull)**
 for PR 15024 at commit 
[`19ea121`](https://github.com/apache/spark/commit/19ea121e75ed2e57b4919f4112d48e27299efdfa).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15024: [SPARK-17470][SQL] unify path for data source tab...

2016-09-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15024#discussion_r79212283
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
@@ -409,16 +416,30 @@ private[spark] class HiveExternalCatalog(conf: 
SparkConf, hadoopConf: Configurat
 }
 
 if (DDLUtils.isDatasourceTable(withStatsProps)) {
-  val oldDef = client.getTable(db, withStatsProps.identifier.table)
-  // Sets the `schema`, `partitionColumnNames` and `bucketSpec` from 
the old table definition,
-  // to retain the spark specific format if it is. Also add old data 
source properties to table
-  // properties, to retain the data source table format.
-  val oldDataSourceProps = 
oldDef.properties.filter(_._1.startsWith(DATASOURCE_PREFIX))
+  val oldTableDef = client.getTable(db, 
withStatsProps.identifier.table)
+
+  val oldLocation = getLocationFromRawTable(oldTableDef)
+  // Only update the `locationUri` field if the location is really 
changed.
+  val newLocation = if (oldLocation == 
tableDefinition.storage.locationUri) {
--- End diff --

This is to keep the previous behaviour.

Previously, if we altered table fields other than the location, we kept both the 
`locationUri` field and the `path` option unchanged. If we altered the table location, 
we updated both the `locationUri` field and the `path` option, although setting 
`locationUri` is dangerous because the path may be a file path.

Now, if we alter table fields other than the location, we keep the `locationUri` field 
unchanged, although it may differ from the one in the old raw table (the `oldTableDef` 
fetched directly from the Hive client) because of the hacks in `saveTableIntoHive`. If 
we alter the table location, we update the `locationUri` field even if it is a file 
path, which is the same behaviour as before.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15064: [SPARK-17509][SQL]When wrapping catalyst datatype...

2016-09-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15064#discussion_r79212737
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala ---
@@ -238,102 +238,166 @@ private[hive] trait HiveInspectors {
 case c => throw new AnalysisException(s"Unsupported java type $c")
   }
 
+  private def withNullSafe(f: Any => Any): Any => Any = {
+input => if (input == null) null else f(input)
+  }
+
   /**
* Wraps with Hive types based on object inspector.
-   * TODO: Consolidate all hive OI/data interface code.
*/
   protected def wrapperFor(oi: ObjectInspector, dataType: DataType): Any 
=> Any = oi match {
-case _: JavaHiveVarcharObjectInspector =>
-  (o: Any) =>
-if (o != null) {
-  val s = o.asInstanceOf[UTF8String].toString
-  new HiveVarchar(s, s.length)
-} else {
-  null
-}
-
-case _: JavaHiveCharObjectInspector =>
-  (o: Any) =>
-if (o != null) {
-  val s = o.asInstanceOf[UTF8String].toString
-  new HiveChar(s, s.length)
-} else {
-  null
-}
-
-case _: JavaHiveDecimalObjectInspector =>
-  (o: Any) =>
-if (o != null) {
-  HiveDecimal.create(o.asInstanceOf[Decimal].toJavaBigDecimal)
-} else {
-  null
-}
-
-case _: JavaDateObjectInspector =>
-  (o: Any) =>
-if (o != null) {
-  DateTimeUtils.toJavaDate(o.asInstanceOf[Int])
-} else {
-  null
-}
-
-case _: JavaTimestampObjectInspector =>
+case x: ConstantObjectInspector =>
   (o: Any) =>
-if (o != null) {
-  DateTimeUtils.toJavaTimestamp(o.asInstanceOf[Long])
-} else {
-  null
-}
+x.getWritableConstantValue
+case x: PrimitiveObjectInspector => x match {
+  // TODO we don't support the HiveVarcharObjectInspector yet.
+  case _: StringObjectInspector if x.preferWritable() =>
+withNullSafe(o => getStringWritable(o))
+  case _: StringObjectInspector =>
+withNullSafe(o => o.asInstanceOf[UTF8String].toString())
+  case _: IntObjectInspector if x.preferWritable() =>
+withNullSafe(o => getIntWritable(o))
+  case _: IntObjectInspector =>
+withNullSafe(o => o.asInstanceOf[java.lang.Integer])
+  case _: BooleanObjectInspector if x.preferWritable() =>
+withNullSafe(o => getBooleanWritable(o))
+  case _: BooleanObjectInspector =>
+withNullSafe(o => o.asInstanceOf[java.lang.Boolean])
+  case _: FloatObjectInspector if x.preferWritable() =>
+withNullSafe(o => getFloatWritable(o))
+  case _: FloatObjectInspector =>
+withNullSafe(o => o.asInstanceOf[java.lang.Float])
+  case _: DoubleObjectInspector if x.preferWritable() =>
+withNullSafe(o => getDoubleWritable(o))
+  case _: DoubleObjectInspector =>
+withNullSafe(o => o.asInstanceOf[java.lang.Double])
+  case _: LongObjectInspector if x.preferWritable() =>
+withNullSafe(o => getLongWritable(o))
+  case _: LongObjectInspector =>
+withNullSafe(o => o.asInstanceOf[java.lang.Long])
+  case _: ShortObjectInspector if x.preferWritable() =>
+withNullSafe(o => getShortWritable(o))
+  case _: ShortObjectInspector =>
+withNullSafe(o => o.asInstanceOf[java.lang.Short])
+  case _: ByteObjectInspector if x.preferWritable() =>
+withNullSafe(o => getByteWritable(o))
+  case _: ByteObjectInspector =>
+withNullSafe(o => o.asInstanceOf[java.lang.Byte])
+  case _: JavaHiveVarcharObjectInspector =>
+withNullSafe(
--- End diff --

code style:
```
withNullSafe { o =>
  val s = o.asInstanceOf[UTF8String].toString
  new HiveVarchar(s, s.length)
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14975: Correct fetchsize property name in docs

2016-09-16 Thread darabos
Github user darabos commented on the issue:

https://github.com/apache/spark/pull/14975
  
> @darabos do you want to close this out or should I do the update?

Sorry! I'll try to do it tonight. If I don't report back, consider me eaten 
by a monster.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15099: [SPARK-17541][SQL] fix some DDL bugs about table ...

2016-09-16 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15099#discussion_r79214780
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala ---
@@ -284,8 +284,12 @@ class CatalogImpl(sparkSession: SparkSession) extends 
Catalog {
* @since 2.0.0
*/
   override def dropTempView(viewName: String): Unit = {
-
sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(viewName))
-sessionCatalog.dropTable(TableIdentifier(viewName), ignoreIfNotExists 
= true, purge = false)
+val maybeTempView = 
sparkSession.sessionState.catalog.getTempView(viewName)
+if (maybeTempView.isDefined) {
+  val view = SubqueryAlias(viewName, maybeTempView.get, 
Some(TableIdentifier(viewName)))
--- End diff --

Let me show an example in `CachedTableSuite.scala`:
```Scala
  test("Drops cached temporary table") {
testData.select('key).createOrReplaceTempView("t1")
testData.select('key).createOrReplaceTempView("t2")
spark.catalog.cacheTable("t1")

assert(spark.catalog.isCached("t1"))
assert(spark.catalog.isCached("t2"))

spark.catalog.dropTempView("t1")
intercept[AnalysisException](spark.table("t1"))
assert(!spark.catalog.isCached("t2"))
assert(spark.sharedState.cacheManager.isEmpty)
  }
```

It is not affected by `SubqueryAlias`.





[GitHub] spark issue #15099: [SPARK-17541][SQL] fix some DDL bugs about table managem...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15099
  
**[Test build #65491 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65491/consoleFull)**
 for PR 15099 at commit 
[`c5513d5`](https://github.com/apache/spark/commit/c5513d591d5e4802a6b52ddd8034106a1f1cc739).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15099: [SPARK-17541][SQL] fix some DDL bugs about table managem...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15099
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65491/
Test PASSed.





[GitHub] spark issue #15099: [SPARK-17541][SQL] fix some DDL bugs about table managem...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15099
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14961: [SPARK-17379] [BUILD] Upgrade netty-all to 4.0.41 final ...

2016-09-16 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/14961
  
FYI, I finally figured out the root cause: 
https://github.com/netty/netty/issues/5833

As far as I understand, `System.setProperty("io.netty.maxDirectMemory", 
"0");` should be a correct workaround.

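A minimal sketch of how that workaround might be applied in a driver or test 
setup, assuming the property has to be set before any Netty class is loaded:
```Scala
object NettyDirectMemoryWorkaround {
  def main(args: Array[String]): Unit = {
    // Assumption: this must run before Netty is first initialized,
    // otherwise io.netty.maxDirectMemory has already been read.
    System.setProperty("io.netty.maxDirectMemory", "0")
    // ...create the SparkContext / start the network service afterwards...
  }
}
```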




[GitHub] spark issue #15099: [SPARK-17541][SQL] fix some DDL bugs about table managem...

2016-09-16 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15099
  
LGTM, except one minor 
[comment](https://github.com/apache/spark/pull/15099/files#r79041060) 





[GitHub] spark pull request #14926: [SPARK-17365][Core] Remove/Kill multiple executor...

2016-09-16 Thread dhruve
Github user dhruve commented on a diff in the pull request:

https://github.com/apache/spark/pull/14926#discussion_r79196391
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 ---
@@ -583,7 +585,17 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
   _ => Future.successful(false)
 }
 
-  adjustTotalExecutors.flatMap(killExecutors)(ThreadUtils.sameThread)
+  val killResponse = 
adjustTotalExecutors.flatMap(killExecutors)(ThreadUtils.sameThread)
+
+  def execKilled(killStatus: Boolean): Future[Seq[String]] = 
Future.successful(
+if (killStatus) {
+  executorsToKill
+} else {
+  Seq.empty[String]
+}
+  )
+
+  killResponse.flatMap(flag => 
execKilled(flag))(ThreadUtils.sameThread)
--- End diff --

Agreed. 
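
What is presumably being agreed to here (an assumption, since the original 
review remark is not quoted) is that the extra `Future.successful` round trip 
can be replaced with a plain `map` over the kill acknowledgement; a 
self-contained sketch:
```Scala
import scala.concurrent.{ExecutionContext, Future}

// Sketch (assumption): map the Boolean acknowledgement straight to the
// list of executors instead of flatMapping into Future.successful.
def killedExecutors(killResponse: Future[Boolean], executorsToKill: Seq[String])
                   (implicit ec: ExecutionContext): Future[Seq[String]] =
  killResponse.map(acknowledged => if (acknowledged) executorsToKill else Seq.empty[String])
```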





[GitHub] spark pull request #14926: [SPARK-17365][Core] Remove/Kill multiple executor...

2016-09-16 Thread dhruve
Github user dhruve commented on a diff in the pull request:

https://github.com/apache/spark/pull/14926#discussion_r79199040
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -392,10 +397,70 @@ private[spark] class ExecutorAllocationManager(
   }
 
   /**
+   * Request the cluster manager to remove the given executors.
+   * Return whether the request is acknowledged. Ideally we should be 
returning the list of
+   * executors which were removed as the requested executors and the one's 
actually removed can be
+   * different (CoarseGrainedSchedulerBackend can filter some executors). 
To avoid breaking the API
+   * we continue to return a Boolean.
+   */
+  private def removeExecutors(executors: Seq[String]): Boolean = 
synchronized {
+
+val executorIdsToBeRemoved = new ArrayBuffer[String]
+
+logInfo("Request to remove executorIds: " + executors.mkString(", "))
+val numExistingExecutors = executorIds.size - 
executorsPendingToRemove.size
+for(executorId <- executors) {
+  // Do not kill the executor if we have already reached the lower 
bound
+  val newExecutorTotal = numExistingExecutors - 
executorIdsToBeRemoved.size
+  if (newExecutorTotal - 1 < minNumExecutors) {
+logDebug(s"Not removing idle executor $executorId because there 
are only " +
+  s"$numExistingExecutors executor(s) left (limit 
$minNumExecutors)")
+  } else if (canBeKilled(executorId)) {
+executorIdsToBeRemoved += executorId
+  }
+}
+
+if (executorIdsToBeRemoved.isEmpty) {
+  return false
+}
+
+// Send a request to the backend to kill this executor(s)
+val executorsRemoved = if (testing) {
+  executorIdsToBeRemoved
+} else {
+  client.killExecutors(executorIdsToBeRemoved)
+}
+
+if (testing || executorsRemoved.nonEmpty) {
+  val numExistingExecutors = allocationManager.executorIds.size - 
executorsPendingToRemove.size
+  var index = 0
+  for(index <- 0 until executorsRemoved.size) {
--- End diff --

I would like to keep `numExistingExecutors`; it's easier to skim through the 
logs to validate the sequential release of executors with dynamic allocation 
enabled.





[GitHub] spark pull request #14926: [SPARK-17365][Core] Remove/Kill multiple executor...

2016-09-16 Thread dhruve
Github user dhruve commented on a diff in the pull request:

https://github.com/apache/spark/pull/14926#discussion_r79196128
  
--- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala
 ---
@@ -83,8 +84,14 @@ class JobScheduler(val ssc: StreamingContext) extends 
Logging {
 listenerBus.start()
 receiverTracker = new ReceiverTracker(ssc)
 inputInfoTracker = new InputInfoTracker(ssc)
+
+val executorAllocClient: ExecutorAllocationClient = 
ssc.sparkContext.schedulerBackend match {
+  case b: ExecutorAllocationClient => 
b.asInstanceOf[ExecutorAllocationClient]
+  case _ => null
+}
+
 executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
-  ssc.sparkContext,
+  executorAllocClient,
--- End diff --

We are checking that in `createIfEnabled`.





[GitHub] spark pull request #14926: [SPARK-17365][Core] Remove/Kill multiple executor...

2016-09-16 Thread dhruve
Github user dhruve commented on a diff in the pull request:

https://github.com/apache/spark/pull/14926#discussion_r79198414
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 ---
@@ -564,6 +564,8 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
 .filter { id => force || !scheduler.isExecutorBusy(id) }
   executorsToKill.foreach { id => executorsPendingToRemove(id) = 
!replace }
 
+  logInfo(s"Requesting to kill filtered executor(s) 
${executorsToKill.mkString(", ")}")
--- End diff --

It's been added to differentiate between what was requested and what's 
actually being sent. If we drop it, the log statement above will read the same. 
How about "idle"?





[GitHub] spark pull request #14926: [SPARK-17365][Core] Remove/Kill multiple executor...

2016-09-16 Thread dhruve
Github user dhruve commented on a diff in the pull request:

https://github.com/apache/spark/pull/14926#discussion_r79217027
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -392,10 +397,70 @@ private[spark] class ExecutorAllocationManager(
   }
 
   /**
+   * Request the cluster manager to remove the given executors.
--- End diff --

okay. 





[GitHub] spark pull request #14926: [SPARK-17365][Core] Remove/Kill multiple executor...

2016-09-16 Thread dhruve
Github user dhruve commented on a diff in the pull request:

https://github.com/apache/spark/pull/14926#discussion_r79198768
  
--- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -392,10 +397,70 @@ private[spark] class ExecutorAllocationManager(
   }
 
   /**
+   * Request the cluster manager to remove the given executors.
+   * Return whether the request is acknowledged. Ideally we should be 
returning the list of
+   * executors which were removed as the requested executors and the one's 
actually removed can be
+   * different (CoarseGrainedSchedulerBackend can filter some executors). 
To avoid breaking the API
+   * we continue to return a Boolean.
+   */
+  private def removeExecutors(executors: Seq[String]): Boolean = 
synchronized {
+
+val executorIdsToBeRemoved = new ArrayBuffer[String]
+
+logInfo("Request to remove executorIds: " + executors.mkString(", "))
+val numExistingExecutors = executorIds.size - 
executorsPendingToRemove.size
+for(executorId <- executors) {
+  // Do not kill the executor if we have already reached the lower 
bound
+  val newExecutorTotal = numExistingExecutors - 
executorIdsToBeRemoved.size
+  if (newExecutorTotal - 1 < minNumExecutors) {
+logDebug(s"Not removing idle executor $executorId because there 
are only " +
+  s"$numExistingExecutors executor(s) left (limit 
$minNumExecutors)")
+  } else if (canBeKilled(executorId)) {
+executorIdsToBeRemoved += executorId
+  }
+}
+
+if (executorIdsToBeRemoved.isEmpty) {
+  return false
+}
+
+// Send a request to the backend to kill this executor(s)
+val executorsRemoved = if (testing) {
+  executorIdsToBeRemoved
+} else {
+  client.killExecutors(executorIdsToBeRemoved)
+}
+
+if (testing || executorsRemoved.nonEmpty) {
+  val numExistingExecutors = allocationManager.executorIds.size - 
executorsPendingToRemove.size
+  var index = 0
+  for(index <- 0 until executorsRemoved.size) {
+val removedExecutorId = executorsRemoved(index)
+val newExecutorTotal = numExistingExecutors - (index + 1)
+logInfo(s"Removing executor $removedExecutorId because it has been 
idle for " +
+  s"$executorIdleTimeoutS seconds (new desired total will be 
$newExecutorTotal)")
+executorsPendingToRemove.add(removedExecutorId)
+  }
+  true
+} else {
+  logWarning(s"Unable to reach the cluster manager to kill executor/s 
" +
+executorIdsToBeRemoved.mkString(",") + "or no executor eligible to 
kill!")
--- End diff --

Yeah. That space is important.





[GitHub] spark issue #15064: [SPARK-17509][SQL]When wrapping catalyst datatype to Hiv...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15064
  
**[Test build #65495 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65495/consoleFull)**
 for PR 15064 at commit 
[`77485cd`](https://github.com/apache/spark/commit/77485cdfa5b1371f07a08882ec806e5663d89cc3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15064: [SPARK-17509][SQL]When wrapping catalyst datatype to Hiv...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15064
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65495/
Test PASSed.





[GitHub] spark issue #15064: [SPARK-17509][SQL]When wrapping catalyst datatype to Hiv...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15064
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #15112: [RFC][SPARK-17549][sql] Only collect table size s...

2016-09-16 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15112#discussion_r79221081
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -910,14 +910,19 @@ object CodeGenerator extends Logging {
 codeAttrField.setAccessible(true)
 classes.foreach { case (_, classBytes) =>
   
CodegenMetrics.METRIC_GENERATED_CLASS_BYTECODE_SIZE.update(classBytes.length)
-  val cf = new ClassFile(new ByteArrayInputStream(classBytes))
-  cf.methodInfos.asScala.foreach { method =>
-method.getAttributes().foreach { a =>
-  if (a.getClass.getName == codeAttr.getName) {
-CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.update(
-  codeAttrField.get(a).asInstanceOf[Array[Byte]].length)
+  try {
+val cf = new ClassFile(new ByteArrayInputStream(classBytes))
+cf.methodInfos.asScala.foreach { method =>
+  method.getAttributes().foreach { a =>
+if (a.getClass.getName == codeAttr.getName) {
+  CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.update(
+codeAttrField.get(a).asInstanceOf[Array[Byte]].length)
+}
   }
 }
--- End diff --

OK. It seems this part is used to record some metrics, so I guess it is fine. 
But let me ping @ericl, who added this method, to double-check.





[GitHub] spark pull request #15112: [RFC][SPARK-17549][sql] Only collect table size s...

2016-09-16 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15112#discussion_r79221439
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
 ---
@@ -910,14 +910,19 @@ object CodeGenerator extends Logging {
 codeAttrField.setAccessible(true)
 classes.foreach { case (_, classBytes) =>
   
CodegenMetrics.METRIC_GENERATED_CLASS_BYTECODE_SIZE.update(classBytes.length)
-  val cf = new ClassFile(new ByteArrayInputStream(classBytes))
-  cf.methodInfos.asScala.foreach { method =>
-method.getAttributes().foreach { a =>
-  if (a.getClass.getName == codeAttr.getName) {
-CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.update(
-  codeAttrField.get(a).asInstanceOf[Array[Byte]].length)
+  try {
+val cf = new ClassFile(new ByteArrayInputStream(classBytes))
+cf.methodInfos.asScala.foreach { method =>
+  method.getAttributes().foreach { a =>
+if (a.getClass.getName == codeAttr.getName) {
+  CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.update(
+codeAttrField.get(a).asInstanceOf[Array[Byte]].length)
+}
   }
 }
--- End diff --

In any case, I filed SPARK-17565 to track the actual fix. This is just a 
workaround so Spark doesn't fail.
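
The pattern being described, best-effort metric collection that never fails 
the query, looks roughly like the sketch below (the catch branch falls outside 
the quoted hunk, so its exact contents are an assumption):
```Scala
import scala.util.control.NonFatal

// Sketch (assumption): bytecode-size metrics are best effort, so a parse
// failure on the generated class bytes is logged and swallowed rather
// than propagated up into query execution.
def recordMetricsBestEffort(classBytes: Array[Byte])(record: Array[Byte] => Unit): Unit =
  try {
    record(classBytes)
  } catch {
    case NonFatal(e) =>
      // Hypothetical logging; the real code would use Spark's Logging trait.
      Console.err.println(s"Could not record bytecode metrics: ${e.getMessage}")
  }
```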





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14731
  
**[Test build #65498 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65498/consoleFull)**
 for PR 14731 at commit 
[`735fc7c`](https://github.com/apache/spark/commit/735fc7c2343c08a323e3d213e611830e3b41ef04).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65498/
Test PASSed.





[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14731
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15024
  
**[Test build #65496 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65496/consoleFull)**
 for PR 15024 at commit 
[`1953b61`](https://github.com/apache/spark/commit/1953b61ababb7113c6b22b45f0a8633eaaa40217).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15024
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15024: [SPARK-17470][SQL] unify path for data source table and ...

2016-09-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15024
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65496/
Test FAILed.





[GitHub] spark issue #15115: [SPARK-17558] Bump Hadoop 2.7 version from 2.7.2 to 2.7....

2016-09-16 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/15115
  
Merging in master/2.0.





