[GitHub] spark pull request: [SPARK-14346] SHOW CREATE TABLE for data sourc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12781#discussion_r61845886

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -389,3 +392,238 @@ case class ShowTablePropertiesCommand(table: TableIdentifier, propertyKey: Optio
     }
   }
 }
+
+/**
+ * A command to list the column names for a table. This command corresponds to a
+ * [[ShowColumnsCommand]] logical plan.
+ *
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database];
+ * }}}
+ */
+case class ShowColumnsCommand(table: TableIdentifier) extends RunnableCommand {
+  // The result of SHOW COLUMNS has one column called 'result'
+  override val output: Seq[Attribute] = {
+    AttributeReference("result", StringType, nullable = false)() :: Nil
+  }
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    sparkSession.sessionState.catalog.getTableMetadata(table).schema.map { c =>
+      Row(c.name)
+    }
+  }
+}
+
+/**
+ * A command to list the partition names of a table. If a partition spec is specified,
+ * only the partitions that match the spec are returned. An [[AnalysisException]] is
+ * thrown under the following conditions:
+ *
+ * 1. If the command is called for a non-partitioned table.
+ * 2. If the partition spec refers to columns that are not defined as partitioning columns.
+ *
+ * This command corresponds to a [[ShowPartitionsCommand]] logical plan.
+ *
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
+ * }}}
+ */
+case class ShowPartitionsCommand(
+    table: TableIdentifier,
+    spec: Option[TablePartitionSpec]) extends RunnableCommand {
+  // The result of SHOW PARTITIONS has one column called 'result'
+  override val output: Seq[Attribute] = {
+    AttributeReference("result", StringType, nullable = false)() :: Nil
+  }
+
+  private def getPartName(spec: TablePartitionSpec, partColNames: Seq[String]): String = {
+    partColNames.map { name =>
+      PartitioningUtils.escapePathName(name) + "=" + PartitioningUtils.escapePathName(spec(name))
+    }.mkString(File.separator)
+  }
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+
+    if (catalog.isTemporaryTable(table)) {
+      throw new AnalysisException(
+        s"SHOW PARTITIONS is not allowed on a temporary table: ${table.unquotedString}")
+    }
+
+    val tab = catalog.getTableMetadata(table)
+
+    /**
+     * Validate and throw an [[AnalysisException]] under the following conditions:
+     * 1. If the table is not partitioned.
+     * 2. If it is a datasource table.
+     * 3. If it is a view or index table.
+     */
+    if (tab.tableType == VIEW ||
+        tab.tableType == INDEX) {
+      throw new AnalysisException(
+        s"SHOW PARTITIONS is not allowed on a view or index table: ${tab.qualifiedName}")
+    }
+
+    if (!DDLUtils.isTablePartitioned(tab)) {
+      throw new AnalysisException(
+        s"SHOW PARTITIONS is not allowed on a table that is not partitioned: ${tab.qualifiedName}")
+    }
+
+    if (DDLUtils.isDatasourceTable(tab)) {
+      throw new AnalysisException(
+        s"SHOW PARTITIONS is not allowed on a datasource table: ${tab.qualifiedName}")
+    }
+
+    /**
+     * Validate the partitioning spec by making sure all the referenced columns are
+     * defined as partitioning columns in the table definition. An [[AnalysisException]]
+     * is thrown if the partitioning spec is invalid.
+     */
+    if (spec.isDefined) {
+      val badColumns = spec.get.keySet.filterNot(tab.partitionColumns.map(_.name).contains)
+      if (badColumns.nonEmpty) {
+        val badCols = badColumns.mkString("[", ", ", "]")
+        throw new AnalysisException(
+          s"Non-partitioning column(s) $badCols are specified for SHOW PARTITIONS")
+      }
+    }
+
+    val partNames = catalog.listPartitions(table, spec).map { p =>
+      getPartName(p.spec, tab.partitionColumnNames)
+    }
+
+    partNames.map(Row(_))
+  }
+}
+
+case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableCommand {
+  override val output: Seq[Attribute] = Seq(
+    AttributeReference("createtab_stmt", StringType, nullable = false)()
+  )
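The `getPartName` helper above renders one partition as `col=value` pairs, in declared partition-column order, joined by the path separator. A minimal Python sketch of that formatting logic — the `escape_path_name` here is a simplified stand-in for Spark's `PartitioningUtils.escapePathName`, with an illustrative (not exhaustive) set of unsafe characters:

```python
def escape_path_name(s: str) -> str:
    """Simplified stand-in for PartitioningUtils.escapePathName:
    percent-encode characters that are unsafe in partition paths."""
    unsafe = set("\"#%'*/:=?\\{}[]^")
    return "".join(f"%{ord(c):02X}" if c in unsafe else c for c in s)

def part_name(spec: dict, part_col_names: list) -> str:
    """Render a partition spec as 'col1=v1/col2=v2', following the
    declared partition-column order, as getPartName does."""
    return "/".join(
        f"{escape_path_name(name)}={escape_path_name(spec[name])}"
        for name in part_col_names
    )

print(part_name({"month": "1", "year": "2015"}, ["year", "month"]))
# year=2015/month=1
```

Note that the `=` joining each name/value pair is emitted literally, while any `=` occurring *inside* a name or value is escaped, so the rendered path stays unambiguous.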
[GitHub] spark pull request: [SPARK-14127][SQL] Native "DESC [EXTENDED | FO...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12844

[SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command

## What changes were proposed in this pull request?

This PR implements the native `DESC [EXTENDED | FORMATTED] <table>` DDL command. Sample output:

```
scala> spark.sql("desc extended src").show(100, truncate = false)
+----------------------------+----------------------------------+-------+
|col_name                    |data_type                         |comment|
+----------------------------+----------------------------------+-------+
|key                         |int                               |       |
|value                       |string                            |       |
|                            |                                  |       |
|# Detailed Table Information|CatalogTable(`default`.`src`, ... |       |
+----------------------------+----------------------------------+-------+

scala> spark.sql("desc formatted src").show(100, truncate = false)
+----------------------------+----------------------------------------------------------+-------+
|col_name                    |data_type                                                 |comment|
+----------------------------+----------------------------------------------------------+-------+
|key                         |int                                                       |       |
|value                       |string                                                    |       |
|                            |                                                          |       |
|# Detailed Table Information|                                                          |       |
|Database:                   |default                                                   |       |
|Owner:                      |lian                                                      |       |
|Create Time:                |Mon Jan 04 17:06:00 CST 2016                              |       |
|Last Access Time:           |Thu Jan 01 08:00:00 CST 1970                              |       |
|Location:                   |hdfs://localhost:9000/user/hive/warehouse_hive121/src     |       |
|Table Type:                 |MANAGED                                                   |       |
|Table Parameters:           |                                                          |       |
|  transient_lastDdlTime     |1451898360                                                |       |
|                            |                                                          |       |
|# Storage Information       |                                                          |       |
|SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
|InputFormat:                |org.apache.hadoop.mapred.TextInputFormat                  |       |
|OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat|       |
|Num Buckets:                |-1                                                        |       |
|Bucket Columns:             |[]                                                        |       |
|Sort Columns:               |[]                                                        |       |
|Storage Desc Parameters:    |                                                          |       |
|  serialization.format      |1                                                         |       |
+----------------------------+----------------------------------------------------------+-------+
```

## How was this patch tested?

A test case is added to `HiveDDLSuite` to check command output.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14127-desc-table

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12844.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12844

commit 89e0aea04bdeffaa69be0c19f10860d12af90953
Author: Cheng Lian <l...@databricks.com>
Date: 2016-05-02T17:07:32Z

    Implements native 'DESC TABE' DDL command

commit 718da25b489ef56e2b1092484bd6fbcdf2a547ed
Author: Cheng Lian <l...@databricks.com>
Date: 2016-05-02T17:39:24Z

    Test case

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14346] SHOW CREATE TABLE for data sourc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12781#discussion_r61619036

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
@@ -117,101 +112,3 @@ case class ExplainCommand(
       ("Error occurred during query planning: \n" + cause.getMessage).split("\n").map(Row(_))
   }
 }
-
-/**
- * A command to list the column names for a table. This function creates a
- * [[ShowColumnsCommand]] logical plan.
- *
- * The syntax of using this command in SQL is:
- * {{{
- *   SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database];
- * }}}
- */
-case class ShowColumnsCommand(table: TableIdentifier) extends RunnableCommand {
-  // The result of SHOW COLUMNS has one column called 'result'
-  override val output: Seq[Attribute] = {
-    AttributeReference("result", StringType, nullable = false)() :: Nil
-  }
-
-  override def run(sparkSession: SparkSession): Seq[Row] = {
-    sparkSession.sessionState.catalog.getTableMetadata(table).schema.map { c =>
-      Row(c.name)
-    }
-  }
-}
-
-/**
- * A command to list the partition names of a table. If the partition spec is specified,
- * partitions that match the spec are returned. [[AnalysisException]] exception is thrown under
- * the following conditions:
- *
- * 1. If the command is called for a non partitioned table.
- * 2. If the partition spec refers to the columns that are not defined as partitioning columns.
- *
- * This function creates a [[ShowPartitionsCommand]] logical plan
- *
- * The syntax of using this command in SQL is:
- * {{{
- *   SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
- * }}}
- */
-case class ShowPartitionsCommand(
-    table: TableIdentifier,
-    spec: Option[TablePartitionSpec]) extends RunnableCommand {
-  // The result of SHOW PARTITIONS has one column called 'result'
-  override val output: Seq[Attribute] = {
-    AttributeReference("result", StringType, nullable = false)() :: Nil
-  }
-
-  private def getPartName(spec: TablePartitionSpec, partColNames: Seq[String]): String = {
-    partColNames.map { name =>
-      PartitioningUtils.escapePathName(name) + "=" + PartitioningUtils.escapePathName(spec(name))
-    }.mkString(File.separator)
-  }
-
-  override def run(sparkSession: SparkSession): Seq[Row] = {
-    val catalog = sparkSession.sessionState.catalog
-    val db = table.database.getOrElse(catalog.getCurrentDatabase)
-    if (catalog.isTemporaryTable(table)) {
-      throw new AnalysisException("SHOW PARTITIONS is not allowed on a temporary table: " +
-        s"${table.unquotedString}")
-    } else {
-      val tab = catalog.getTableMetadata(table)
-      /**
-       * Validate and throws an [[AnalysisException]] exception under the following conditions:
-       * 1. If the table is not partitioned.
-       * 2. If it is a datasource table.
-       * 3. If it is a view or index table.
-       */
-      if (tab.tableType == CatalogTableType.VIEW ||
-          tab.tableType == CatalogTableType.INDEX) {
-        throw new AnalysisException("SHOW PARTITIONS is not allowed on a view or index table: " +
-          s"${tab.qualifiedName}")
-      }
-      if (!DDLUtils.isTablePartitioned(tab)) {
-        throw new AnalysisException("SHOW PARTITIONS is not allowed on a table that is not " +
-          s"partitioned: ${tab.qualifiedName}")
-      }
-      if (DDLUtils.isDatasourceTable(tab)) {
-        throw new AnalysisException("SHOW PARTITIONS is not allowed on a datasource table: " +
-          s"${tab.qualifiedName}")
-      }
-      /**
-       * Validate the partitioning spec by making sure all the referenced columns are
-       * defined as partitioning columns in table definition. An AnalysisException exception is
-       * thrown if the partitioning spec is invalid.
-       */
-      if (spec.isDefined) {
-        val badColumns = spec.get.keySet.filterNot(tab.partitionColumns.map(_.name).contains)
-        if (badColumns.nonEmpty) {
-          throw new AnalysisException(
-            s"Non-partitioning column(s) [${badColumns.mkString(", ")}] are " +
-              s"specified for SHOW PARTITIONS")
-        }
-      }
-      val partNames =
-        catalog.listPartitions(table, spec).map(p => getPartName(p.spec, tab.partitionColumnNames))
-      partNames.map { p => Row(p) }
-    }
-  }
-}
--- End diff --
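The partition-spec validation quoted above reduces to a set difference between the spec's keys and the table's declared partition columns. A hedged Python sketch of just that check — `AnalysisException` here is a local stand-in for Spark's exception type:

```python
class AnalysisException(Exception):
    """Local stand-in for Spark's AnalysisException."""
    pass

def check_partition_spec(spec: dict, partition_columns: list) -> None:
    """Reject any spec key that is not a declared partition column,
    mirroring the filterNot/contains check in ShowPartitionsCommand."""
    bad_columns = [k for k in spec if k not in partition_columns]
    if bad_columns:
        bad = "[" + ", ".join(bad_columns) + "]"
        raise AnalysisException(
            f"Non-partitioning column(s) {bad} are specified for SHOW PARTITIONS")

check_partition_spec({"year": "2016"}, ["year", "month"])  # passes silently
try:
    check_partition_spec({"qty": "1"}, ["year", "month"])
except AnalysisException as e:
    print(e)  # Non-partitioning column(s) [qty] are specified for SHOW PARTITIONS
```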
[GitHub] spark pull request: [SPARK-14346] SHOW CREATE TABLE for data sourc...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12781

[SPARK-14346] SHOW CREATE TABLE for data source tables

## What changes were proposed in this pull request?

This PR adds a native `SHOW CREATE TABLE` DDL command for data source tables. Support for Hive tables will be added in follow-up PR(s).

To show table creation DDL for data source tables created by CTAS statements, this PR also adds partitioning and bucketing support to the normal `CREATE TABLE ... USING ...` syntax.

## How was this patch tested?

A new test suite `ShowCreateTableSuite` is added in the sql/hive package to test the new feature.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14346-show-create-table

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12781.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12781

commit bc7354026f728f06f8eaebfbc3bf05a19f2ebcd9
Author: Cheng Lian <l...@databricks.com>
Date: 2016-04-28T06:38:19Z

    Moves table commands to tables.scala

commit 092d9423d982509ffc9e490618a8bf784518c854
Author: Cheng Lian <l...@databricks.com>
Date: 2016-04-29T05:42:05Z

    SHOW CREATE TABLE for data source tables
[GitHub] spark pull request: [SPARK-14981][SQL] Throws exception if DESC is...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12759#issuecomment-215628056

retest this please
[GitHub] spark pull request: [SPARK-14981][SQL] Throws exception if DESC is...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12759

[SPARK-14981][SQL] Throws exception if DESC is specified for sorting columns

## What changes were proposed in this pull request?

Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops the sorting direction. This PR fixes this by throwing an exception if `DESC` is specified as the sorting direction of a sorting column.

## How was this patch tested?

A test case is added that checks the exception message for the invalid sorting order.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14981

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12759.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12759

commit 5a3aa2c55a3d7a600887e82b9663b52a366a4842
Author: Cheng Lian <l...@databricks.com>
Date: 2016-04-28T17:46:16Z

    Throws exception if DESC is specified for sorting columns
[GitHub] spark pull request: [SPARK-14346][SQL] Add PARTITIONED BY and CLUS...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12734#issuecomment-215310529

@jodersky Oh sorry, pasted the JIRA ticket summary to the PR title but forgot to add the tags. Updated!
[GitHub] spark pull request: [SPARK-14124] [SQL] [FOLLOWUP] Implement Datab...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12081#discussion_r61283810

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -693,7 +697,6 @@ class SessionCatalog(
       dropDatabase(db, ignoreIfNotExists = false, cascade = true)
     }
     tempTables.clear()
-    functionRegistry.clear()
--- End diff --

Is this related to the `current_database()` issue you and @yhuai discussed previously?
[GitHub] spark pull request: [SPARK-14124] [SQL] [FOLLOWUP] Implement Datab...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12081#discussion_r61282206

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -118,7 +148,7 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
     assert(db1 == CatalogDatabase(
       dbNameWithoutBackTicks,
       "",
-      System.getProperty("java.io.tmpdir") + File.separator + s"$dbNameWithoutBackTicks.db",
+      appendTrailingSlash(System.getProperty("java.io.tmpdir")) + s"$dbNameWithoutBackTicks.db",
--- End diff --

According to the PR description, we should append `File.separator` rather than a slash, right?
[GitHub] spark pull request: [SPARK-14124] [SQL] [FOLLOWUP] Implement Datab...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12081#discussion_r61278961

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -693,7 +697,6 @@ class SessionCatalog(
       dropDatabase(db, ignoreIfNotExists = false, cascade = true)
     }
     tempTables.clear()
-    functionRegistry.clear()
--- End diff --

Why do we remove this?
[GitHub] spark pull request: [SPARK-14127][SQL][WIP] Describe table
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12460#issuecomment-215115938

@dilipbiswal One purpose of re-implementing all DDL statements as native Spark SQL commands is to minimize dependencies on Hive, so that we can move Hive into a separate data source some day. That said, we really don't want these new DDL commands to rely on classes like `HiveClient`, `HiveClientImpl`, or `HiveSessionCatalog`. When you need to access Hive table metadata, you should access it via `CatalogTable` rather than depending on any Hive data structure.
[GitHub] spark pull request: Add PARTITION BY and BUCKET BY clause for "CRE...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12734

Add PARTITION BY and BUCKET BY clause for "CREATE TABLE ... USING ..." syntax

## What changes were proposed in this pull request?

Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API, but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:

```sql
CREATE TABLE <table> USING <provider>
[OPTIONS (<key1> <value1>, <key2> <value2>, ...)]
[PARTITIONED BY (col1, col2, ...)]
[CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS]
```

## How was this patch tested?

Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14954

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12734.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12734

commit f51300c6ed22a54ff1dc49262cd046774166d957
Author: Cheng Lian <l...@databricks.com>
Date: 2016-04-27T13:06:53Z

    Add PARTITION BY and BUCKET BY clause for "CREATE TABLE ... USING ..." syntax

commit af973d64cf3e1079e6c8a185d826e2e43cb06532
Author: Cheng Lian <l...@databricks.com>
Date: 2016-04-27T13:33:13Z

    Moves test case to MetastoreDataSourcesSuite

    Also checks for metastore table properties
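The `CLUSTERED BY ... INTO ... BUCKETS` clause distributes rows across a fixed number of bucket files by hashing the clustering columns. The Python sketch below illustrates the idea only; CRC32 is used here to keep the example deterministic, whereas Spark uses its own internal hash function, so real bucket assignments differ:

```python
import zlib

def bucket_id(clustering_values: tuple, num_buckets: int) -> int:
    """Assign a row to a bucket from its clustering-column values.
    CRC32 is only an illustrative, deterministic choice here; Spark's
    bucketing uses its own hash, so actual assignments differ."""
    key = "\x00".join(str(v) for v in clustering_values).encode()
    return zlib.crc32(key) % num_buckets

# Rows with equal clustering values always land in the same bucket.
rows = [("alice", 3), ("bob", 7), ("alice", 9)]
buckets = [bucket_id((user,), 4) for user, _ in rows]
assert buckets[0] == buckets[2]  # same user -> same bucket
```

Because two tables bucketed the same way agree on bucket assignment, a join on the clustering columns can match bucket files pairwise instead of shuffling all rows, which is the main payoff of persisting bucketed tables.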
[GitHub] spark pull request: [SPARK-14445][SQL] Supports native execution o...
Github user liancheng closed the pull request at: https://github.com/apache/spark/pull/12703
[GitHub] spark pull request: [SPARK-14445][SQL] Supports native execution o...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12703#issuecomment-214938531

@dilipbiswal also updated #1 and passed Jenkins. So I merged it. Closing this one.
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/1#issuecomment-214938170

@dilipbiswal Great! LGTM now, I'm merging this one to master. Thanks!
[GitHub] spark pull request: [SPARK-14445][SQL] Supports native execution o...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12703#discussion_r61185701

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala ---
@@ -247,4 +259,105 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
     hiveContext.sessionState.hadoopConf.set("fs.default.name", originalFsName)
   }
 }
+
+  test("show columns") {
+    checkAnswer(
+      sql("SHOW COLUMNS IN parquet_tab3"),
+      Row("col1") :: Row("col 2") :: Nil)
+
+    checkAnswer(
+      sql("SHOW COLUMNS IN default.parquet_tab3"),
+      Row("col1") :: Row("col 2") :: Nil)
+
+    checkAnswer(
+      sql("SHOW COLUMNS IN parquet_tab3 FROM default"),
+      Row("col1") :: Row("col 2") :: Nil)
+
+    checkAnswer(
+      sql("SHOW COLUMNS IN parquet_tab4 IN default"),
+      Row("price") :: Row("qty") :: Row("year") :: Row("month") :: Nil)
+
+    val message = intercept[NoSuchTableException] {
+      sql("SHOW COLUMNS IN badtable FROM default")
+    }.getMessage
+    assert(message.contains("badtable not found in database"))
+  }
+
+  test("show partitions - show everything") {
+    checkAnswer(
+      sql("show partitions parquet_tab4"),
+      Row("year=2015/month=1") ::
+        Row("year=2015/month=2") ::
+        Row("year=2016/month=2") ::
+        Row("year=2016/month=3") :: Nil)
+
+    checkAnswer(
+      sql("show partitions default.parquet_tab4"),
+      Row("year=2015/month=1") ::
+        Row("year=2015/month=2") ::
+        Row("year=2016/month=2") ::
+        Row("year=2016/month=3") :: Nil)
+  }
+
--- End diff --

Oh yea, I should add that. Thanks!
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/1#issuecomment-214803024

Hi @dilipbiswal, we did a bunch of major refactoring on the master branch, and it's pretty close to the 2.0 code freeze, so I took it over based on your version and opened PR #12703. Would you mind having a look at it? Thanks.
[GitHub] spark pull request: [SPARK-14445][SQL] Supports native execution o...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12703

[SPARK-14445][SQL] Supports native execution of SHOW COLUMNS and SHOW PARTITIONS

(This PR is mostly a rebased version of PR #1 by @dilipbiswal with all pending PR review comments addressed. The committer who merges this PR, please attribute it to @dilipbiswal.)

## What changes were proposed in this pull request?

This PR adds native execution of the `SHOW COLUMNS` and `SHOW PARTITIONS` commands.

Command syntax:

```sql
SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database]
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
```

## How was this patch tested?

Added test cases in `HiveCommandSuite` to verify execution and in `DDLCommandSuite` to verify plans.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14445-show-columns-partitions

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12703.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12703

commit b0edb3dd09bac2d8c44235f91bc360ffa831e074
Author: Dilip Biswal <dbis...@us.ibm.com>
Date: 2016-04-01T17:25:41Z

    [SPARK-14445] Support native execution of SHOW COLUMNS and SHOW PARTITIONS

commit a13db610805f9ec7817370d393a16260bfabd543
Author: Dilip Biswal <dbis...@us.ibm.com>
Date: 2016-04-14T22:14:18Z

    Andrew's comments

commit c0985a8a326e6c4af13bad9fb02e7b8f2b2171d8
Author: Dilip Biswal <dbis...@us.ibm.com>
Date: 2016-04-15T06:19:20Z

    test fix

commit 6cf6d954a799b9690a7f4a1b4a2a706207ea03e4
Author: Cheng Lian <l...@databricks.com>
Date: 2016-04-26T15:26:08Z

    Fixes issues introduced while rebasing
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r61011875

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -424,7 +424,7 @@ private[sql] object PartitioningUtils {
     path.foreach { c =>
       if (needsEscaping(c)) {
         builder.append('%')
-        builder.append(f"${c.asInstanceOf[Int]}%02x")
+        builder.append(f"${c.asInstanceOf[Int]}%02X")
--- End diff --

Makes sense. Thanks for the explanation.
[GitHub] spark pull request: [SPARK-14346][SQL] Show Create Table (Native)
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12579#issuecomment-214419087

test this please
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60937575

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala ---
@@ -36,12 +38,22 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
          |STORED AS PARQUET
          |TBLPROPERTIES('prop1Key'="prop1Val", '`prop2Key`'="prop2Val")
        """.stripMargin)
+    sql("CREATE TABLE parquet_tab3(col1 int, `col 2` int)")
+    sql("CREATE TABLE parquet_tab4 (price int, qty int) partitioned by (year int, month int)")
+    sql("INSERT INTO parquet_tab4 PARTITION(year = 2015, month=1) SELECT 1,1")
+    sql("INSERT INTO parquet_tab4 PARTITION(year = 2015, month=2) SELECT 2,2")
+    sql("INSERT INTO parquet_tab4 PARTITION(year = 2016, month=2) SELECT 3,3")
+    sql("INSERT INTO parquet_tab4 PARTITION(year = 2016, month=3) SELECT 3,3")
--- End diff --

Nit: Spaces around `=`
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60937498

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala ---
@@ -17,11 +17,13 @@
 package org.apache.spark.sql.hive.execution

-import org.apache.spark.sql.{AnalysisException, QueryTest, Row}
-import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.{AnalysisException, QueryTest, Row, SaveMode}
+import org.apache.spark.sql.catalyst.analysis.NoSuchTableException
+import org.apache.spark.sql.hive.test.{TestHive, TestHiveSingleton}
 import org.apache.spark.sql.test.SQLTestUtils

 class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
+  import testImplicits._

   protected override def beforeAll(): Unit = {
--- End diff --

Please help fix this indentation, thanks!
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60934317

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -424,7 +424,7 @@ private[sql] object PartitioningUtils {
     path.foreach { c =>
       if (needsEscaping(c)) {
         builder.append('%')
-        builder.append(f"${c.asInstanceOf[Int]}%02x")
+        builder.append(f"${c.asInstanceOf[Int]}%02X")
--- End diff --

Why is this change necessary?
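For context, the `%02x` vs `%02X` format specifier only changes the case of the hex digits emitted in escape sequences. The following is a minimal, self-contained sketch (a hypothetical `escapePathName` with an illustrative escape set, not Spark's exact implementation) that makes the difference visible:

```scala
// Minimal sketch of Hive-style path escaping. The escape set below is
// illustrative only; Spark's needsEscaping covers a different, larger set.
val charsToEscape: Set[Char] = "\"#%'*/:=?\\{}[]^".toSet

def escapePathName(path: String, upperCaseHex: Boolean): String = {
  val builder = new StringBuilder
  path.foreach { c =>
    if (charsToEscape.contains(c) || c < ' ') {
      // Percent-encode: '%' followed by the two-digit hex code of the char.
      builder.append('%')
      builder.append(if (upperCaseHex) f"${c.toInt}%02X" else f"${c.toInt}%02x")
    } else {
      builder.append(c)
    }
  }
  builder.toString
}

println(escapePathName("a=b", upperCaseHex = false))  // a%3db
println(escapePathName("a=b", upperCaseHex = true))   // a%3Db
```

Uppercase hex matches what Hive itself writes into partition directory names, which is presumably the motivation being asked about here.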
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60934120

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
@@ -423,6 +426,102 @@ case class ShowTablePropertiesCommand(
 }

 /**
+ * A command for users to list the column names for a table. This function creates a
+ * [[ShowColumnsCommand]] logical plan.
+ *
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database];
+ * }}}
+ */
+case class ShowColumnsCommand(table: TableIdentifier) extends RunnableCommand {
+  // The result of SHOW COLUMNS has one column called 'result'
+  override val output: Seq[Attribute] = {
+    AttributeReference("result", StringType, nullable = false)() :: Nil
+  }
+
+  override def run(sqlContext: SQLContext): Seq[Row] = {
+    sqlContext.sessionState.catalog.getTableMetadata(table).schema.map { c =>
+      Row(c.name)
+    }
+  }
+}
+
+/**
+ * A command for users to list the partition names of a table. If the partition spec is specified,
+ * partitions that match the spec are returned. [[AnalysisException]] exception is thrown under
+ * the following conditions:
+ *
+ * 1. If the command is called for a non partitioned table.
+ * 2. If the partition spec refers to the columns that are not defined as partitioning columns.
+ *
+ * This function creates a [[ShowPartitionsCommand]] logical plan
+ *
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
+ * }}}
+ */
+case class ShowPartitionsCommand(
+    table: TableIdentifier,
+    spec: Option[TablePartitionSpec]) extends RunnableCommand {
+  // The result of SHOW PARTITIONS has one column called 'result'
+  override val output: Seq[Attribute] = {
+    AttributeReference("result", StringType, nullable = false)() :: Nil
+  }
+
+  def getPartName(spec: TablePartitionSpec): String = {
+    spec.map {s =>
+      PartitioningUtils.escapePathName(s._1) + "=" + PartitioningUtils.escapePathName(s._2)
+    }.mkString("/")
+  }
+  override def run(sqlContext: SQLContext): Seq[Row] = {
+    val catalog = sqlContext.sessionState.catalog
+    val db = table.database.getOrElse(catalog.getCurrentDatabase)
+    if (catalog.isTemporaryTable(table)) {
+      throw new AnalysisException("SHOW PARTITIONS is not allowed on a temporary table: " +
+        s"${table.unquotedString}")
+    } else {
+      val tab = catalog.getTableMetadata(table)
+      /**
+       * Validate and throws an [[AnalysisException]] exception under the following conditions:
+       * 1. If the table is not partitioned.
+       * 2. If it is a datasource table.
+       * 3. If it is a view or index table.
+       */
+      if (tab.tableType == CatalogTableType.VIRTUAL_VIEW ||
+        tab.tableType == CatalogTableType.INDEX_TABLE) {
+        throw new AnalysisException("SHOW PARTITIONS is not allowed on a view or index table: " +
+          s"${tab.qualifiedName}")
+      }
+      if (!DDLUtils.isTablePartitioned(tab)) {
+        throw new AnalysisException("SHOW PARTITIONS is not allowed on a table that is not " +
+          s"partitioned: ${tab.qualifiedName}")
+      }
+      if (DDLUtils.isDatasourceTable(tab)) {
+        throw new AnalysisException("SHOW PARTITIONS is not allowed on a datasource table: " +
+          s"${tab.qualifiedName}")
+      }
+      /**
+       * Validate the partitioning spec by making sure all the referenced columns are
+       * defined as partitioning columns in table definition. An AnalysisException exception is
+       * thrown if the partitioning spec is invalid.
+       */
+      if (spec.isDefined) {
+        val badColumns = spec.get.keySet.filterNot(tab.partitionColumns.map(_.name).contains)
+        if (badColumns.nonEmpty) {
+          throw new AnalysisException(
+            s"Non-partitioned column(s) [${badColumns.mkString(", ")}] are " +
--- End diff --

Nit: Non-partitioning
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60931135

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala ---
@@ -122,4 +134,105 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
     checkAnswer(sql("SHOW TBLPROPERTIES parquet_temp"), Nil)
   }
 }
+
+  test("show columns") {
+    checkAnswer(
+      sql("SHOW COLUMNS IN parquet_tab3"),
+      Row("col1") :: Row("col 2") :: Nil)
+
+    checkAnswer(
+      sql("SHOW COLUMNS IN default.parquet_tab3"),
+      Row("col1") :: Row("col 2") :: Nil)
+
+    checkAnswer(
+      sql("SHOW COLUMNS IN parquet_tab3 FROM default"),
+      Row("col1") :: Row("col 2") :: Nil)
+
+    checkAnswer(
+      sql("SHOW COLUMNS IN parquet_tab4 IN default"),
+      Row("price") :: Row("qty") :: Row("year") :: Row("month") :: Nil)
+
+    val message = intercept[NoSuchTableException] {
+      sql("SHOW COLUMNS IN badtable FROM default")
+    }.getMessage
+    assert(message.contains("badtable not found in database"))
+  }
+
+  test("show partitions - show everything") {
+    checkAnswer(
+      sql("show partitions parquet_tab4"),
+      Row("year=2015/month=1") ::
+        Row("year=2015/month=2") ::
+        Row("year=2016/month=2") ::
+        Row("year=2016/month=3") :: Nil)
--- End diff --

As a simple experiment, order is preserved for maps with fewer than 5 elements:

```scala
scala> Map("year" -> 1) foreach println
(year,1)

scala> Map("year" -> 1, "month" -> 2) foreach println
(year,1)
(month,2)

scala> Map("year" -> 1, "month" -> 2, "day" -> 3) foreach println
(year,1)
(month,2)
(day,3)

scala> Map("year" -> 1, "month" -> 2, "day" -> 3, "hour" -> 4) foreach println
(year,1)
(month,2)
(day,3)
(hour,4)

scala> Map("year" -> 1, "month" -> 2, "day" -> 3, "hour" -> 4, "minute" -> 5) foreach println
(minute,5)
(year,1)
(hour,4)
(day,3)
(month,2)
```
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60930580

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala ---
+  test("show partitions - show everything") {
+    checkAnswer(
+      sql("show partitions parquet_tab4"),
+      Row("year=2015/month=1") ::
+        Row("year=2015/month=2") ::
+        Row("year=2016/month=2") ::
+        Row("year=2016/month=3") :: Nil)
--- End diff --

Could you please add test cases where the test table contains >= 5 partition columns? I believe the Scala standard library provides specialized classes for maps containing <= 4 elements, which masks the out-of-order bug.
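The claim about specialized map classes can be checked in isolation. The sketch below (plain Scala, no Spark) shows why an ordering test needs at least five partition columns: up to four entries, `scala.collection.immutable.Map` uses specialized `Map1` through `Map4` classes that happen to iterate in insertion order; from five entries on it switches to a hash-based map whose iteration order is unrelated to insertion order.

```scala
// Up to 4 entries: specialized Map1..Map4, iteration follows insertion order.
val four = Map("year" -> 1, "month" -> 2, "day" -> 3, "hour" -> 4)
println(four.keys.toList)  // List(year, month, day, hour)

// 5+ entries: a hash-based Map; iteration order becomes an implementation
// detail of the hashing, not the insertion order.
val five = four + ("minute" -> 5)
println(five.keys.toList)  // some hash-determined order

// This is why an order-dropping bug in partition-name rendering is
// accidentally masked by tables with 4 or fewer partition columns.
```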
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60928725

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+  def getPartName(spec: TablePartitionSpec): String = {
+    spec.map {s =>
+      PartitioningUtils.escapePathName(s._1) + "=" + PartitioningUtils.escapePathName(s._2)
+    }.mkString("/")
--- End diff --

@rxin I'm thinking it might not be a good idea to use `Map` to represent the partition spec in `CatalogTablePartition`, since it doesn't preserve partition column order. What do you think?
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60927623

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+  def getPartName(spec: TablePartitionSpec): String = {
+    spec.map {s =>
+      PartitioningUtils.escapePathName(s._1) + "=" + PartitioningUtils.escapePathName(s._2)
+    }.mkString("/")
+  }
+  override def run(sqlContext: SQLContext): Seq[Row] = {
--- End diff --

New line here please.
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60927757

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+    if (catalog.isTemporaryTable(table)) {
+      throw new AnalysisException("SHOW PARTITIONS is not allowed on a temporary table: " +
+          s"${table.unquotedString}")
--- End diff --

Wrong indentation here.
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60927505

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+  def getPartName(spec: TablePartitionSpec): String = {
--- End diff --

Mark this as private.
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60927399

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+  def getPartName(spec: TablePartitionSpec): String = {
+    spec.map {s =>
+      PartitioningUtils.escapePathName(s._1) + "=" + PartitioningUtils.escapePathName(s._2)
+    }.mkString("/")
--- End diff --

I don't think we can simply join all parts with `/` here, since `TablePartitionSpec` is a `Map` rather than a `Seq`: the order of partition columns is not preserved.
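One order-preserving alternative, sketched below under the assumption that the table's partition columns are available as an ordered `Seq[String]` (this mirrors the shape the command later took; path escaping is omitted to keep the sketch self-contained): iterate the ordered column names and look each one up in the spec `Map`, so the `Map`'s own iteration order never matters.

```scala
// spec:         unordered partition spec, e.g. Map("month" -> "1", "year" -> "2015")
// partColNames: partition columns in table-definition order, e.g. Seq("year", "month")
def getPartName(spec: Map[String, String], partColNames: Seq[String]): String =
  partColNames
    .map(name => s"$name=${spec(name)}")  // real code would escape both sides
    .mkString("/")

println(getPartName(Map("month" -> "1", "year" -> "2015"), Seq("year", "month")))
// year=2015/month=1
```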
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60926964

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+  def getPartName(spec: TablePartitionSpec): String = {
+    spec.map {s =>
--- End diff --

Space after `{`
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60926196

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database];
+ * }}}
--- End diff --

We don't need to mention the SQL syntax here.
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60925953

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+ * A command for users to list the column names for a table. This function creates a
+ * [[ShowColumnsCommand]] logical plan.
--- End diff --

"A command to list column names of a given table."
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60923441

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala ---
@@ -277,9 +277,18 @@ class InMemoryCatalog extends ExternalCatalog {
     catalog(db).tables(table).partitions(spec)
   }

+  /**
+   * List the metadata of all partitions that belong to the specified table, assuming it exists.
+   *
+   * A partial partition spec may optionally be provided to filter the partitions returned.
+   * For instance, if there exist partitions (a='1', b='2'), (a='1', b='3') and (a='2', b='4'),
+   * then a partial spec of (a='1') will return the first two only.
+   * TODO: Currently partialSpec is not used for memory catalog and it returns all the partitions.
--- End diff --

Can we throw an exception instead of returning all the partitions if a partition spec is given to an in-memory catalog? Silently returning a wrong answer can be dangerous.
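The fail-fast behavior suggested above can be sketched as follows (a hypothetical standalone helper with simplified types, not the actual `InMemoryCatalog` method signature): reject a partial spec outright instead of ignoring it, so callers cannot mistake the unfiltered result for a filtered one.

```scala
// allPartitions: every partition of the table, each as its spec Map.
// partialSpec:   an optional partial partition spec to filter by.
def listPartitionNames(
    allPartitions: Seq[Map[String, String]],
    partialSpec: Option[Map[String, String]]): Seq[Map[String, String]] =
  partialSpec match {
    case Some(_) =>
      // Fail fast: this catalog cannot filter by a partial spec yet, and
      // silently returning every partition would be a wrong answer.
      throw new UnsupportedOperationException(
        "listPartitions with a partial partition spec is not supported")
    case None => allPartitions
  }
```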
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60923122

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -427,13 +427,18 @@ class SessionCatalog(
   }

   /**
-   * List all partitions in a table, assuming it exists.
-   * If no database is specified, assume the table is in the current database.
+   * List the metadata of all partitions that belong to the specified table, assuming it exists.
+   *
+   * A partial partition spec may optionally be provided to filter the partitions returned.
+   * For instance, if there exist partitions (a='1', b='2'), (a='1', b='3') and (a='2', b='4'),
+   * then a partial spec of (a='1') will return the first two only.
    */
-  def listPartitions(tableName: TableIdentifier): Seq[CatalogTablePartition] = {
+  def listPartitions(
--- End diff --

Nit: Missing `override`
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60923039 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -427,13 +427,18 @@ class SessionCatalog( } /** - * List all partitions in a table, assuming it exists. - * If no database is specified, assume the table is in the current database. + * List the metadata of all partitions that belong to the specified table, assuming it exists. + * + * A partial partition spec may optionally be provided to filter the partitions returned. + * For instance, if there exist partitions (a='1', b='2'), (a='1', b='3') and (a='2', b='4'), + * then a partial spec of (a='1') will return the first two only. --- End diff -- Same as above.
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60922965 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala --- @@ -277,9 +277,18 @@ class InMemoryCatalog extends ExternalCatalog { catalog(db).tables(table).partitions(spec) } + /** + * List the metadata of all partitions that belong to the specified table, assuming it exists. + * + * A partial partition spec may optionally be provided to filter the partitions returned. + * For instance, if there exist partitions (a='1', b='2'), (a='1', b='3') and (a='2', b='4'), + * then a partial spec of (a='1') will return the first two only. --- End diff -- Nit: No need to repeat the comment here since it's automatically inherited from the parent class/trait.
[GitHub] spark pull request: [SPARK-14875][SQL] Makes OutputWriterFactory.n...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12652#issuecomment-214290243 Merging to master.
[GitHub] spark pull request: [SPARK-14875][SQL] Makes OutputWriterFactory.n...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12652 [SPARK-14875][SQL] Makes OutputWriterFactory.newInstance public ## What changes were proposed in this pull request? This method was accidentally made `private[sql]` in Spark 2.0. This PR makes it public again. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-14875 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12652.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12652 commit 35be32f702f5d3f3a68ceb0d01dbb5fb2e1adb41 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-23T14:56:51Z Makes OutputWriterFactory.newInstance public
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-213659267 LGTM pending Jenkins.
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-213658964 add to whitelist
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-213658842 test this please
[GitHub] spark pull request: [SPARK-14843][ML] Fix encoding error in LibSVM...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12611#issuecomment-213514108 LGTM. Merging to master, thanks!
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12527#discussion_r60695719 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala --- @@ -131,4 +134,23 @@ class FileScanRDD( } override protected def getPartitions: Array[RDDPartition] = filePartitions.toArray + + override protected def getPreferredLocations(split: RDDPartition): Seq[String] = { +val files = split.asInstanceOf[FilePartition].files + +// Computes total number of bytes can be retrieved from each host. +val hostToNumBytes = mutable.HashMap.empty[String, Long] +files.foreach { file => + file.locations.filter(_ != "localhost").foreach { host => --- End diff -- We should filter them out. A partition that doesn't have any preferred locations can be bundled with any other tasks and scheduled to any executor. But once it's marked with "localhost", delay scheduling may be triggered because its tasks have different host names than the other tasks. Furthermore, "localhost" isn't a valid location for the `DAGScheduler` when deciding which executors should run the tasks.
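The filtering discussed above can be sketched in isolation. `FileBlock` below is a hypothetical stand-in for `PartitionedFile` with per-host byte counts; the real logic lives in `FileScanRDD.getPreferredLocations`:

```scala
import scala.collection.mutable

object PreferredLocationsSketch {
  // Hypothetical stand-in for a file slice, carrying (host, bytes served) pairs.
  case class FileBlock(hostToBytes: Seq[(String, Long)])

  def preferredLocations(files: Seq[FileBlock]): Seq[String] = {
    // Total number of bytes that can be read from each host. "localhost" is
    // dropped because it is meaningless to the DAGScheduler and would trigger
    // delay scheduling against tasks that carry real host names.
    val hostToNumBytes = mutable.HashMap.empty[String, Long]
    for (file <- files; (host, bytes) <- file.hostToBytes if host != "localhost") {
      hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) + bytes
    }
    // Keep the (up to) three hosts serving the most data.
    hostToNumBytes.toSeq.sortBy(-_._2).take(3).map(_._1)
  }
}
```

A partition whose only location was "localhost" ends up with no preferred locations at all, which lets the scheduler place it anywhere immediately.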
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213167317 retest this please
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213034425 The root cause of the deadlock has been found. Essentially, we should prevent "localhost" from being returned as a `FileScanRDD` preferred location. Here's a detailed description of the whole process: 1. The test case involves a shuffle, which results in a `ShuffleRowRDD`, whose preferred locations are determined by the locations of the block manager that serves the corresponding map output blocks. In the case of a local test, the only preferred location string is the IP of the block manager. 2. In the case of local testing, `FileScanRDD.preferredLocations` always returns "localhost". 3. As a result, task set `ts1` derived from the `ShuffleRowRDD` and task set `ts2` derived from the `FileScanRDD` have different locality preferences. 4. After job submission, `DAGScheduler` first schedules `ts1`. While trying to schedule `ts2`, delay scheduling is triggered because `ts1` and `ts2` have different preferred locations. By default, `DAGScheduler` waits for 3s before trying `ts2` again. 5. 3s is long enough for all tasks in `ts1` to finish. However, `LocalBackend` doesn't revive offers periodically like other scheduler backends. It only revives offers when tasks are submitted, finish, or fail. Thus `ts2` never gets an opportunity to be scheduled again, and the submitted job never finishes. The only factor that is still unclear is how the number of buckets (which affects the number of submitted tasks) interacts with the above process. The fix for this issue is simple: just filter out all "localhost" entries in `FileScanRDD.preferredLocations()`, since "localhost" doesn't make sense as a preferred executor location. Actually, this is exactly the last step of what `NewHadoopRDD.preferredLocations()` does.
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-212696111 Yea, agree. Should either prove this is an existing bug (so that it can be fixed in another PR), or fix it if it's a bug introduced by this change.
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-212425913 @yhuai This one can probably avoid the deadlock since I worked around it [here][1]. Also check [this comment][2] for more information. Not sure whether we should merge this one since I haven't figured out whether the deadlock is caused by bug(s) in this PR or in `DAGScheduler`. [1]: https://github.com/apache/spark/pull/12527/files#diff-75164dfeac2c1507ae828d2ba5529470R304 [2]: https://github.com/apache/spark/pull/12153#issuecomment-212424785
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212424940 I'm closing this one for #12527, which is a rebased version of this one.
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng closed the pull request at: https://github.com/apache/spark/pull/12153
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212424785 All the timeouts in the Jenkins builds were due to a deadlock in `DAGScheduler`, and can be reliably reproduced locally by running the following test case: > BucketedReadSuite.only shuffle one side when 2 bucketed tables have different bucket keys This test case creates two bucketed tables, both with 8 buckets, and then joins them. Reducing 8 to 5 eliminates the deadlock, but I haven't figured out the real reason behind the deadlock. The deadlock also disappears if I remove `FileScanRDD.preferredLocations()`. Maybe too many tasks were scheduled to the same place and exhausted some thread pool?
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12527 [SPARK-14369][SQL] Locality support for FileScanRDD (This PR is a rebased version of PR #12153.) ## What changes were proposed in this pull request? This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts: 1. Block location lookup Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog` and converts all `FileStatus`es into `LocatedFileStatus`es. Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either. So there's no need to special case these file systems. 2. Selecting preferred locations For each `FilePartition`, we pick the top 3 locations containing the most data to be retrieved. This isn't necessarily the best algorithm out there. Further improvements may be brought up in follow-up PRs. ## How was this patch tested? Tested by overriding the default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-14369-locality-rebased Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12527.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12527 commit 0358e33c1a796b7875f932c4d4aee28db505f696 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-20T13:25:51Z Rebased PR #12153
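Step 1 of the strategy above (block location lookup) can be sketched against a stubbed file-system interface. All types below are simplified stand-ins; the real code goes through Hadoop's `FileSystem` API and builds `LocatedFileStatus` instances during file listing:

```scala
object BlockLocationSketch {
  // Simplified stand-ins for the Hadoop types.
  case class BlockLocation(hosts: Seq[String], offset: Long, length: Long)
  case class FileStatus(path: String, len: Long)
  case class LocatedFileStatus(status: FileStatus, blocks: Seq[BlockLocation])

  // Stub for the piece of the file-system API we need.
  trait StubFileSystem {
    def getFileBlockLocations(status: FileStatus, start: Long, len: Long): Seq[BlockLocation]
  }

  // Upgrade plain FileStatuses to LocatedFileStatuses in one listing pass, so
  // later planning can read block locations without extra NameNode round trips.
  def locateAll(fs: StubFileSystem, statuses: Seq[FileStatus]): Seq[LocatedFileStatus] =
    statuses.map(s => LocatedFileStatus(s, fs.getFileBlockLocations(s, 0, s.len)))
}
```

A mocked `StubFileSystem` returning fixed block locations is enough to exercise the lookup path, which mirrors how the PR itself was tested (a mocked `FileSystem` for `file:///`).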
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-212323419 @rajeshbalamohan Sorry for the late review and thanks for working on this! My major concern is that the newly added `OrcRecordReader` should live in spark-hive rather than spark-sql. Otherwise it looks good except for a few styling issues.
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r60368033 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala --- @@ -337,10 +340,12 @@ private[orc] case class OrcTableScan( rdd.mapPartitionsWithInputSplit { case (split: OrcSplit, iterator) => val writableIterator = iterator.map(_._2) + val maybeStructOI = OrcFileOperator.getObjectInspector( +split.getPath.toString, Some(conf)) --- End diff -- Nit: No need to wrap here.
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r60367904 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala --- @@ -160,19 +160,22 @@ private[sql] class DefaultSource val job = Job.getInstance(conf) FileInputFormat.setInputPaths(job, file.filePath) - val inputFormat = new OrcNewInputFormat val fileSplit = new FileSplit( new Path(new URI(file.filePath)), file.start, file.length, Array.empty ) - - val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0) - val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId) - inputFormat.createRecordReader(fileSplit, hadoopAttemptContext) + // Create RecordReader here. This will help in getting + // ObjectInspector during recordReader creation itself and can + // avoid NN call in unwrapOrcStructs to get ObjectInspector for the + // file. Would be helpful for partitioned datasets. + new OrcRecordReader(OrcFile.createReader(new Path(new URI(file +.filePath)), OrcFile.readerOptions(conf)), conf, fileSplit.getStart(), +fileSplit.getLength()) } // Unwraps `OrcStruct`s to `UnsafeRow`s -val unsafeRowIterator = OrcRelation.unwrapOrcStructs( - file.filePath, conf, requiredSchema, new RecordReaderIterator[OrcStruct](orcRecordReader) +val unsafeRowIterator = OrcRelation.unwrapOrcStructs(conf, requiredSchema, + Some(orcRecordReader.getObjectInspector.asInstanceOf[StructObjectInspector]), + new RecordReaderIterator[OrcStruct](orcRecordReader) --- End diff -- Nit: Usually we wrap function arguments like this:
```scala
func(
  arg1, arg2, ...)
```
or
```scala
func(
  arg1,
  arg2,
  ...)
```
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r60365888 --- Diff: sql/core/pom.xml --- @@ -128,6 +128,10 @@ xbean-asm5-shaded test + + org.spark-project.hive + hive-exec + --- End diff -- Why do we need to add this dependency to spark-core? The ORC data source lives in spark-hive. Can we move the newly added `OrcRecordReader.java` to spark-hive and remove this dependency?
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r60365709 --- Diff: sql/core/src/main/java/org/apache/hadoop/hive/ql/io/orc/OrcRecordReader.java --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hive.ql.io.orc; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; +import org.apache.hadoop.io.NullWritable; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +import java.io.IOException; +import java.util.List; + +/** + * This is based on + * {@link org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat.OrcRecordReader}. --- End diff -- Would be nice to mention the specific version to which the original class belongs.
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r60365527 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala --- @@ -160,19 +160,22 @@ private[sql] class DefaultSource val job = Job.getInstance(conf) FileInputFormat.setInputPaths(job, file.filePath) - val inputFormat = new OrcNewInputFormat val fileSplit = new FileSplit( new Path(new URI(file.filePath)), file.start, file.length, Array.empty ) - - val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0) - val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId) - inputFormat.createRecordReader(fileSplit, hadoopAttemptContext) + // Create RecordReader here. This will help in getting + // ObjectInspector during recordReader creation itself and can + // avoid NN call in unwrapOrcStructs to get ObjectInspector for the + // file. Would be helpful for partitioned datasets. --- End diff -- It would be good to emphasize that this `OrcRecordReader` is a customized one. People who are not familiar with ORC may think it's the default ORC record reader.
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r60365406 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala --- @@ -160,19 +160,22 @@ private[sql] class DefaultSource val job = Job.getInstance(conf) FileInputFormat.setInputPaths(job, file.filePath) - val inputFormat = new OrcNewInputFormat val fileSplit = new FileSplit( new Path(new URI(file.filePath)), file.start, file.length, Array.empty ) - - val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0) - val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId) - inputFormat.createRecordReader(fileSplit, hadoopAttemptContext) + // Create RecordReader here. This will help in getting + // ObjectInspector during recordReader creation itself and can + // avoid NN call in unwrapOrcStructs to get ObjectInspector for the + // file. Would be helpful for partitioned datasets. + new OrcRecordReader(OrcFile.createReader(new Path(new URI(file +.filePath)), OrcFile.readerOptions(conf)), conf, fileSplit.getStart(), +fileSplit.getLength()) --- End diff -- Nit: Please split this call into the following two for better readability:
```scala
val orcReader = OrcFile.createReader(new Path(new URI(file.filePath)), OrcFile.readerOptions(conf))
new OrcRecordReader(orcReader, conf, fileSplit.getStart(), fileSplit.getLength())
```
[GitHub] spark pull request: [SPARK-14637][SQL] object expressions cleanup
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12399#discussion_r60267607 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects.scala --- @@ -130,60 +126,52 @@ case class Invoke( case _ => None } - lazy val unboxer = (dataType, method.map(_.getReturnType.getName).getOrElse("")) match { -case (IntegerType, "java.lang.Object") => (s: String) => - s"((java.lang.Integer)$s).intValue()" -case (LongType, "java.lang.Object") => (s: String) => - s"((java.lang.Long)$s).longValue()" -case (FloatType, "java.lang.Object") => (s: String) => - s"((java.lang.Float)$s).floatValue()" -case (ShortType, "java.lang.Object") => (s: String) => - s"((java.lang.Short)$s).shortValue()" -case (ByteType, "java.lang.Object") => (s: String) => - s"((java.lang.Byte)$s).byteValue()" -case (DoubleType, "java.lang.Object") => (s: String) => - s"((java.lang.Double)$s).doubleValue()" -case (BooleanType, "java.lang.Object") => (s: String) => - s"((java.lang.Boolean)$s).booleanValue()" -case _ => identity[String] _ - } - override def doGenCode(ctx: CodegenContext, ev: ExprCode): String = { val javaType = ctx.javaType(dataType) val obj = targetObject.genCode(ctx) val argGen = arguments.map(_.genCode(ctx)) val argString = argGen.map(_.value).mkString(", ") -// If the function can return null, we do an extra check to make sure our null bit is still set -// correctly. -val objNullCheck = if (ctx.defaultValue(dataType) == "null") { - s"boolean ${ev.isNull} = ${ev.value} == null;" +val callFunc = if (method.isDefined && method.get.getReturnType.isPrimitive) { + s"${obj.value}.$functionName($argString)" } else { - ev.isNull = obj.isNull - "" + s"(${ctx.boxedType(javaType)}) ${obj.value}.$functionName($argString)" } -val value = unboxer(s"${obj.value}.$functionName($argString)") --- End diff -- Don't we need to do unboxing anymore here? 
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12179#discussion_r60094164 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -288,15 +288,25 @@ case class DataSource( val fileCatalog: FileCatalog = new HDFSFileCatalog(sqlContext, options, globbedPaths, partitionSchema) -val dataSchema = userSpecifiedSchema.orElse { + +val dataSchema = userSpecifiedSchema.map { schema => + val equality = +if (sqlContext.conf.caseSensitiveAnalysis) { + org.apache.spark.sql.catalyst.analysis.caseSensitiveResolution +} else { + org.apache.spark.sql.catalyst.analysis.caseInsensitiveResolution +} + + StructType(schema.filterNot(f => partitionColumns.exists(equality(_, f.name)))) +}.orElse { --- End diff -- @yhuai Answering the question we raised offline here. Without this fix, the following test case added back in this PR fails: > SimpleTextHadoopFsRelationSuite.SPARK-7616: adjust column name order accordingly when saving partitioned table The major contradiction here is that the result of `FileFormat.inferSchema()` is a schema consisting of all the columns that live in the physical data files, and it may contain a subset of the partition columns. On the other hand, the user-specified schema passed via `DataFrameReader.schema()` refers to the full schema of the table, including all the partition columns. For `FileFormat` data sources whose `inferSchema()` returns `None`, we have no idea whether the physical files contain partition columns or not. To fix the regression failure, here we chop off all partition columns from the user-specified schema. But this imposes a restriction on `FileFormat` data sources without schema inference ability: > No partition columns are allowed in physical files. This doesn't cause much trouble for Spark built-in `FileFormat` data sources, since all of them either have a fixed schema (LibSVM and text) or are able to infer their own schema (Parquet, ORC, JSON, and CSV).
I've checked that this restriction also exists in branch-1.6. But I'd say this restriction exists by accident rather than by design. An alternative design is to alter the semantics of the user-specified schema set via `DataFrameReader.schema()` and make it represent the schema of the physical files. In this way, we can solve the problem unambiguously. But apparently this may break the runtime behavior of existing user code. So it seems that living with it is the more reasonable choice for now?
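The chopping logic described above can be sketched like this (assuming Spark's `StructType` API is on the classpath; the column names are made up for illustration):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Resolver analogous to Catalyst's caseSensitiveResolution / caseInsensitiveResolution.
def resolver(caseSensitive: Boolean): (String, String) => Boolean =
  if (caseSensitive) _ == _ else _.equalsIgnoreCase(_)

val userSpecifiedSchema = StructType(Seq(
  StructField("id", IntegerType),     // data column in the physical files
  StructField("Year", StringType)))   // partition column, note the different case
val partitionColumns = Seq("year")

val equality = resolver(caseSensitive = false)
// Chop off partition columns so only physical-file columns remain.
val dataSchema = StructType(
  userSpecifiedSchema.filterNot(f => partitionColumns.exists(equality(_, f.name))))
// With case-insensitive resolution, dataSchema keeps only "id"; with
// case-sensitive resolution, "Year" would survive and be treated as a data column.
```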
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211446073 Seems that Jenkins is down...
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211423901 retest this please
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210092607 retest this please
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59766576 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala --- @@ -624,20 +624,40 @@ class HDFSFileCatalog( def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path) + private implicit class LocatedFileStatusIterator(iterator: RemoteIterator[LocatedFileStatus]) --- End diff -- It's used [here][1]. `RemoteIterator` doesn't extend from `Iterator`. This implicit conversion helps to convert contents behind a `RemoteIterator` into an array. [1]: https://github.com/apache/spark/pull/12153/files#diff-40c347747af9101e7e9fee52fc4120b8R656
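An adapter of the kind described can be sketched as follows (assuming Hadoop's FS API on the classpath; Hadoop's `RemoteIterator` exposes only `hasNext`/`next`, both of which may throw `IOException`):

```scala
import scala.reflect.ClassTag

import org.apache.hadoop.fs.RemoteIterator

// RemoteIterator is not a scala.Iterator, so wrap it to gain collection
// methods such as toArray. Exceptions from the underlying iterator propagate.
implicit class RemoteIteratorOps[T](underlying: RemoteIterator[T]) {
  def toScalaIterator: Iterator[T] = new Iterator[T] {
    override def hasNext: Boolean = underlying.hasNext
    override def next(): T = underlying.next()
  }

  def toArray(implicit ct: ClassTag[T]): Array[T] = toScalaIterator.toArray
}
```

With this in scope, a call like `fs.listLocatedStatus(path).toArray` materializes the remote listing into an `Array[LocatedFileStatus]`.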
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59707908 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { case _ => Nil } + + private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match { +case f: LocatedFileStatus => f.getBlockLocations +case f => Array.empty[BlockLocation] + } + + // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of + // the same file, find out the index of the block that contains this `offset`. If no such block + // can be found, returns -1. + private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = { +blockLocations.indexWhere { b => + b.getOffset <= offset && offset < b.getOffset + b.getLength +} + } + + // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of + // the same file, find out the block that contains this `offset`, and returns location hosts of + // that block. If no such block can be found, returns an empty array. + private def getBlockHosts(blockLocations: Array[BlockLocation], offset: Long): Array[String] = { --- End diff -- Makes sense, updated.
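For reference, the helpers under discussion look roughly like this (the `getBlockIndex` body is taken from the diff above; the `getBlockHosts` body is a sketch of the documented behavior, since the diff truncates it):

```scala
import org.apache.hadoop.fs.BlockLocation

// Index of the block whose half-open range [getOffset, getOffset + getLength)
// contains the given file offset; -1 if no block matches.
def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int =
  blockLocations.indexWhere { b =>
    b.getOffset <= offset && offset < b.getOffset + b.getLength
  }

// Hosts of the block containing `offset`, or an empty array if none does.
def getBlockHosts(blockLocations: Array[BlockLocation], offset: Long): Array[String] = {
  val index = getBlockIndex(blockLocations, offset)
  if (index == -1) Array.empty[String] else blockLocations(index).getHosts
}
```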
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59707891 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala --- @@ -86,4 +87,9 @@ class FileScanRDD( } override protected def getPartitions: Array[Partition] = filePartitions.toArray + + override protected def getPreferredLocations(split: Partition): Seq[String] = { +val files = split.asInstanceOf[FilePartition].files +if (files.isEmpty) Seq.empty else files.maxBy(_.length).locations --- End diff -- Updated. @cloud-fan also suggested the same strategy. Thanks!
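The locality heuristic in the diff above can be sketched with plain Scala (the `PartitionedFile` shape is simplified here to just what the heuristic needs):

```scala
// Simplified stand-in for Spark's PartitionedFile: byte length plus host list.
case class PartitionedFile(length: Long, locations: Seq[String])

// Prefer the hosts of the largest file in the partition: scheduling the task
// there keeps the biggest read local and saves the most network traffic.
def preferredLocations(files: Seq[PartitionedFile]): Seq[String] =
  if (files.isEmpty) Seq.empty else files.maxBy(_.length).locations
```

For example, `preferredLocations(Seq(PartitionedFile(10, Seq("hostA")), PartitionedFile(99, Seq("hostB"))))` returns `Seq("hostB")`.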
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12179#discussion_r59596010 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/CommitFailureTestRelationSuite.scala --- @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.sql.hive.test.TestHiveSingleton +import org.apache.spark.sql.test.SQLTestUtils + +class CommitFailureTestRelationSuite extends SQLTestUtils with TestHiveSingleton { + // When committing a task, `CommitFailureTestSource` throws an exception for testing purpose. + val dataSourceName: String = classOf[CommitFailureTestSource].getCanonicalName + + test("SPARK-7684: commitTask() failure should fallback to abortTask()") { --- End diff -- But yea, we should probably update these test cases without testing `unhandledFilters` and add them back to test filter push-down.
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12179#discussion_r59595837 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/CommitFailureTestRelationSuite.scala --- @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.sql.hive.test.TestHiveSingleton +import org.apache.spark.sql.test.SQLTestUtils + +class CommitFailureTestRelationSuite extends SQLTestUtils with TestHiveSingleton { + // When committing a task, `CommitFailureTestSource` throws an exception for testing purpose. + val dataSourceName: String = classOf[CommitFailureTestSource].getCanonicalName + + test("SPARK-7684: commitTask() failure should fallback to abortTask()") { --- End diff -- Do you mean those filter push-down tests in the original `SimpleTextHadoopFsRelationSuite`? Those weren't added because `HadoopFsRelation` doesn't have `unhandledFilters` now.
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59595284 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { case _ => Nil } + + private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match { +case f: LocatedFileStatus => f.getBlockLocations +case f => Array.empty[BlockLocation] + } + + // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of + // the same file, find out the index of the block that contains this `offset`. If no such block + // can be found, returns -1. + private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = { --- End diff -- This one is a little bit tricky, would like to keep it here for readability.
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59595123 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala --- @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t") } } + + test("Locality support for FileScanRDD") { +withHadoopConf( + "fs.file.impl" -> classOf[LocalityTestFileSystem].getName, + "fs.file.impl.disable.cache" -> "true" +) { + withTempPath { dir => +val path = "file://" + dir.getCanonicalPath +val df1 = sqlContext.range(4) + df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path) + df1.coalesce(1).write.mode("append").format(dataSourceName).save(path) + +val df2 = sqlContext.read + .format(dataSourceName) + .option("dataSchema", df1.schema.json) + .load(path) + +val Some(fileScanRDD) = { --- End diff -- Yea, this is better, thanks!
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59594996 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala --- @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t") } } + + test("Locality support for FileScanRDD") { +withHadoopConf( + "fs.file.impl" -> classOf[LocalityTestFileSystem].getName, + "fs.file.impl.disable.cache" -> "true" +) { + withTempPath { dir => +val path = "file://" + dir.getCanonicalPath +val df1 = sqlContext.range(4) + df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path) + df1.coalesce(1).write.mode("append").format(dataSourceName).save(path) + +val df2 = sqlContext.read + .format(dataSourceName) + .option("dataSchema", df1.schema.json) + .load(path) + +val Some(fileScanRDD) = { + def allRDDsInDAG(rdd: RDD[_]): Seq[RDD[_]] = { +rdd +: rdd.dependencies.map(_.rdd).flatMap(allRDDsInDAG) + } + + // We have to search for the `FileScanRDD` along the RDD DAG since + // RDD.preferredLocations doesn't propagate along the DAG to the root RDD. + allRDDsInDAG(df2.rdd).collectFirst { +case f: FileScanRDD => f + } +} + +val partitions = fileScanRDD.partitions --- End diff -- Not necessarily, and probably only 1, since both files written are pretty small and will be merged into a single partition.
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12178#issuecomment-209510234 Closing this in favor of #12361 since master went through some major changes and it would be hard to resolve conflicts of this one.
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
Github user liancheng closed the pull request at: https://github.com/apache/spark/pull/12178
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12361 [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution/datasources package #12178 ## What changes were proposed in this pull request? This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package. Note that to avoid conflicts, this PR is based on #12153. Effective commits for this PR only consist of the last two. Will rebase after merging #12153. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-14407-hide-hadoop-fs-relation Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12361.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12361 commit 06b2afb9f3bd5dc4833e939f3f522168252eaa4b Author: Cheng Lian <l...@databricks.com> Date: 2016-04-04T15:33:18Z Locality support for FileScanRDD commit d94f7e95fc772d26c9d515a8ab9b6b96432fc8ff Author: Cheng Lian <l...@databricks.com> Date: 2016-04-04T17:21:22Z Fixes import order commit 5f214c12bdf133d90ceac441130af18af2fb7e6a Author: Cheng Lian <l...@databricks.com> Date: 2016-04-04T18:38:46Z Fixes cases where block location list is empty commit f036bb10304eb8f127a9956c178d6948a470ed9c Author: Cheng Lian <l...@databricks.com> Date: 2016-04-05T12:29:42Z Adds withHadoopConf utility method commit 426b52817d80b36f2c585ae498580a317e642312 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-05T14:27:36Z Tests commit 082d4a3ee7034dae3447cde68ae0bbcae3a80632 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-05T16:59:58Z Moves the test case to HadoopFsRelationTest commit f72d0ffbf15d599918f275ecf1a6c76223a6de97 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-06T14:54:53Z Addresses PR comments commit 
deb0f8509f6edff890f7882a3ce36d4cd0c6e403 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-06T14:59:22Z Adjusts file partition size fraction threshold to 10% commit 25224e1d09cd37beb4a922ad238f59c9e3732efb Author: Cheng Lian <l...@databricks.com> Date: 2016-04-07T16:44:56Z Fixes test case flakiness commit e666dea2241f400dd44abd2161db269ab7fb71af Author: Cheng Lian <l...@databricks.com> Date: 2016-04-11T13:29:17Z Uses getLocatedStatus() instead of getStatus() + getBlockLocations() commit 7cd40d989cc5008287547640ee47b0b89d59bdee Author: Cheng Lian <l...@databricks.com> Date: 2016-04-12T14:33:49Z Fixes import order commit 843e050b3baea14bff042e62c519518720d33c0a Author: Cheng Lian <l...@databricks.com> Date: 2016-04-13T13:55:01Z Makes locality test case more robust commit cef40bfb940a24e43ac01e4c521be6de51f5662d Author: Cheng Lian <l...@databricks.com> Date: 2016-04-13T13:59:09Z Simplifies locality selection commit af018f8b501ea3a396605e05889ac7cbed00cc78 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-13T15:23:23Z Hides HadoopFsRelation related data source API commit c5fba2f9ab821ef5c577a244b7c157549bb819e5 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-13T15:25:04Z Fixes MiMA
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12179#issuecomment-209462358 @cloud-fan Actually not completely copy-pasted since we did major changes to the data sources API...
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12179#issuecomment-209047957 Not sure why MiMA complained about `InputMetrics` stuff...
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12179#issuecomment-209047839 retest this please
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12179#issuecomment-209047622 @yhuai This should be ready for review.
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12179#discussion_r59430563 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala --- @@ -222,7 +222,7 @@ case class CreateMetastoreDataSourceAsSelect( val data = Dataset.ofRows(hiveContext, query) val df = existingSchema match { // If we are inserting into an existing table, just use the existing schema. - case Some(s) => sqlContext.internalCreateDataFrame(data.queryExecution.toRdd, s) + case Some(s) => data.selectExpr(s.fieldNames: _*) --- End diff -- Fix for SPARK-14566 by using a projection instead of simply applying existing schema to input query plan.
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12179#discussion_r59430279 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -288,15 +288,25 @@ case class DataSource( val fileCatalog: FileCatalog = new HDFSFileCatalog(sqlContext, options, globbedPaths, partitionSchema) -val dataSchema = userSpecifiedSchema.orElse { + +val dataSchema = userSpecifiedSchema.map { schema => + val equality = +if (sqlContext.conf.caseSensitiveAnalysis) { + org.apache.spark.sql.catalyst.analysis.caseSensitiveResolution +} else { + org.apache.spark.sql.catalyst.analysis.caseInsensitiveResolution +} + + StructType(schema.filterNot(f => partitionColumns.exists(equality(_, f.name)))) +}.orElse { --- End diff -- Fix for SPARK-14458.
[GitHub] spark pull request: [SPARK-14558][CORE] In ClosureCleaner, clean t...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12327#issuecomment-208939061 LGTM
[GitHub] spark pull request: [SPARK-14558][CORE] In ClosureCleaner, clean t...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12327#issuecomment-208939090 retest this please
[GitHub] spark pull request: [SPARK-14488][SPARK-14493][SQL] "CREATE TEMPOR...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12303#issuecomment-208930853 Thanks for the review! Merging to master.
[GitHub] spark pull request: [SPARK-14488][SPARK-14493][SQL] "CREATE TEMPOR...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12303#discussion_r59306581 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala --- @@ -1852,4 +1852,31 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { } } } + + test( +"SPARK-14488 \"CREATE TEMPORARY TABLE ... USING ... AS SELECT ...\" " + +"shouldn't create persisted table" + ) {withTempPath { dir => + withTempTable("t1", "t2") { +val path = dir.getCanonicalPath +val ds = sqlContext.range(10) +ds.registerTempTable("t1") + +sql( + s"""CREATE TEMPORARY TABLE t2 + |USING PARQUET + |OPTIONS (PATH '$path') + |AS SELECT * FROM t1 --- End diff -- By "persisted" I mean the table metadata is persisted into Hive metastore. Since we are fixing `CreateTempTableUsingAsSelect`, this syntax IS the one we are going to test.
[GitHub] spark pull request: [SPARK-14488][SQL] "CREATE TEMPORARY TABLE ......
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12303 [SPARK-14488][SQL] "CREATE TEMPORARY TABLE ... USING ... AS SELECT" shouldn't create persisted table ## What changes were proposed in this pull request? When planning the logical plan node `CreateTableUsingAsSelect`, we neglected its `temporary` field and always generated a `CreateMetastoreDataSourceAsSelect`. This PR fixes the issue by generating a `CreateTempTableUsingAsSelect` when `temporary` is true. ## How was this patch tested? Added a test case that creates a temporary table using the target syntax and checks whether it's indeed a temporary table. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-14488-fix-ctas-using Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12303.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12303 commit eefc8cdc8373c835d2983ea7343487590d9826f8 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-11T17:34:06Z Fixes SPARK-14488
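The dispatch this PR describes can be sketched as a toy, self-contained model. The names below only loosely mirror the real Spark classes, and `plan` is an illustrative stand-in for the actual planner rule, which is considerably more involved:

```scala
// Simplified stand-ins for the planner's input node and output commands.
sealed trait Command
case class CreateTempTableUsingAsSelect(table: String) extends Command
case class CreateMetastoreDataSourceAsSelect(table: String) extends Command

case class CreateTableUsingAsSelect(table: String, temporary: Boolean)

// Before the fix the `temporary` flag was ignored and the metastore-backed
// command was always produced; after the fix the flag selects the command.
def plan(node: CreateTableUsingAsSelect): Command =
  if (node.temporary) CreateTempTableUsingAsSelect(node.table)
  else CreateMetastoreDataSourceAsSelect(node.table)
```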
[GitHub] spark pull request: [SPARK-14535][SQL] Remove buildInternalScan fr...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12300#issuecomment-208411048 And the ORC data source, which lives in the `sql/hive` module, hasn't been cleaned up yet.
[GitHub] spark pull request: [SPARK-14535][SQL] Remove buildInternalScan fr...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12300#issuecomment-208410251 We can also remove [these pattern match guard conditions][1] and the `spark.sql.sources.fileScan` SQL option. [1]: https://github.com/apache/spark/blob/e82d95bf63f57cefa02dc545ceb451ecdeedce28/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L59-L66
[GitHub] spark pull request: [SPARK-14372] [SQL] : Dataset.randomSplit() ne...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12184#issuecomment-208244759 Thanks! Merged to master.
[GitHub] spark pull request: [SPARK-14372] [SQL] : Dataset.randomSplit() ne...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12184#issuecomment-208243291 LGTM @marmbrus Returning a `Seq` doesn't seem very Java-friendly?
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59057227 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { val splitFiles = selectedPartitions.flatMap { partition => partition.files.flatMap { file => + val blockLocations = getBlockLocations(file) (0L to file.getLen by maxSplitBytes).map { offset => val remaining = file.getLen - offset val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining +// Finds the list of hosts serving the first block that contains `offset`, the +// starting point of this file split of length `size`. +val hosts = getBlockHosts(blockLocations, offset) +PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts) --- End diff -- This is a really good pointer, thanks!
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58909434 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { val splitFiles = selectedPartitions.flatMap { partition => partition.files.flatMap { file => + val blockLocations = getBlockLocations(file) (0L to file.getLen by maxSplitBytes).map { offset => val remaining = file.getLen - offset val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining +// Finds the list of hosts serving the first block that contains `offset`, the +// starting point of this file split of length `size`. +val hosts = getBlockHosts(blockLocations, offset) +PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts) --- End diff -- Each `FilePartition` derives its own preferred locations in `FileScanRDD.preferredLocations()` from the location hosts of all its `PartitionedFile`s.
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58720687 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala --- @@ -86,4 +87,8 @@ class FileScanRDD( } override protected def getPartitions: Array[Partition] = filePartitions.toArray + + override protected def getPreferredLocations(split: Partition): Seq[String] = { +split.asInstanceOf[FilePartition].files.flatMap(_.locations).distinct --- End diff -- Added a constraint so that we only pick locations of blocks whose sizes are larger than 10% of the whole file partition.
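The constraint mentioned above can be sketched as follows. `FileSlice` is an illustrative stand-in for `PartitionedFile`, and `preferredLocations` is a simplified model of the filtering the comment describes, not the actual implementation in `FileScanRDD`:

```scala
// A host is kept as a preferred location only when the bytes it serves add up
// to strictly more than 10% of the partition's total bytes.
case class FileSlice(length: Long, hosts: Seq[String])

def preferredLocations(files: Seq[FileSlice]): Seq[String] = {
  val totalBytes = files.map(_.length).sum.toDouble
  // Total bytes served by each host across all file slices in the partition.
  val bytesPerHost = files
    .flatMap(f => f.hosts.map(host => host -> f.length))
    .groupBy(_._1)
    .map { case (host, pairs) => host -> pairs.map(_._2).sum }
  bytesPerHost.collect { case (host, bytes) if bytes > 0.1 * totalBytes => host }.toSeq
}
```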
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58720699 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -205,4 +210,20 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { case _ => Nil } + + private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match { +case f: LocatedFileStatus => f.getBlockLocations +case f => Array.empty[BlockLocation] + } + + private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = { --- End diff -- Added comments.
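A self-contained sketch of the block-lookup helpers discussed in this thread. `BlockInfo` is an illustrative stand-in for Hadoop's `BlockLocation` (offset, length, hosts); the real `FileSourceStrategy` methods operate on `BlockLocation` directly:

```scala
case class BlockInfo(offset: Long, length: Long, hosts: Seq[String])

// Index of the first block containing `offset`, or -1 when the offset falls
// past the last block (e.g. a split starting at the very end of the file).
def getBlockIndex(blocks: Array[BlockInfo], offset: Long): Int =
  blocks.indexWhere(b => b.offset <= offset && offset < b.offset + b.length)

// Hosts serving the block that contains the split's starting offset; an empty
// list means "no locality preference" for this split.
def getBlockHosts(blocks: Array[BlockInfo], offset: Long): Seq[String] = {
  val i = getBlockIndex(blocks, offset)
  if (i >= 0) blocks(i).hosts else Seq.empty
}
```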
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58720668 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala --- @@ -620,17 +621,28 @@ class HDFSFileCatalog( if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) { HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext) } else { - val statuses = paths.flatMap { path => + val statuses: Seq[FileStatus] = paths.flatMap { path => val fs = path.getFileSystem(hadoopConf) logInfo(s"Listing $path on driver") // Dummy jobconf to get to the pathFilter defined in configuration -val jobConf = new JobConf(hadoopConf, this.getClass()) +val jobConf = new JobConf(hadoopConf, this.getClass) val pathFilter = FileInputFormat.getInputPathFilter(jobConf) -if (pathFilter != null) { +val statuses = if (pathFilter != null) { Try(fs.listStatus(path, pathFilter)).getOrElse(Array.empty) } else { Try(fs.listStatus(path)).getOrElse(Array.empty) } + +fs match { + case _: S3FileSystem => --- End diff -- Thanks, made corresponding changes and added comments.
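The listing behaviour in the diff above can be modelled with a small sketch. The types below are toy stand-ins for Hadoop's `FileStatus`/`LocatedFileStatus`, and the `hasLocality` flag replaces the `fs match` on the concrete filesystem class:

```scala
sealed trait Status { def path: String }
case class PlainStatus(path: String) extends Status
case class LocatedStatus(path: String, hosts: Seq[String]) extends Status

// Filesystems without block locality (such as S3) keep their plain statuses;
// others are enriched with the hosts serving each file's blocks.
def withLocations(
    statuses: Seq[PlainStatus],
    hasLocality: Boolean,
    lookupHosts: String => Seq[String]): Seq[Status] =
  if (!hasLocality) statuses // e.g. S3: no locality information available
  else statuses.map(s => LocatedStatus(s.path, lookupHosts(s.path)))
```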
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58720664 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala --- @@ -620,17 +621,28 @@ class HDFSFileCatalog( if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) { HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext) } else { - val statuses = paths.flatMap { path => + val statuses: Seq[FileStatus] = paths.flatMap { path => val fs = path.getFileSystem(hadoopConf) logInfo(s"Listing $path on driver") // Dummy jobconf to get to the pathFilter defined in configuration -val jobConf = new JobConf(hadoopConf, this.getClass()) +val jobConf = new JobConf(hadoopConf, this.getClass) val pathFilter = FileInputFormat.getInputPathFilter(jobConf) -if (pathFilter != null) { +val statuses = if (pathFilter != null) { Try(fs.listStatus(path, pathFilter)).getOrElse(Array.empty) } else { Try(fs.listStatus(path)).getOrElse(Array.empty) } + +fs match { + case _: S3FileSystem => +// S3 doesn't provide locality information. +statuses.toSeq + + case _ => +statuses.map { f => + new LocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen)) --- End diff -- Done.
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58720692 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -137,10 +140,12 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { val splitFiles = selectedPartitions.flatMap { partition => partition.files.flatMap { file => + val blockLocations = getBlockLocations(file) (0L to file.getLen by maxSplitBytes).map { offset => val remaining = file.getLen - offset val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining -PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size) +val hosts = getBlockHosts(blockLocations, offset) +PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts) --- End diff -- Added.
[GitHub] spark pull request: [SPARK-13681][SQL][WIP] Add back once removed ...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12179 [SPARK-13681][SQL][WIP] Add back once removed CommitFailureTestRelationSuite ## What changes were proposed in this pull request? This test suite was removed while refactoring `HadoopFsRelation` related API. This PR brings it back. It's still WIP because `SimpleTextHadoopFsRelationSuite` still fails. Fixing it. ## How was this patch tested? A testing relation that always fails committing write tasks is used to test task commit failure. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-13681-commit-failure-test Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12179.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12179 commit 4cc1adc8049bdf83255e16875ebb9d970eac00d2 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-05T17:56:26Z WIP commit 3b784501eb4831e6355aeb8b131a31cc2c4eada8 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-05T18:19:30Z Comments out failed test suite
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12178 [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution package ## What changes were proposed in this pull request? This PR moves `HadoopFsRelation` related data source API into the execution package. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-14407-hide-file-scan-api Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12178.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12178 commit 64b7cf487c59aee3217aae37733ff9879dd79c2c Author: Cheng Lian <l...@databricks.com> Date: 2016-04-05T16:58:52Z Hides HadoopFsRelation related data source API