[GitHub] spark issue #16841: [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN s...

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16841
  
**[Test build #72984 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72984/testReport)** for PR 16841 at commit [`850aacd`](https://github.com/apache/spark/commit/850aacdec86a1dc1ffc4c4f1b77f828f4aa1078f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101461535
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -161,23 +161,49 @@ private[hive] class 
HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
   bucketSpec,
   Some(partitionSchema))
 
+val catalogTable = metastoreRelation.catalogTable
 val logicalRelation = cached.getOrElse {
   val sizeInBytes =
 
metastoreRelation.stats(sparkSession.sessionState.conf).sizeInBytes.toLong
   val fileIndex = {
-val index = new CatalogFileIndex(
-  sparkSession, metastoreRelation.catalogTable, sizeInBytes)
+val index = new CatalogFileIndex(sparkSession, catalogTable, 
sizeInBytes)
 if (lazyPruningEnabled) {
   index
 } else {
   index.filterPartitions(Nil)  // materialize all the 
partitions in memory
 }
   }
   val partitionSchemaColumnNames = 
partitionSchema.map(_.name.toLowerCase).toSet
-  val dataSchema =
-StructType(metastoreSchema
+  val filteredMetastoreSchema = StructType(metastoreSchema
   .filterNot(field => 
partitionSchemaColumnNames.contains(field.name.toLowerCase)))
 
+  val inferenceMode = 
sparkSession.sessionState.conf.schemaInferenceMode
+  val dataSchema = if (inferenceMode != "NEVER_INFER" &&
+  !catalogTable.schemaFromTableProps) {
+val fileStatuses = fileIndex.listFiles(Nil).flatMap(_.files)
+val inferred = defaultSource.inferSchema(sparkSession, 
options, fileStatuses)
--- End diff --

I'll add an info log here.
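
For reference, a minimal sketch of what such an info log could look like, assuming the `catalogTable` and `inferenceMode` values from the diff above (illustrative only, not the committed change):

```scala
// Hypothetical log line inside the inference branch of HiveMetastoreCatalog
// (which extends Logging); the wording and fields are only a suggestion.
logInfo(s"Inferring case-sensitive schema for table ${catalogTable.identifier} " +
  s"(inference mode: $inferenceMode)")
```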





[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101461357
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala
 ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.util.concurrent.{Executors, TimeUnit}
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.execution.datasources.FileStatusCache
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.hive.client.HiveClient
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.sql.types._
+
+class HiveSchemaInferenceSuite
+  extends QueryTest with TestHiveSingleton with SQLTestUtils with 
BeforeAndAfterEach {
+
+  import HiveSchemaInferenceSuite._
+
+  // Create a CatalogTable instance modeling an external Hive table in a 
metastore that isn't
+  // controlled by Spark (i.e. has no Spark-specific table properties set).
+  private def hiveExternalCatalogTable(
+  tableName: String,
+  location: String,
+  schema: StructType,
+  partitionColumns: Seq[String],
+  properties: Map[String, String] = Map.empty): CatalogTable = {
+CatalogTable(
+  identifier = TableIdentifier(table = tableName, database = 
Option("default")),
+  tableType = CatalogTableType.EXTERNAL,
+  storage = CatalogStorageFormat(
+locationUri = Option(location),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serialization.format" -> "1")),
+  schema = schema,
+  provider = Option("hive"),
+  partitionColumnNames = partitionColumns,
+  properties = properties)
+  }
+
+  // Creates CatalogTablePartition instances for adding partitions of data 
to our test table.
+  private def hiveCatalogPartition(location: String, index: Int): 
CatalogTablePartition
+= CatalogTablePartition(
+  spec = Map("partcol1" -> index.toString, "partcol2" -> 
index.toString),
+  storage = CatalogStorageFormat(
+locationUri = 
Option(s"${location}/partCol1=$index/partCol2=$index/"),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serialization.format" -> "1")))
+
+  // Creates a case-sensitive external Hive table for testing schema 
inference options. Table
+  // will not have Spark-specific table properties set.
+  private def setupCaseSensitiveTable(
+  tableName: String,
+  dir: File): Unit = {
+spark.range(NUM_RECORDS)
+  .selectExpr("id as fieldOne", "id as partCol1", "id as partCol2")
+  .write
+  .partitionBy("partCol1", "partCol2")
+  .mode("overwrite")
+  .parquet(dir.getAbsolutePath)
+
+val lowercaseSchema = StructType(Seq(
+  StructField("fieldone", LongType),
+  StructField("partcol1", IntegerType),
+  StructField("partcol2", IntegerType)))
+
+val client = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+val catalogTable = 

[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101461155
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala
 ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.util.concurrent.{Executors, TimeUnit}
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.execution.datasources.FileStatusCache
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.hive.client.HiveClient
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.sql.types._
+
+class HiveSchemaInferenceSuite
+  extends QueryTest with TestHiveSingleton with SQLTestUtils with 
BeforeAndAfterEach {
+
+  import HiveSchemaInferenceSuite._
+
+  // Create a CatalogTable instance modeling an external Hive table in a 
metastore that isn't
+  // controlled by Spark (i.e. has no Spark-specific table properties set).
--- End diff --

I wrote the method to take arbitrary properties, but for the purposes of this test only an empty map is supplied. I'll make the comment more applicable to the method itself and describe its usage elsewhere.





[GitHub] spark issue #16620: [SPARK-19263] DAGScheduler should avoid sending conflict...

2017-02-15 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/16620
  
@kayousterhout @squito @markhamstra 
Thanks for all of your work on this patch. Really appreciate your help :)





[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101460842
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -161,23 +161,49 @@ private[hive] class 
HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
   bucketSpec,
   Some(partitionSchema))
 
+val catalogTable = metastoreRelation.catalogTable
 val logicalRelation = cached.getOrElse {
   val sizeInBytes =
 
metastoreRelation.stats(sparkSession.sessionState.conf).sizeInBytes.toLong
   val fileIndex = {
-val index = new CatalogFileIndex(
-  sparkSession, metastoreRelation.catalogTable, sizeInBytes)
+val index = new CatalogFileIndex(sparkSession, catalogTable, 
sizeInBytes)
 if (lazyPruningEnabled) {
   index
 } else {
   index.filterPartitions(Nil)  // materialize all the 
partitions in memory
 }
   }
   val partitionSchemaColumnNames = 
partitionSchema.map(_.name.toLowerCase).toSet
-  val dataSchema =
-StructType(metastoreSchema
+  val filteredMetastoreSchema = StructType(metastoreSchema
   .filterNot(field => 
partitionSchemaColumnNames.contains(field.name.toLowerCase)))
--- End diff --

I'll fix both of these





[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16386#discussion_r101460558
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1802,4 +1806,118 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
 val df2 = spark.read.option("PREfersdecimaL", "true").json(records)
 assert(df2.schema == schema)
   }
+
+  test("SPARK-18352: Parse normal multi-line JSON files (compressed)") {
+withTempPath { dir =>
+  val path = dir.getCanonicalPath
+  primitiveFieldAndType
+.toDF("value")
+.write
+.option("compression", "GzIp")
+.text(path)
+
+  assert(new File(path).listFiles().exists(_.getName.endsWith(".gz")))
+
+  val jsonDF = spark.read.option("wholeFile", true).json(path)
+  val jsonDir = new File(dir, "json").getCanonicalPath
+  jsonDF.coalesce(1).write
+.format("json")
+.option("compression", "gZiP")
+.save(jsonDir)
+
+  assert(new 
File(jsonDir).listFiles().exists(_.getName.endsWith(".json.gz")))
+
+  val jsonCopy = spark.read
+.format("json")
+.load(jsonDir)
+
+  assert(jsonCopy.count === jsonDF.count)
+  val jsonCopySome = jsonCopy.selectExpr("string", "long", "boolean")
+  val jsonDFSome = jsonDF.selectExpr("string", "long", "boolean")
--- End diff --

Actually, it only covers three columns.
```
root
 |-- bigInteger: decimal(20,0) (nullable = true)
 |-- boolean: boolean (nullable = true)
 |-- double: double (nullable = true)
 |-- integer: long (nullable = true)
 |-- long: long (nullable = true)
 |-- null: string (nullable = true)
 |-- string: string (nullable = true)

root
 |-- bigInteger: decimal(20,0) (nullable = true)
 |-- boolean: boolean (nullable = true)
 |-- double: double (nullable = true)
 |-- integer: long (nullable = true)
 |-- long: long (nullable = true)
 |-- string: string (nullable = true)
```





[GitHub] spark issue #16776: [SPARK-19436][SQL] Add missing tests for approxQuantile

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16776
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16776: [SPARK-19436][SQL] Add missing tests for approxQuantile

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16776
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72983/
Test PASSed.





[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread budde
Github user budde commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101460565
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -296,6 +296,17 @@ object SQLConf {
   .longConf
   .createWithDefault(250 * 1024 * 1024)
 
+  val HIVE_SCHEMA_INFERENCE_MODE = 
buildConf("spark.sql.hive.schemaInferenceMode")
+.doc("Configures the action to take when a case-sensitive schema 
cannot be read from a Hive " +
+  "table's properties. Valid options include INFER_AND_SAVE (infer the 
case-sensitive " +
+  "schema from the underlying data files and write it back to the 
table properties), " +
+  "INFER_ONLY (infer the schema but don't attempt to write it to the 
table properties) and " +
+  "NEVER_INFER (fallback to using the case-insensitive metastore 
schema instead of inferring).")
+.stringConf
+.transform(_.toUpperCase())
+.checkValues(Set("INFER_AND_SAVE", "INFER_ONLY", "NEVER_INFER"))
+.createWithDefault("INFER_AND_SAVE")
--- End diff --

This was proposed in #16797, but I'd like to open it up for discussion.
- ```INFER_ONLY``` would mimic the pre-2.1.0 behavior.
- ```INFER_AND_SAVE``` would prevent repeated inference on later reads, but the save may fail if the Hive client doesn't have write permissions on the metastore.
- ```NEVER_INFER``` is the current behavior in 2.1.0, which breaks support for the tables affected by [SPARK-19611](https://issues.apache.org/jira/browse/SPARK-19611). Users may wish to enable this mode for tables that lack the table-properties schema but are known to be case-insensitive.
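
As a rough illustration of how the proposed setting would be used (the key and values are the ones proposed in this PR and may still change):

```scala
// Sketch only: `spark` is an existing SparkSession.
spark.conf.set("spark.sql.hive.schemaInferenceMode", "INFER_ONLY")

// Or at launch time:
//   spark-submit --conf spark.sql.hive.schemaInferenceMode=NEVER_INFER ...
```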





[GitHub] spark issue #16776: [SPARK-19436][SQL] Add missing tests for approxQuantile

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16776
  
**[Test build #72983 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72983/testReport)** for PR 16776 at commit [`2268d36`](https://github.com/apache/spark/commit/2268d360206fd3e262d316ba3e02b35a525796da).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/16386#discussion_r101460391
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1764,4 +1769,117 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
 val df2 = spark.read.option("PREfersdecimaL", "true").json(records)
 assert(df2.schema == schema)
   }
+
+  test("SPARK-18352: Parse normal multi-line JSON files (compressed)") {
+withTempPath { dir =>
+  val path = dir.getCanonicalPath
+  primitiveFieldAndType
+.toDF("value")
+.write
+.option("compression", "GzIp")
+.text(path)
+
+  assert(new File(path).listFiles().exists(_.getName.endsWith(".gz")))
+
+  val jsonDF = spark.read.option("wholeFile", true).json(path)
--- End diff --

I have the same concern. We need to check whether the data that is read back matches what we expect.
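
One possible way to add that check inside the test, assuming the `jsonDF` and `jsonCopy` DataFrames defined above (a sketch, not necessarily the final assertion):

```scala
// Compare the round-tripped rows against the originals instead of only counts.
checkAnswer(
  jsonCopy.selectExpr("string", "long", "boolean"),
  jsonDF.selectExpr("string", "long", "boolean").collect())
```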





[GitHub] spark pull request #16895: [SPARK-15615][SQL] Add an API to load DataFrame f...

2017-02-15 Thread pjfanning
Github user pjfanning commented on a diff in the pull request:

https://github.com/apache/spark/pull/16895#discussion_r101460138
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1364,10 +1364,11 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
 })
   }
 
-  test("SPARK-6245 JsonRDD.inferSchema on empty RDD") {
+  test("SPARK-6245 JsonRDD.inferSchema on empty Dataset") {
 // This is really a test that it doesn't throw an exception
+val emptyDataset = spark.createDataset(empty)(Encoders.STRING)
--- End diff --

RDD[_] only gains a toDS() function when SQLImplicits applies an implicit conversion that wraps the RDD in a DatasetHolder, which requires `import sparkSession.sqlContext.implicits._`.
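
A small self-contained sketch of the two approaches being compared (names here are illustrative):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

def example(spark: SparkSession): Unit = {
  val empty = spark.sparkContext.emptyRDD[String]

  // Option 1: explicit encoder, no implicit import required.
  val ds1 = spark.createDataset(empty)(Encoders.STRING)

  // Option 2: toDS() is only available after the implicits import, which wraps
  // the RDD in a DatasetHolder via SQLImplicits.rddToDatasetHolder.
  import spark.implicits._
  val ds2 = empty.toDS()

  assert(ds1.count() == 0 && ds2.count() == 0)
}
```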






[GitHub] spark issue #16947: [SPARK-19617][SS][WIP]Don't interrupt 'mkdirs' to workar...

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16947
  
**[Test build #72988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72988/testReport)** for PR 16947 at commit [`fb27a97`](https://github.com/apache/spark/commit/fb27a97b148d5e074227760ae83d6b2e95520ed7).





[GitHub] spark issue #16953: [SPARK-19622][WebUI]Fix a http error in a paged table wh...

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16953
  
Can one of the admins verify this patch?





[GitHub] spark pull request #16953: [SPARK-19622][WebUI]Fix a http error in a paged t...

2017-02-15 Thread stanzhai
GitHub user stanzhai opened a pull request:

https://github.com/apache/spark/pull/16953

[SPARK-19622][WebUI]Fix a http error in a paged table when using a `Go` 
button to search.

## What changes were proposed in this pull request?

The search function of the paged table is not available because we don't strip the hash data from the request path.


![](https://issues.apache.org/jira/secure/attachment/12852996/screenshot-1.png)

## How was this patch tested?

Tested manually with my browser.
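
To make the cause concrete, here is a minimal sketch of the idea of dropping the hash data from a request path (the function name is hypothetical and this is not the actual patch):

```scala
// Hypothetical helper: strip any "#..." fragment so the URL built from the
// request path for the paged-table search stays valid.
def stripHashData(requestPath: String): String = requestPath.split('#').head

// stripHashData("/stages/stage?id=1&attempt=0#tasks") == "/stages/stage?id=1&attempt=0"
```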

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/stanzhai/spark fix-webui-paged-table

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16953.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16953


commit a4364dace3a8305f5ef7627ce68973bf7b7f7c6b
Author: Stan Zhai 
Date:   2017-02-16T06:17:54Z

fixed a pagination bug of paged table.







[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101457828
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala
 ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.util.concurrent.{Executors, TimeUnit}
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.execution.datasources.FileStatusCache
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.hive.client.HiveClient
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.sql.types._
+
+class HiveSchemaInferenceSuite
+  extends QueryTest with TestHiveSingleton with SQLTestUtils with 
BeforeAndAfterEach {
+
+  import HiveSchemaInferenceSuite._
+
+  // Create a CatalogTable instance modeling an external Hive table in a 
metastore that isn't
+  // controlled by Spark (i.e. has no Spark-specific table properties set).
+  private def hiveExternalCatalogTable(
+  tableName: String,
+  location: String,
+  schema: StructType,
+  partitionColumns: Seq[String],
+  properties: Map[String, String] = Map.empty): CatalogTable = {
+CatalogTable(
+  identifier = TableIdentifier(table = tableName, database = 
Option("default")),
+  tableType = CatalogTableType.EXTERNAL,
+  storage = CatalogStorageFormat(
+locationUri = Option(location),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serialization.format" -> "1")),
+  schema = schema,
+  provider = Option("hive"),
+  partitionColumnNames = partitionColumns,
+  properties = properties)
+  }
+
+  // Creates CatalogTablePartition instances for adding partitions of data 
to our test table.
+  private def hiveCatalogPartition(location: String, index: Int): 
CatalogTablePartition
+= CatalogTablePartition(
+  spec = Map("partcol1" -> index.toString, "partcol2" -> 
index.toString),
+  storage = CatalogStorageFormat(
+locationUri = 
Option(s"${location}/partCol1=$index/partCol2=$index/"),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serialization.format" -> "1")))
+
+  // Creates a case-sensitive external Hive table for testing schema 
inference options. Table
+  // will not have Spark-specific table properties set.
+  private def setupCaseSensitiveTable(
+  tableName: String,
+  dir: File): Unit = {
+spark.range(NUM_RECORDS)
+  .selectExpr("id as fieldOne", "id as partCol1", "id as partCol2")
+  .write
+  .partitionBy("partCol1", "partCol2")
+  .mode("overwrite")
+  .parquet(dir.getAbsolutePath)
+
+val lowercaseSchema = StructType(Seq(
+  StructField("fieldone", LongType),
+  StructField("partcol1", IntegerType),
+  StructField("partcol2", IntegerType)))
+
+val client = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+val catalogTable = 

[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101457755
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala
 ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.util.concurrent.{Executors, TimeUnit}
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.execution.datasources.FileStatusCache
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.hive.client.HiveClient
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.sql.types._
+
+class HiveSchemaInferenceSuite
+  extends QueryTest with TestHiveSingleton with SQLTestUtils with 
BeforeAndAfterEach {
+
+  import HiveSchemaInferenceSuite._
+
+  // Create a CatalogTable instance modeling an external Hive table in a 
metastore that isn't
+  // controlled by Spark (i.e. has no Spark-specific table properties set).
+  private def hiveExternalCatalogTable(
+  tableName: String,
+  location: String,
+  schema: StructType,
+  partitionColumns: Seq[String],
+  properties: Map[String, String] = Map.empty): CatalogTable = {
+CatalogTable(
+  identifier = TableIdentifier(table = tableName, database = 
Option("default")),
+  tableType = CatalogTableType.EXTERNAL,
+  storage = CatalogStorageFormat(
+locationUri = Option(location),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serialization.format" -> "1")),
+  schema = schema,
+  provider = Option("hive"),
+  partitionColumnNames = partitionColumns,
+  properties = properties)
+  }
+
+  // Creates CatalogTablePartition instances for adding partitions of data 
to our test table.
+  private def hiveCatalogPartition(location: String, index: Int): 
CatalogTablePartition
+= CatalogTablePartition(
+  spec = Map("partcol1" -> index.toString, "partcol2" -> 
index.toString),
+  storage = CatalogStorageFormat(
+locationUri = 
Option(s"${location}/partCol1=$index/partCol2=$index/"),
+inputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"),
+outputFormat = 
Option("org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"),
+serde = 
Option("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"),
+compressed = false,
+properties = Map("serialization.format" -> "1")))
+
+  // Creates a case-sensitive external Hive table for testing schema 
inference options. Table
+  // will not have Spark-specific table properties set.
+  private def setupCaseSensitiveTable(
+  tableName: String,
+  dir: File): Unit = {
+spark.range(NUM_RECORDS)
+  .selectExpr("id as fieldOne", "id as partCol1", "id as partCol2")
+  .write
+  .partitionBy("partCol1", "partCol2")
+  .mode("overwrite")
+  .parquet(dir.getAbsolutePath)
+
+val lowercaseSchema = StructType(Seq(
+  StructField("fieldone", LongType),
+  StructField("partcol1", IntegerType),
+  StructField("partcol2", IntegerType)))
+
+val client = 
spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+
+val catalogTable = 

[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101457348
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSchemaInferenceSuite.scala
 ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import java.io.File
+import java.util.concurrent.{Executors, TimeUnit}
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.metrics.source.HiveCatalogMetrics
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.catalog._
+import org.apache.spark.sql.execution.datasources.FileStatusCache
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.hive.client.HiveClient
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+import org.apache.spark.sql.types._
+
+class HiveSchemaInferenceSuite
+  extends QueryTest with TestHiveSingleton with SQLTestUtils with 
BeforeAndAfterEach {
+
+  import HiveSchemaInferenceSuite._
+
+  // Create a CatalogTable instance modeling an external Hive table in a 
metastore that isn't
+  // controlled by Spark (i.e. has no Spark-specific table properties set).
--- End diff --

Is the comment `has no Spark-specific table properties set` accurate? The `properties` are actually passed in by the caller.





[GitHub] spark pull request #16895: [SPARK-15615][SQL] Add an API to load DataFrame f...

2017-02-15 Thread pjfanning
Github user pjfanning commented on a diff in the pull request:

https://github.com/apache/spark/pull/16895#discussion_r101456955
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1364,10 +1364,11 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
 })
   }
 
-  test("SPARK-6245 JsonRDD.inferSchema on empty RDD") {
+  test("SPARK-6245 JsonRDD.inferSchema on empty Dataset") {
 // This is really a test that it doesn't throw an exception
+val emptyDataset = spark.createDataset(empty)(Encoders.STRING)
--- End diff --

I can double-check, but the toDS call appears to require the Spark implicits import.





[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101456952
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -161,23 +161,49 @@ private[hive] class 
HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
   bucketSpec,
   Some(partitionSchema))
 
+val catalogTable = metastoreRelation.catalogTable
 val logicalRelation = cached.getOrElse {
   val sizeInBytes =
 
metastoreRelation.stats(sparkSession.sessionState.conf).sizeInBytes.toLong
   val fileIndex = {
-val index = new CatalogFileIndex(
-  sparkSession, metastoreRelation.catalogTable, sizeInBytes)
+val index = new CatalogFileIndex(sparkSession, catalogTable, 
sizeInBytes)
 if (lazyPruningEnabled) {
   index
 } else {
   index.filterPartitions(Nil)  // materialize all the 
partitions in memory
 }
   }
   val partitionSchemaColumnNames = 
partitionSchema.map(_.name.toLowerCase).toSet
-  val dataSchema =
-StructType(metastoreSchema
+  val filteredMetastoreSchema = StructType(metastoreSchema
   .filterNot(field => 
partitionSchemaColumnNames.contains(field.name.toLowerCase)))
--- End diff --

wrong indent.





[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101456890
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -161,23 +161,49 @@ private[hive] class 
HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
   bucketSpec,
   Some(partitionSchema))
 
+val catalogTable = metastoreRelation.catalogTable
 val logicalRelation = cached.getOrElse {
   val sizeInBytes =
 
metastoreRelation.stats(sparkSession.sessionState.conf).sizeInBytes.toLong
   val fileIndex = {
-val index = new CatalogFileIndex(
-  sparkSession, metastoreRelation.catalogTable, sizeInBytes)
+val index = new CatalogFileIndex(sparkSession, catalogTable, 
sizeInBytes)
 if (lazyPruningEnabled) {
   index
 } else {
   index.filterPartitions(Nil)  // materialize all the 
partitions in memory
 }
   }
   val partitionSchemaColumnNames = 
partitionSchema.map(_.name.toLowerCase).toSet
-  val dataSchema =
-StructType(metastoreSchema
+  val filteredMetastoreSchema = StructType(metastoreSchema
   .filterNot(field => 
partitionSchemaColumnNames.contains(field.name.toLowerCase)))
 
+  val inferenceMode = 
sparkSession.sessionState.conf.schemaInferenceMode
+  val dataSchema = if (inferenceMode != "NEVER_INFER" &&
+  !catalogTable.schemaFromTableProps) {
+val fileStatuses = fileIndex.listFiles(Nil).flatMap(_.files)
+val inferred = defaultSource.inferSchema(sparkSession, 
options, fileStatuses)
+val merged = if (fileType.equals("parquet")) {
+  
inferred.map(ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, _))
+} else {
+  inferred
+}
+if (inferenceMode == "INFER_AND_SAVE") {
+  // If a case-sensitive schema was successfully inferred, 
execute an alterTable
+  // operation to save the schema to the table properties.
+  merged.foreach { mergedSchema =>
+  val updatedTable = catalogTable.copy(schema = 
mergedSchema)
--- End diff --

Wrong indent style.





[GitHub] spark issue #16729: [SPARK-19391][SparkR][ML] Tweedie GLM API for SparkR

2017-02-15 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/16729
  
@felixcheung Thanks for the discussions. Will work on this in two weeks.





[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101456600
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -161,23 +161,49 @@ private[hive] class 
HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
   bucketSpec,
   Some(partitionSchema))
 
+val catalogTable = metastoreRelation.catalogTable
 val logicalRelation = cached.getOrElse {
   val sizeInBytes =
 
metastoreRelation.stats(sparkSession.sessionState.conf).sizeInBytes.toLong
   val fileIndex = {
-val index = new CatalogFileIndex(
-  sparkSession, metastoreRelation.catalogTable, sizeInBytes)
+val index = new CatalogFileIndex(sparkSession, catalogTable, 
sizeInBytes)
 if (lazyPruningEnabled) {
   index
 } else {
   index.filterPartitions(Nil)  // materialize all the 
partitions in memory
 }
   }
   val partitionSchemaColumnNames = 
partitionSchema.map(_.name.toLowerCase).toSet
-  val dataSchema =
-StructType(metastoreSchema
+  val filteredMetastoreSchema = StructType(metastoreSchema
   .filterNot(field => 
partitionSchemaColumnNames.contains(field.name.toLowerCase)))
 
+  val inferenceMode = 
sparkSession.sessionState.conf.schemaInferenceMode
+  val dataSchema = if (inferenceMode != "NEVER_INFER" &&
+  !catalogTable.schemaFromTableProps) {
+val fileStatuses = fileIndex.listFiles(Nil).flatMap(_.files)
+val inferred = defaultSource.inferSchema(sparkSession, 
options, fileStatuses)
--- End diff --

Can we log an info message that we are going to infer the schema (and save it to the metastore)?





[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101456277
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -186,8 +212,7 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
 fileFormat = defaultSource,
 options = options)(sparkSession = sparkSession)
 
-  val created = LogicalRelation(relation,
-catalogTable = Some(metastoreRelation.catalogTable))
+  val created = LogicalRelation(relation, catalogTable = 
Some(catalogTable))
--- End diff --

Once the catalog table info has been altered, shall we use the updated catalog table?





[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...

2017-02-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16944#discussion_r101454895
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -296,6 +296,17 @@ object SQLConf {
   .longConf
   .createWithDefault(250 * 1024 * 1024)
 
+  val HIVE_SCHEMA_INFERENCE_MODE = 
buildConf("spark.sql.hive.schemaInferenceMode")
+.doc("Configures the action to take when a case-sensitive schema 
cannot be read from a Hive " +
+  "table's properties. Valid options include INFER_AND_SAVE (infer the 
case-sensitive " +
+  "schema from the underlying data files and write it back to the 
table properties), " +
+  "INFER_ONLY (infer the schema but don't attempt to write it to the 
table properties) and " +
+  "NEVER_INFER (fallback to using the case-insensitive metastore 
schema instead of inferring).")
+.stringConf
+.transform(_.toUpperCase())
+.checkValues(Set("INFER_AND_SAVE", "INFER_ONLY", "NEVER_INFER"))
+.createWithDefault("INFER_AND_SAVE")
--- End diff --

Is `INFER_AND_SAVE` a good default value?





[GitHub] spark pull request #16895: [SPARK-15615][SQL] Add an API to load DataFrame f...

2017-02-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16895#discussion_r101454773
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala
 ---
@@ -1364,10 +1364,11 @@ class JsonSuite extends QueryTest with 
SharedSQLContext with TestJsonData {
 })
   }
 
-  test("SPARK-6245 JsonRDD.inferSchema on empty RDD") {
+  test("SPARK-6245 JsonRDD.inferSchema on empty Dataset") {
 // This is really a test that it doesn't throw an exception
+val emptyDataset = spark.createDataset(empty)(Encoders.STRING)
--- End diff --

doesn't `empty.toDS` work?





[GitHub] spark pull request #16948: [SPARK-19618][SQL] Inconsistency wrt max. buckets...

2017-02-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16948





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-15 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15125
  
I think @mallman is saying he would merge the changes into @dding3's branch.





[GitHub] spark pull request #16951: [SPARK-19619][SPARKR] SparkR approxQuantile suppo...

2017-02-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16951#discussion_r101454231
  
--- Diff: R/pkg/R/stats.R ---
@@ -149,15 +149,18 @@ setMethod("freqItems", signature(x = 
"SparkDataFrame", cols = "character"),
 #' This method implements a variation of the Greenwald-Khanna algorithm 
(with some speed
 #' optimizations). The algorithm was first present in 
[[http://dx.doi.org/10.1145/375663.375670
 #' Space-efficient Online Computation of Quantile Summaries]] by Greenwald 
and Khanna.
+#' Note that rows containing any NA values will be removed before 
calculation.
 #'
 #' @param x A SparkDataFrame.
-#' @param col The name of the numerical column.
+#' @param cols The names of the numerical columns.
 #' @param probabilities A list of quantile probabilities. Each number must 
belong to [0, 1].
 #'  For example 0 is the minimum, 0.5 is the median, 1 
is the maximum.
 #' @param relativeError The relative target precision to achieve (>= 0). 
If set to zero,
 #'  the exact quantiles are computed, which could be 
very expensive.
 #'  Note that values greater than 1 are accepted but 
give the same result as 1.
-#' @return The approximate quantiles at the given probabilities.
+#' @return The approximate quantiles at the given probabilities. The 
output should be a list,
+#' and each element in it is a list of numeric values which 
represents the approximate
+#' quantiles in corresponding column.
--- End diff --

Another alternative is to convert the result back into a vector iff `cols` has length 1, which matches the previous behavior.





[GitHub] spark pull request #16951: [SPARK-19619][SPARKR] SparkR approxQuantile suppo...

2017-02-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16951#discussion_r101454291
  
--- Diff: R/pkg/R/stats.R ---
@@ -149,15 +149,18 @@ setMethod("freqItems", signature(x = 
"SparkDataFrame", cols = "character"),
 #' This method implements a variation of the Greenwald-Khanna algorithm 
(with some speed
 #' optimizations). The algorithm was first present in 
[[http://dx.doi.org/10.1145/375663.375670
 #' Space-efficient Online Computation of Quantile Summaries]] by Greenwald 
and Khanna.
+#' Note that rows containing any NA values will be removed before 
calculation.
--- End diff --

it would be good to test that too...





[GitHub] spark issue #16948: [SPARK-19618][SQL] Inconsistency wrt max. buckets allowe...

2017-02-15 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16948
  
thanks, merging to master!





[GitHub] spark pull request #16951: [SPARK-19619][SPARKR] SparkR approxQuantile suppo...

2017-02-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16951#discussion_r101454138
  
--- Diff: R/pkg/R/stats.R ---
@@ -149,15 +149,18 @@ setMethod("freqItems", signature(x = 
"SparkDataFrame", cols = "character"),
 #' This method implements a variation of the Greenwald-Khanna algorithm 
(with some speed
 #' optimizations). The algorithm was first present in 
[[http://dx.doi.org/10.1145/375663.375670
 #' Space-efficient Online Computation of Quantile Summaries]] by Greenwald 
and Khanna.
+#' Note that rows containing any NA values will be removed before 
calculation.
 #'
 #' @param x A SparkDataFrame.
-#' @param col The name of the numerical column.
+#' @param cols The names of the numerical columns.
 #' @param probabilities A list of quantile probabilities. Each number must 
belong to [0, 1].
 #'  For example 0 is the minimum, 0.5 is the median, 1 
is the maximum.
 #' @param relativeError The relative target precision to achieve (>= 0). 
If set to zero,
 #'  the exact quantiles are computed, which could be 
very expensive.
 #'  Note that values greater than 1 are accepted but 
give the same result as 1.
-#' @return The approximate quantiles at the given probabilities.
+#' @return The approximate quantiles at the given probabilities. The 
output should be a list,
+#' and each element in it is a list of numeric values which 
represents the approximate
+#' quantiles in corresponding column.
--- End diff --

this would break the output format - I'm not sure if we should change this 
in place, rather than adding a new signature/return value for multiple columns


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16386: [SPARK-18352][SQL] Support parsing multiline json files

2017-02-15 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/16386
  
LGTM if the test can pass. It will be good if you can also address 
https://github.com/apache/spark/pull/16386#discussion_r100679183


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16909: [SPARK-13450] Introduce ExternalAppendOnlyUnsafeR...

2017-02-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16909#discussion_r101454158
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArray.scala
 ---
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import java.util.ConcurrentModificationException
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.{SparkEnv, TaskContext}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.expressions.UnsafeRow
+import 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.DefaultInitialSizeOfInMemoryBuffer
+import org.apache.spark.util.collection.unsafe.sort.{UnsafeExternalSorter, 
UnsafeSorterIterator}
+
+/**
+ * An append-only array for [[UnsafeRow]]s that spills content to disk 
when there a predefined
+ * threshold of rows is reached.
+ *
+ * Setting spill threshold faces following trade-off:
+ *
+ * - If the spill threshold is too high, the in-memory array may occupy 
more memory than is
+ *   available, resulting in OOM.
+ * - If the spill threshold is too low, we spill frequently and incur 
unnecessary disk writes.
+ *   This may lead to a performance regression compared to the normal case 
of using an
+ *   [[ArrayBuffer]] or [[Array]].
+ */
+private[sql] class ExternalAppendOnlyUnsafeRowArray(numRowsSpillThreshold: 
Int) extends Logging {
--- End diff --

From the comparison results, `ExternalUnsafeSorter` performs slightly 
better?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16951: [SPARK-19619][SPARKR] SparkR approxQuantile supports inp...

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16951
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72985/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16951: [SPARK-19619][SPARKR] SparkR approxQuantile suppo...

2017-02-15 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16951#discussion_r101453993
  
--- Diff: R/pkg/R/stats.R ---
@@ -149,15 +149,18 @@ setMethod("freqItems", signature(x = 
"SparkDataFrame", cols = "character"),
 #' This method implements a variation of the Greenwald-Khanna algorithm 
(with some speed
 #' optimizations). The algorithm was first present in 
[[http://dx.doi.org/10.1145/375663.375670
 #' Space-efficient Online Computation of Quantile Summaries]] by Greenwald 
and Khanna.
+#' Note that rows containing any NA values will be removed before 
calculation.
 #'
 #' @param x A SparkDataFrame.
-#' @param col The name of the numerical column.
+#' @param cols The names of the numerical columns.
--- End diff --

we should be clear that this can be one or more columns


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16952: [SPARK-19620][SQL]Fix incorrect exchange coordinator id ...

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16952
  
**[Test build #72987 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72987/testReport)**
 for PR 16952 at commit 
[`b2eb68a`](https://github.com/apache/spark/commit/b2eb68afee9673274122f241f5d9eb64142a509f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16386#discussion_r101453945
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala 
---
@@ -151,6 +153,8 @@ final class DataStreamReader private[sql](sparkSession: 
SparkSession) extends Lo
* 
* `maxFilesPerTrigger` (default: no max limit): sets the maximum 
number of new files to be
* considered in every trigger.
+   * `wholeFile` (default `false`): parse one record, which may span 
multiple lines,
--- End diff --

same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16951: [SPARK-19619][SPARKR] SparkR approxQuantile supports inp...

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16951
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16951: [SPARK-19619][SPARKR] SparkR approxQuantile supports inp...

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16951
  
**[Test build #72985 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72985/testReport)**
 for PR 16951 at commit 
[`4369f10`](https://github.com/apache/spark/commit/4369f104c9162f0834aa7b98205ab1747723f9e3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16952: [SPARK-19620][SQL]Fix incorrect exchange coordina...

2017-02-15 Thread carsonwang
GitHub user carsonwang opened a pull request:

https://github.com/apache/spark/pull/16952

[SPARK-19620][SQL]Fix incorrect exchange coordinator id in the physical plan

## What changes were proposed in this pull request?
When adaptive execution is enabled, an exchange coordinator is used in the 
Exchange operators. For Join, the same exchange coordinator is used for its two 
Exchanges, but the physical plan shows two different coordinator IDs, which is 
confusing.

This PR fixes the incorrect exchange coordinator id in the physical plan. The 
identity hash code should be generated from the coordinator object itself rather than 
from the `Option[ExchangeCoordinator]` wrapper, so both Exchanges of a Join report the 
same coordinator id (see the sketch below).
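
A minimal, self-contained illustration of the underlying issue (plain Scala with a placeholder object; this is not the PR's code): wrapping the same coordinator in two separate `Some(...)` instances yields two different identity hash codes, while hashing the coordinator object itself gives one stable id.

```scala
object CoordinatorIdDemo extends App {
  val coordinator = new Object()                    // stands in for a single ExchangeCoordinator
  val left: Option[AnyRef] = Some(coordinator)      // reference held by the first Exchange
  val right: Option[AnyRef] = Some(coordinator)     // reference held by the second Exchange

  // Hashing the Option wrappers: two distinct ids for the same coordinator.
  println(System.identityHashCode(left) == System.identityHashCode(right))          // false
  // Hashing the coordinator itself: the same id from both Exchanges.
  println(System.identityHashCode(left.get) == System.identityHashCode(right.get))  // true
}
```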

## How was this patch tested?
manual tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/carsonwang/spark FixCoordinatorId

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16952.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16952


commit b2eb68afee9673274122f241f5d9eb64142a509f
Author: Carson Wang 
Date:   2017-02-16T06:22:45Z

Fix incorrect exchange coordinator id




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16386#discussion_r101453441
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -332,20 +336,21 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
* @since 1.4.0
*/
   def json(jsonRDD: RDD[String]): DataFrame = {
-val parsedOptions: JSONOptions =
-  new JSONOptions(extraOptions.toMap, 
sparkSession.sessionState.conf.sessionLocalTimeZone)
-val columnNameOfCorruptRecord =
-  parsedOptions.columnNameOfCorruptRecord
-
.getOrElse(sparkSession.sessionState.conf.columnNameOfCorruptRecord)
+val parsedOptions = new JSONOptions(extraOptions.toMap,
--- End diff --

nit: the style should be
```
new XXX(
  para1,
  para2,
  para3)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16386#discussion_r101453350
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -261,14 +261,18 @@ class DataFrameReader private[sql](sparkSession: 
SparkSession) extends Logging {
   }
 
   /**
-   * Loads a JSON file (<a href="http://jsonlines.org/">JSON Lines text format or
-   * newline-delimited JSON</a>) and returns the result as a `DataFrame`.
+   * Loads a JSON file and returns the results as a `DataFrame`.
+   *
+   * Both JSON (one record per file) and <a href="http://jsonlines.org/">JSON Lines</a>
+   * (newline-delimited JSON) are supported and can be selected with the `wholeFile` option.
*
* This function goes through the input once to determine the input 
schema. If you know the
* schema in advance, use the version that specifies the schema to avoid 
the extra scan.
*
* You can set the following JSON-specific options to deal with 
non-standard JSON files:
* 
+   * `wholeFile` (default `false`): parse one record, which may span 
multiple lines,
--- End diff --

please move it to the end


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16949: [SPARK-16122][CORE] Add rest api for job environment

2017-02-15 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/16949
  
cc @srowen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16948: [SPARK-19618][SQL] Inconsistency wrt max. buckets allowe...

2017-02-15 Thread tejasapatil
Github user tejasapatil commented on the issue:

https://github.com/apache/spark/pull/16948
  
@cloud-fan : this is as per our discussion in 
https://github.com/apache/spark/pull/16931#discussion_r101170361


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16915: [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN s...

2017-02-15 Thread kevinyu98
Github user kevinyu98 commented on the issue:

https://github.com/apache/spark/pull/16915
  
@gatorsmile thanks a lot.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16386#discussion_r101451711
  
--- Diff: python/pyspark/sql/streaming.py ---
@@ -442,6 +444,8 @@ def json(self, path, schema=None, 
primitivesAsString=None, prefersDecimal=None,
 :param path: string represents path to the JSON dataset,
  or RDD of Strings storing JSON objects.
 :param schema: an optional :class:`pyspark.sql.types.StructType` 
for the input schema.
+:param wholeFile: parse one record, which may span multiple lines, 
per file. If None is
--- End diff --

same here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16386: [SPARK-18352][SQL] Support parsing multiline json...

2017-02-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16386#discussion_r101451677
  
--- Diff: python/pyspark/sql/readwriter.py ---
@@ -159,18 +159,21 @@ def json(self, path, schema=None, 
primitivesAsString=None, prefersDecimal=None,
  allowComments=None, allowUnquotedFieldNames=None, 
allowSingleQuotes=None,
  allowNumericLeadingZero=None, 
allowBackslashEscapingAnyCharacter=None,
  mode=None, columnNameOfCorruptRecord=None, dateFormat=None, 
timestampFormat=None,
- timeZone=None):
+ timeZone=None, wholeFile=None):
 """
-Loads a JSON file (`JSON Lines text format or newline-delimited JSON
-<http://jsonlines.org/>`_) or an RDD of Strings storing JSON objects (one object per
-record) and returns the result as a :class`DataFrame`.
+Loads a JSON file and returns the results as a :class:`DataFrame`.
+
+Both JSON (one record per file) and `JSON Lines <http://jsonlines.org/>`_
+(newline-delimited JSON) are supported and can be selected with 
the `wholeFile` parameter.
 
 If the ``schema`` parameter is not specified, this function goes
 through the input once to determine the input schema.
 
 :param path: string represents path to the JSON dataset,
  or RDD of Strings storing JSON objects.
 :param schema: an optional :class:`pyspark.sql.types.StructType` 
for the input schema.
+:param wholeFile: parse one record, which may span multiple lines, 
per file. If None is
--- End diff --

the parameter docs follow the same order as the parameter list; let's 
move the `wholeFile` doc to the end


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16620: [SPARK-19263] DAGScheduler should avoid sending conflict...

2017-02-15 Thread kayousterhout
Github user kayousterhout commented on the issue:

https://github.com/apache/spark/pull/16620
  
LGTM! Thanks for finding this subtle bug and all of the hard work to fix it 
@jinxing64. I'll wait until tomorrow to merge this to give Mark and Imran a 
chance for any last comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16949: [SPARK-16122][CORE] Add rest api for job environment

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16949
  
**[Test build #72986 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72986/testReport)**
 for PR 16949 at commit 
[`a3d2abb`](https://github.com/apache/spark/commit/a3d2abbc05948a425fbeadb2af00438087f7eb58).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16818: [SPARK-19451][SQL][Core] Underlying integer overf...

2017-02-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/16818#discussion_r101451202
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala ---
@@ -180,16 +180,20 @@ class WindowSpec private[sql](
   private def between(typ: FrameType, start: Long, end: Long): WindowSpec 
= {
 val boundaryStart = start match {
   case 0 => CurrentRow
-  case Long.MinValue => UnboundedPreceding
-  case x if x < 0 => ValuePreceding(-start.toInt)
-  case x if x > 0 => ValueFollowing(start.toInt)
+  case x if x < Int.MinValue => UnboundedPreceding
--- End diff --

cc @hvanhovell any ideas?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16949: [SPARK-16122][CORE] Add rest api for job environment

2017-02-15 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/16949
  
 Jenkins crashed. retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16951: [SPARK-19619][SPARKR] SparkR approxQuantile supports inp...

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16951
  
**[Test build #72985 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72985/testReport)**
 for PR 16951 at commit 
[`4369f10`](https://github.com/apache/spark/commit/4369f104c9162f0834aa7b98205ab1747723f9e3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16951: [SPARK-19619][SPARKR] SparkR approxQuantile suppo...

2017-02-15 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/16951

[SPARK-19619][SPARKR] SparkR approxQuantile supports input multiple columns

## What changes were proposed in this pull request?
SparkR ```approxQuantile``` supports input multiple columns.

## How was this patch tested?
Unit test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-19619

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16951.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16951


commit 4369f104c9162f0834aa7b98205ab1747723f9e3
Author: Yanbo Liang 
Date:   2017-02-16T05:50:46Z

SparkR approxQuantile support multiple columns




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16949: [SPARK-16122][CORE] Add rest api for job environment

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16949
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72979/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16949: [SPARK-16122][CORE] Add rest api for job environment

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16949
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16949: [SPARK-16122][CORE] Add rest api for job environment

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16949
  
**[Test build #72979 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72979/testReport)**
 for PR 16949 at commit 
[`a3d2abb`](https://github.com/apache/spark/commit/a3d2abbc05948a425fbeadb2af00438087f7eb58).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16944: [SPARK-19611][SQL] Introduce configurable table schema i...

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16944
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72978/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16944: [SPARK-19611][SQL] Introduce configurable table schema i...

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16944
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16722: [SPARK-19591][ML][MLlib] Add sample weights to de...

2017-02-15 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16722#discussion_r101449037
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/BaggedPoint.scala ---
@@ -60,12 +68,14 @@ private[spark] object BaggedPoint {
   subsamplingRate: Double,
   numSubsamples: Int,
   withReplacement: Boolean,
+  extractSampleWeight: (Datum => Double) = (_: Datum) => 1.0,
   seed: Long = Utils.random.nextLong()): RDD[BaggedPoint[Datum]] = {
+// TODO: implement weighted bootstrapping
--- End diff --

There was some discussion on the JIRA about it. Actually, we may or may not 
do this, so I'll remove it in the next commit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16944: [SPARK-19611][SQL] Introduce configurable table schema i...

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16944
  
**[Test build #72978 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72978/testReport)**
 for PR 16944 at commit 
[`8ac3b04`](https://github.com/apache/spark/commit/8ac3b04653c29c91d727f51906fa7dd51c2d08b8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16722: [SPARK-19591][ML][MLlib] Add sample weights to de...

2017-02-15 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16722#discussion_r101448976
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/BaggedPoint.scala ---
@@ -82,16 +92,16 @@ private[spark] object BaggedPoint {
   val rng = new XORShiftRandom
   rng.setSeed(seed + partitionIndex + 1)
   instances.map { instance =>
-val subsampleWeights = new Array[Double](numSubsamples)
+val subsampleCounts = new Array[Int](numSubsamples)
 var subsampleIndex = 0
 while (subsampleIndex < numSubsamples) {
   val x = rng.nextDouble()
-  subsampleWeights(subsampleIndex) = {
-if (x < subsamplingRate) 1.0 else 0.0
+  subsampleCounts(subsampleIndex) = {
+if (x < subsamplingRate) 1 else 0
   }
   subsampleIndex += 1
 }
-new BaggedPoint(instance, subsampleWeights)
+new BaggedPoint(instance, subsampleCounts, 1.0)
--- End diff --

Sample weights for sampling with/without replacement are effectively not 
implemented yet. We'll need to think about it for RandomForest though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16722: [SPARK-19591][ML][MLlib] Add sample weights to de...

2017-02-15 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16722#discussion_r101448658
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -351,6 +370,36 @@ class DecisionTreeClassifierSuite
 dt.fit(df)
   }
 
+  test("training with sample weights") {
+val df = linearMulticlassDataset
+val numClasses = 3
+val predEquals = (x: Double, y: Double) => x == y
+// (impurity, maxDepth)
+val testParams = Seq(
+  ("gini", 10),
+  ("entropy", 10),
+  ("gini", 5)
+)
+for ((impurity, maxDepth) <- testParams) {
+  val estimator = new DecisionTreeClassifier()
+.setMaxDepth(maxDepth)
+.setSeed(seed)
+.setMinWeightFractionPerNode(0.049)
--- End diff --

`org.apache.spark.ml.param.ParamsSuite`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16722: [SPARK-19591][ML][MLlib] Add sample weights to de...

2017-02-15 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16722#discussion_r101448626
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/DecisionTreeMetadata.scala 
---
@@ -115,7 +122,10 @@ private[spark] object DecisionTreeMetadata extends 
Logging {
 }
 require(numFeatures > 0, s"DecisionTree requires number of features > 
0, " +
   s"but was given an empty features vector")
-val numExamples = input.count()
+val (numExamples, weightSum) = input.aggregate((0L, 0.0))(
--- End diff --

No, I haven't. I think it's very low risk, as you say.
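
For readers unfamiliar with the two-function `aggregate` call in the hunk above, here is a minimal sketch of computing an example count and a weight sum in one pass (a local `Seq` stands in for the RDD; the weights are made up):

```scala
val weights = Seq(1.0, 0.5, 2.0)   // placeholder instance weights

val (numExamples, weightSum) = weights.aggregate((0L, 0.0))(
  (acc, w) => (acc._1 + 1L, acc._2 + w),    // seqOp: fold one instance into the running (count, sum)
  (a, b) => (a._1 + b._1, a._2 + b._2))     // combOp: merge two partial accumulators

println(s"numExamples=$numExamples, weightSum=$weightSum")   // numExamples=3, weightSum=3.5
```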


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16844: [SPARK-19500] [SQL] Fix off-by-one bug in BytesToBytesMa...

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16844
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16844: [SPARK-19500] [SQL] Fix off-by-one bug in BytesToBytesMa...

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16844
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72977/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16844: [SPARK-19500] [SQL] Fix off-by-one bug in BytesToBytesMa...

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16844
  
**[Test build #72977 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72977/testReport)**
 for PR 16844 at commit 
[`8f098aa`](https://github.com/apache/spark/commit/8f098aa762a711dc6b8f915ae22887b61b641f49).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14412: [SPARK-15355] [CORE] Proactive block replication

2017-02-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14412#discussion_r101447842
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala 
---
@@ -65,6 +66,8 @@ class BlockManagerMasterEndpoint(
 mapper
   }
 
+  val proactivelyReplicate = 
conf.get("spark.storage.replication.proactive", "false").toBoolean
--- End diff --

Please document this new configuration in `docs/configuration.md`.
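
For context on what would be documented, a hypothetical usage sketch of the flag (the property name is taken from the diff above and could still change during review):

```scala
import org.apache.spark.SparkConf

// Hypothetical example: turning the proposed flag on; it defaults to "false" per the diff above.
val conf = new SparkConf()
  .setAppName("proactive-replication-demo")
  .set("spark.storage.replication.proactive", "true")

// Mirrors how the endpoint reads it: a string property parsed into a Boolean.
val proactivelyReplicate = conf.get("spark.storage.replication.proactive", "false").toBoolean
println(proactivelyReplicate)   // true
```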


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101448019
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

Do we need to check if `bytes` is already a direct buffer?
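
For illustration only, the kind of check being asked about looks like this with plain `java.nio` buffers (the actual code path deals with Spark's `ChunkedByteBuffer`, so treat this as a sketch of the idea rather than the real fix):

```scala
import java.nio.ByteBuffer

// Copy to an off-heap buffer only when the input is not already direct.
def ensureDirect(buf: ByteBuffer): ByteBuffer =
  if (buf.isDirect) {
    buf                                           // already off-heap: no copy needed
  } else {
    val direct = ByteBuffer.allocateDirect(buf.remaining())
    direct.put(buf.duplicate())                   // duplicate() keeps the caller's position untouched
    direct.flip()                                 // make the copy readable from the start
    direct
  }

val heap = ByteBuffer.wrap(Array[Byte](1, 2, 3))
val direct = ByteBuffer.allocateDirect(3)
println(ensureDirect(heap).isDirect)      // true: the heap buffer was copied off-heap
println(ensureDirect(direct) eq direct)   // true: a direct buffer is returned as-is
```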


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14412: [SPARK-15355] [CORE] Proactive block replication

2017-02-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14412#discussion_r101447797
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala 
---
@@ -195,17 +198,39 @@ class BlockManagerMasterEndpoint(
 
 // Remove it from blockManagerInfo and remove all the blocks.
 blockManagerInfo.remove(blockManagerId)
+
 val iterator = info.blocks.keySet.iterator
 while (iterator.hasNext) {
   val blockId = iterator.next
   val locations = blockLocations.get(blockId)
   locations -= blockManagerId
   if (locations.size == 0) {
 blockLocations.remove(blockId)
+logWarning(s"No more replicas available for $blockId !")
+  } else if (proactivelyReplicate && (blockId.isRDD || 
blockId.isInstanceOf[TestBlockId])) {
+// only RDD blocks store data that users explicitly cache so we 
only need to proactively
+// replicate RDD blocks
+// broadcast related blocks exist on all executors, so we don't 
worry about them
+// we also need to replicate this behavior for test blocks for 
unit tests
+// we send a message to a randomly chosen executor location to 
replicate block
+// assuming single executor failure, we find out how many replicas 
existed before failure
+val maxReplicas = locations.size + 1
+
+val i = (new Random(blockId.hashCode)).nextInt(locations.size)
+val blockLocations = locations.toSeq
+val candidateBMId = blockLocations(i)
+val blockManager = blockManagerInfo.get(candidateBMId)
+if(blockManager.isDefined) {
+  val remainingLocations = locations.toSeq.filter(bm => bm != 
candidateBMId)
--- End diff --

Is it possible for this list to be empty in certain corner-cases? What 
happens if `ReplicateBlock` is called with an empty set of locations? Is it 
just a no-op in that case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14412: [SPARK-15355] [CORE] Proactive block replication

2017-02-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14412#discussion_r101447703
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala 
---
@@ -195,17 +198,39 @@ class BlockManagerMasterEndpoint(
 
 // Remove it from blockManagerInfo and remove all the blocks.
 blockManagerInfo.remove(blockManagerId)
+
 val iterator = info.blocks.keySet.iterator
 while (iterator.hasNext) {
   val blockId = iterator.next
   val locations = blockLocations.get(blockId)
   locations -= blockManagerId
   if (locations.size == 0) {
 blockLocations.remove(blockId)
+logWarning(s"No more replicas available for $blockId !")
+  } else if (proactivelyReplicate && (blockId.isRDD || 
blockId.isInstanceOf[TestBlockId])) {
+// only RDD blocks store data that users explicitly cache so we 
only need to proactively
+// replicate RDD blocks
+// broadcast related blocks exist on all executors, so we don't 
worry about them
+// we also need to replicate this behavior for test blocks for 
unit tests
+// we send a message to a randomly chosen executor location to 
replicate block
+// assuming single executor failure, we find out how many replicas 
existed before failure
+val maxReplicas = locations.size + 1
+
+val i = (new Random(blockId.hashCode)).nextInt(locations.size)
+val blockLocations = locations.toSeq
+val candidateBMId = blockLocations(i)
+val blockManager = blockManagerInfo.get(candidateBMId)
+if(blockManager.isDefined) {
--- End diff --

If you're not going to have an `else` branch here then you might as well 
just `foreach` over the result of `blockManagerInfo.get(candidateBMId)`.
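
A tiny, self-contained sketch of that idiom (a plain `Map` and string values stand in for the real `blockManagerInfo`):

```scala
val blockManagerInfo = Map("bm-1" -> "info-for-bm-1")   // placeholder for the real mapping
val candidateBMId = "bm-1"

// Instead of:
//   val blockManager = blockManagerInfo.get(candidateBMId)
//   if (blockManager.isDefined) { ... blockManager.get ... }
// iterate over the Option directly; the body simply never runs when the key is absent.
blockManagerInfo.get(candidateBMId).foreach { info =>
  println(s"would send the replicate message via $info")   // stands in for the real RPC call
}
```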


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14412: [SPARK-15355] [CORE] Proactive block replication

2017-02-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14412#discussion_r101446619
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1131,14 +1131,43 @@ private[spark] class BlockManager(
   }
 
   /**
+   * Called for pro-active replenishment of blocks lost due to executor 
failures
+   *
+   * @param blockId blockId being replicate
+   * @param existingReplicas existing block managers that have a replica
+   * @param maxReplicas maximum replicas needed
+   * @return
+   */
+  def replicateBlock(
+blockId: BlockId,
--- End diff --

Same as Sameer's comment elsewhere in the code, we should fix the 
indentation here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16950: [SPARK-19399][SPARKR][BACKPORT-2.1] fix tests broken by ...

2017-02-15 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16950
  
merged to branch-2.1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14412: [SPARK-15355] [CORE] Proactive block replication

2017-02-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14412#discussion_r101447299
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1131,14 +1131,43 @@ private[spark] class BlockManager(
   }
 
   /**
+   * Called for pro-active replenishment of blocks lost due to executor 
failures
+   *
+   * @param blockId blockId being replicate
+   * @param existingReplicas existing block managers that have a replica
+   * @param maxReplicas maximum replicas needed
+   * @return
+   */
+  def replicateBlock(
+blockId: BlockId,
+existingReplicas: Set[BlockManagerId],
+maxReplicas: Int): Unit = {
+logInfo(s"Pro-actively replicating $blockId")
+val infoForReplication = blockInfoManager.lockForReading(blockId).map 
{ info =>
--- End diff --

This call acquires a read lock on the block, but when is that lock 
released? Per the Scaladoc of `doGetLocalBytes`, you need to be holding a read 
lock before calling that method, but upon successful return from that method 
the read lock will still be held by the caller.

I think what you want to do is acquire the lock, immediately call 
`doGetLocalBytes`, then begin a `try-finally` statement to call `replicate()` 
and unlock / release the lock in the `finally` block.
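
A minimal, self-contained sketch of that acquire/use/release shape (every function below is a placeholder, not the real `BlockManager` API):

```scala
def lockForReading(id: String): Option[String] = Some(s"info-for-$id")      // placeholder lock acquire
def doGetLocalBytes(id: String, info: String): Array[Byte] = Array[Byte](1, 2, 3)
def replicate(id: String, data: Array[Byte]): Unit = println(s"replicating $id (${data.length} bytes)")
def releaseLock(id: String): Unit = println(s"released read lock on $id")

val blockId = "rdd_0_0"
lockForReading(blockId).foreach { info =>
  val data = doGetLocalBytes(blockId, info)   // the read lock is still held at this point
  try {
    replicate(blockId, data)                  // do the work while the block stays pinned
  } finally {
    releaseLock(blockId)                      // release the lock even if replicate() throws
  }
}
```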


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16950: [SPARK-19399][SPARKR][BACKPORT-2.1] fix tests bro...

2017-02-15 Thread felixcheung
Github user felixcheung closed the pull request at:

https://github.com/apache/spark/pull/16950


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14412: [SPARK-15355] [CORE] Proactive block replication

2017-02-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14412#discussion_r101447394
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1131,14 +1131,43 @@ private[spark] class BlockManager(
   }
 
   /**
+   * Called for pro-active replenishment of blocks lost due to executor 
failures
+   *
+   * @param blockId blockId being replicate
+   * @param existingReplicas existing block managers that have a replica
+   * @param maxReplicas maximum replicas needed
+   * @return
+   */
+  def replicateBlock(
+blockId: BlockId,
+existingReplicas: Set[BlockManagerId],
+maxReplicas: Int): Unit = {
+logInfo(s"Pro-actively replicating $blockId")
+val infoForReplication = blockInfoManager.lockForReading(blockId).map 
{ info =>
--- End diff --

Also, I don't think there's a need to have separate `.map` and `.foreach` 
calls over the option. Instead, I think it would be clearer to avoid the 
assignment to the `infoForReplication` variable and just perform all of the 
work inside of a `.foreach` call on the `Option` with the block info.
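
A tiny sketch of that simplification with placeholder values:

```scala
val maybeInfo: Option[Int] = Some(42)   // placeholder for the block info Option

// Instead of staging the work through .map and then .foreach:
//   val staged = maybeInfo.map(i => i * 2)
//   staged.foreach(println)
// do everything inside a single .foreach over the Option:
maybeInfo.foreach { i =>
  val doubled = i * 2   // all of the work happens in one traversal
  println(doubled)
}
```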


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14412: [SPARK-15355] [CORE] Proactive block replication

2017-02-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14412#discussion_r101446632
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1131,14 +1131,43 @@ private[spark] class BlockManager(
   }
 
   /**
+   * Called for pro-active replenishment of blocks lost due to executor 
failures
+   *
+   * @param blockId blockId being replicate
+   * @param existingReplicas existing block managers that have a replica
+   * @param maxReplicas maximum replicas needed
+   * @return
--- End diff --

You can omit this `@return` since this method doesn't have a return value.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14412: [SPARK-15355] [CORE] Proactive block replication

2017-02-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14412#discussion_r101446759
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala 
---
@@ -195,17 +198,39 @@ class BlockManagerMasterEndpoint(
 
 // Remove it from blockManagerInfo and remove all the blocks.
 blockManagerInfo.remove(blockManagerId)
+
 val iterator = info.blocks.keySet.iterator
 while (iterator.hasNext) {
   val blockId = iterator.next
   val locations = blockLocations.get(blockId)
   locations -= blockManagerId
   if (locations.size == 0) {
 blockLocations.remove(blockId)
+logWarning(s"No more replicas available for $blockId !")
+  } else if (proactivelyReplicate && (blockId.isRDD || 
blockId.isInstanceOf[TestBlockId])) {
+// only RDD blocks store data that users explicitly cache so we 
only need to proactively
+// replicate RDD blocks
+// broadcast related blocks exist on all executors, so we don't 
worry about them
+// we also need to replicate this behavior for test blocks for 
unit tests
+// we send a message to a randomly chosen executor location to 
replicate block
+// assuming single executor failure, we find out how many replicas 
existed before failure
+val maxReplicas = locations.size + 1
+
+val i = (new Random(blockId.hashCode)).nextInt(locations.size)
+val blockLocations = locations.toSeq
+val candidateBMId = blockLocations(i)
+val blockManager = blockManagerInfo.get(candidateBMId)
+if(blockManager.isDefined) {
--- End diff --

Nit: space after `if`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14412: [SPARK-15355] [CORE] Proactive block replication

2017-02-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14412#discussion_r101446961
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1131,14 +1131,43 @@ private[spark] class BlockManager(
   }
 
   /**
+   * Called for pro-active replenishment of blocks lost due to executor 
failures
+   *
+   * @param blockId blockId being replicate
+   * @param existingReplicas existing block managers that have a replica
+   * @param maxReplicas maximum replicas needed
+   * @return
+   */
+  def replicateBlock(
+blockId: BlockId,
+existingReplicas: Set[BlockManagerId],
+maxReplicas: Int): Unit = {
+logInfo(s"Pro-actively replicating $blockId")
+val infoForReplication = blockInfoManager.lockForReading(blockId).map 
{ info =>
+  val data = doGetLocalBytes(blockId, info)
+  val storageLevel = StorageLevel(
+info.level.useDisk,
--- End diff --

Minor nit, but a problem with the `StorageLevel` constructor is that it has a bunch of adjacent boolean parameters. In cases like this I'd usually prefer to name all of the parameters explicitly at the call site, both to avoid errors should these lines ever get permuted and to convince readers that the API is being used correctly.

Thus I'd probably write each line like `useDisk = info.level.useDisk,` etc.
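For illustration, a minimal sketch of that style, assuming the block is re-created at its original storage level with the replication count bumped to `maxReplicas` (the helper name is hypothetical; `level` and `maxReplicas` stand in for the values available inside `replicateBlock`):

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical helper showing the named-argument call-site style.
def replicationLevel(level: StorageLevel, maxReplicas: Int): StorageLevel =
  StorageLevel(
    useDisk = level.useDisk,
    useMemory = level.useMemory,
    useOffHeap = level.useOffHeap,
    deserialized = level.deserialized,
    replication = maxReplicas)
```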





[GitHub] spark pull request #14412: [SPARK-15355] [CORE] Proactive block replication

2017-02-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14412#discussion_r101447547
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala 
---
@@ -195,17 +198,39 @@ class BlockManagerMasterEndpoint(
 
 // Remove it from blockManagerInfo and remove all the blocks.
 blockManagerInfo.remove(blockManagerId)
+
 val iterator = info.blocks.keySet.iterator
 while (iterator.hasNext) {
   val blockId = iterator.next
   val locations = blockLocations.get(blockId)
   locations -= blockManagerId
   if (locations.size == 0) {
 blockLocations.remove(blockId)
+logWarning(s"No more replicas available for $blockId !")
+  } else if (proactivelyReplicate && (blockId.isRDD || 
blockId.isInstanceOf[TestBlockId])) {
+// only RDD blocks store data that users explicitly cache so we 
only need to proactively
--- End diff --

+1 on Sameer's suggestions. This code is a little subtle and benefits from 
a clearer comment.
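Purely as an illustration of the kind of comment that might help (a sketch based on the existing wording, not the author's final text):

```scala
// Only RDD blocks hold data that users explicitly cached, so they are the
// only blocks we proactively re-replicate. Broadcast-related blocks already
// exist on every executor, so they need no action; TestBlockId is accepted
// as well so that unit tests can exercise this path. Assuming a single
// executor failure, the number of replicas before the failure was
// locations.size + 1, and we ask one randomly chosen surviving holder to
// re-replicate back up to that count.
val maxReplicas = locations.size + 1
```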





[GitHub] spark pull request #16915: [SPARK-18871][SQL][TESTS] New test cases for IN/N...

2017-02-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16915





[GitHub] spark pull request #14412: [SPARK-15355] [CORE] Proactive block replication

2017-02-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14412#discussion_r101447632
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala 
---
@@ -195,17 +198,39 @@ class BlockManagerMasterEndpoint(
 
 // Remove it from blockManagerInfo and remove all the blocks.
 blockManagerInfo.remove(blockManagerId)
+
 val iterator = info.blocks.keySet.iterator
 while (iterator.hasNext) {
   val blockId = iterator.next
   val locations = blockLocations.get(blockId)
   locations -= blockManagerId
   if (locations.size == 0) {
 blockLocations.remove(blockId)
+logWarning(s"No more replicas available for $blockId !")
+  } else if (proactivelyReplicate && (blockId.isRDD || 
blockId.isInstanceOf[TestBlockId])) {
+// only RDD blocks store data that users explicitly cache so we 
only need to proactively
+// replicate RDD blocks
+// broadcast related blocks exist on all executors, so we don't 
worry about them
+// we also need to replicate this behavior for test blocks for 
unit tests
+// we send a message to a randomly chosen executor location to 
replicate block
+// assuming single executor failure, we find out how many replicas 
existed before failure
+val maxReplicas = locations.size + 1
+
+val i = (new Random(blockId.hashCode)).nextInt(locations.size)
--- End diff --

Why do we need to use a fixed random seed here? Testing?

Also, isn't there a `Random.choice()` that you can use for this? Or a 
method like that in our own `Utils` class?
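If the fixed seed isn't needed outside of tests, a minimal sketch of an unseeded pick of one surviving replica holder could look like this (the helper name is hypothetical; `locations` stands in for the block's location set in the endpoint code above):

```scala
import scala.util.Random

import org.apache.spark.storage.BlockManagerId

// Hypothetical sketch: choose one existing replica holder uniformly at
// random, without a fixed seed.
def pickCandidate(locations: Iterable[BlockManagerId]): Option[BlockManagerId] = {
  val holders = locations.toSeq
  if (holders.isEmpty) None else Some(holders(Random.nextInt(holders.size)))
}
```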





[GitHub] spark issue #16950: [SPARK-19399][SPARKR][BACKPORT-2.1] fix tests broken by ...

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16950
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72982/
Test PASSed.





[GitHub] spark issue #16950: [SPARK-19399][SPARKR][BACKPORT-2.1] fix tests broken by ...

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16950
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16915: [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN s...

2017-02-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16915
  
LGTM. Merging to master.





[GitHub] spark issue #16950: [SPARK-19399][SPARKR][BACKPORT-2.1] fix tests broken by ...

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16950
  
**[Test build #72982 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72982/testReport)**
 for PR 16950 at commit 
[`2b98d74`](https://github.com/apache/spark/commit/2b98d749167442995fe28b4c5fe8b8220ce22643).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16841: [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN s...

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16841
  
**[Test build #72984 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72984/testReport)**
 for PR 16841 at commit 
[`850aacd`](https://github.com/apache/spark/commit/850aacdec86a1dc1ffc4c4f1b77f828f4aa1078f).





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101447383
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+if (!level.useOffHeap) {
--- End diff --

Do we need to call dispose on an on-heap byte buffer? I think only off-heap byte buffers need to be disposed?





[GitHub] spark issue #16841: [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN s...

2017-02-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16841
  
ok to test





[GitHub] spark issue #16776: [SPARK-19436][SQL] Add missing tests for approxQuantile

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16776
  
**[Test build #72983 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72983/testReport)**
 for PR 16776 at commit 
[`2268d36`](https://github.com/apache/spark/commit/2268d360206fd3e262d316ba3e02b35a525796da).





[GitHub] spark issue #15468: [SPARK-17915][SQL] Prepare a new ColumnVector implementa...

2017-02-15 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/15468
  
@sameeragarwal would it be possible to review this now that I have resolved the conflict?





[GitHub] spark issue #16945: [SPARK-19616][SparkR]:weightCol and aggregationDepth sho...

2017-02-15 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/16945
  
Could you add some tests (I don't think we need one for each) checking that the weight col is set properly?





[GitHub] spark issue #16620: [SPARK-19263] DAGScheduler should avoid sending conflict...

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16620
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72974/
Test PASSed.





[GitHub] spark issue #16620: [SPARK-19263] DAGScheduler should avoid sending conflict...

2017-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16620
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16620: [SPARK-19263] DAGScheduler should avoid sending conflict...

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16620
  
**[Test build #72974 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72974/testReport)**
 for PR 16620 at commit 
[`6809d1f`](https://github.com/apache/spark/commit/6809d1ff5d09693e961087da35c8f6b3b50fe53c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16934: [SPARK-19603][SS]Fix StreamingQuery explain comma...

2017-02-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16934





[GitHub] spark issue #16934: [SPARK-19603][SS]Fix StreamingQuery explain command

2017-02-15 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/16934
  
Thanks! Merging to master and 2.1.





[GitHub] spark issue #16950: [SPARK-19399][SPARKR][BACKPORT-2.1] fix tests broken by ...

2017-02-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16950
  
**[Test build #72982 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72982/testReport)**
 for PR 16950 at commit 
[`2b98d74`](https://github.com/apache/spark/commit/2b98d749167442995fe28b4c5fe8b8220ce22643).





[GitHub] spark issue #16512: [SPARK-18335][SPARKR] createDataFrame to support numPart...

2017-02-15 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/16512
  
At least from the R code perspective this is source-compatible with existing 2.1, as adding an optional parameter at the end should not break any existing code. Also, I am not sure I would call this a new API, as it's adding an option to an existing function?
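The actual change is in SparkR's R code, but the underlying source-compatibility argument can be sketched in Scala with hypothetical names: a trailing parameter with a default value leaves existing call sites untouched.

```scala
// Hypothetical toy function standing in for createDataFrame; only the shape
// of the signature change matters here.
def makePartitions(data: Seq[Int], numPartitions: Int = 1): Seq[Seq[Int]] =
  data.grouped(math.max(1, data.size / numPartitions)).toSeq

// Existing call site keeps compiling and gets the default:
val one = makePartitions(Seq(1, 2, 3, 4))
// New call site opts into the added parameter:
val two = makePartitions(Seq(1, 2, 3, 4), numPartitions = 2)
```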




