[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-15 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19497
  
To clarify, we can look at changing the behavior (if required) in the future, but 
that should be an explicit design choice informed by the Hadoop committer 
design. Until then, we should aim to interoperate.


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19218
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82784/
Test FAILed.


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19218
  
**[Test build #82784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82784/testReport)** for PR 19218 at commit [`dfb36d9`](https://github.com/apache/spark/commit/dfb36d9dd040e19babdb17c4b8396654ebf18787).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19218
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

2017-10-15 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/19497
  
@HyukjinKwon My intention was to preserve the earlier behavior.
In particular, for non-path-based committers the `path` variable and its 
use/processing are not relevant, so it makes more sense to skip that codepath 
entirely.
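For illustration, a minimal sketch of the idea (assumed shape, not the PR's actual code): treat the output path as optional so non-path-based committers never touch the path-handling logic.

```scala
import org.apache.hadoop.fs.Path

// Hypothetical helper: a null or empty output path is legal for OutputFormats
// that are not path-based (e.g. ones writing to an external store), so we skip
// the path codepath instead of failing while building path-related state.
def outputPathOption(path: String): Option[Path] =
  Option(path).filter(_.nonEmpty).map(new Path(_))
```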


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19218
  
**[Test build #82784 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82784/testReport)** for PR 19218 at commit [`dfb36d9`](https://github.com/apache/spark/commit/dfb36d9dd040e19babdb17c4b8396654ebf18787).


---




[GitHub] spark pull request #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite ...

2017-10-15 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19500#discussion_r144759023
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton
   }
 
   test("test statistics of LogicalRelation converted from Hive serde tables") {
-    val parquetTable = "parquetTable"
-    val orcTable = "orcTable"
-    withTable(parquetTable, orcTable) {
-      sql(s"CREATE TABLE $parquetTable (key STRING, value STRING) STORED AS PARQUET")
-      sql(s"CREATE TABLE $orcTable (key STRING, value STRING) STORED AS ORC")
-      sql(s"INSERT INTO TABLE $parquetTable SELECT * FROM src")
-      sql(s"INSERT INTO TABLE $orcTable SELECT * FROM src")
-
-      // the default value for `spark.sql.hive.convertMetastoreParquet` is true, here we just set it
-      // for robustness
-      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "true") {
-        checkTableStats(parquetTable, hasSizeInBytes = false, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $parquetTable COMPUTE STATISTICS")
-        checkTableStats(parquetTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
-      }
-      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
-        // We still can get tableSize from Hive before Analyze
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $orcTable COMPUTE STATISTICS")
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
+    Seq("orc", "parquet").foreach { format =>
--- End diff --

Maybe "orcTbl" and "parquetTbl" are better?


---




[GitHub] spark pull request #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite ...

2017-10-15 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19500#discussion_r144759077
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton
   }
 
   test("test statistics of LogicalRelation converted from Hive serde tables") {
-    val parquetTable = "parquetTable"
-    val orcTable = "orcTable"
-    withTable(parquetTable, orcTable) {
-      sql(s"CREATE TABLE $parquetTable (key STRING, value STRING) STORED AS PARQUET")
-      sql(s"CREATE TABLE $orcTable (key STRING, value STRING) STORED AS ORC")
-      sql(s"INSERT INTO TABLE $parquetTable SELECT * FROM src")
-      sql(s"INSERT INTO TABLE $orcTable SELECT * FROM src")
-
-      // the default value for `spark.sql.hive.convertMetastoreParquet` is true, here we just set it
-      // for robustness
-      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "true") {
-        checkTableStats(parquetTable, hasSizeInBytes = false, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $parquetTable COMPUTE STATISTICS")
-        checkTableStats(parquetTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
-      }
-      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
-        // We still can get tableSize from Hive before Analyze
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None)
--- End diff --

Could you explain why the orc table has a size before the analyze command while 
the parquet table does not?


---




[GitHub] spark pull request #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite ...

2017-10-15 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19500#discussion_r144759141
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton
   }
 
   test("test statistics of LogicalRelation converted from Hive serde tables") {
-    val parquetTable = "parquetTable"
-    val orcTable = "orcTable"
-    withTable(parquetTable, orcTable) {
-      sql(s"CREATE TABLE $parquetTable (key STRING, value STRING) STORED AS PARQUET")
-      sql(s"CREATE TABLE $orcTable (key STRING, value STRING) STORED AS ORC")
-      sql(s"INSERT INTO TABLE $parquetTable SELECT * FROM src")
-      sql(s"INSERT INTO TABLE $orcTable SELECT * FROM src")
-
-      // the default value for `spark.sql.hive.convertMetastoreParquet` is true, here we just set it
-      // for robustness
-      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "true") {
-        checkTableStats(parquetTable, hasSizeInBytes = false, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $parquetTable COMPUTE STATISTICS")
-        checkTableStats(parquetTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
-      }
-      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
-        // We still can get tableSize from Hive before Analyze
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $orcTable COMPUTE STATISTICS")
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
+    Seq("orc", "parquet").foreach { format =>
+      Seq(true, false).foreach { isConverted =>
+        withSQLConf(
+          HiveUtils.CONVERT_METASTORE_ORC.key -> s"$isConverted",
+          HiveUtils.CONVERT_METASTORE_PARQUET.key -> s"$isConverted") {
+          withTable(format) {
+            sql(s"CREATE TABLE $format (key STRING, value STRING) STORED AS $format")
+            sql(s"INSERT INTO TABLE $format SELECT * FROM src")
+
+            val hasHiveStats = !isConverted
--- End diff --

we can just inline this val


---




[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19501
  
thanks, merging to master!


---




[GitHub] spark issue #19482: [SPARK-22264][DEPLOY] Add timeout for eventlog replaying...

2017-10-15 Thread caneGuy
Github user caneGuy commented on the issue:

https://github.com/apache/spark/pull/19482
  
Could you help review this? @vanzin Thanks.


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19502
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #19501: [SPARK-22223][SQL] ObjectHashAggregate should not...

2017-10-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19501


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19502
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82778/
Test PASSed.


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19502
  
**[Test build #82778 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82778/testReport)** for PR 19502 at commit [`a554112`](https://github.com/apache/spark/commit/a554112daa60a130074709c77bcdffbd443a7c2f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19502
  
**[Test build #82783 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82783/testReport)** for PR 19502 at commit [`02792d9`](https://github.com/apache/spark/commit/02792d953872ed6d718aa9ceb247affee5004b66).


---




[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19458#discussion_r144758165
  
--- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolean
 
   /** List all the blocks currently stored on disk by the disk manager. */
   def getAllBlocks(): Seq[BlockId] = {
-    getAllFiles().map(f => BlockId(f.getName))
+    getAllFiles().flatMap { f =>
+      val blockId = BlockId.guess(f.getName)
--- End diff --

I think we don't need to log here.


---




[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19492#discussion_r144757908
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
@@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(Row(1, "haa")) :: Nil)
   }
 
+  test("SPARK-22228: from_json should support also arrays of primitive types") {
+    val dfInt = Seq("[1]", "[2, 3]").toDS()
+    checkAnswer(
+      dfInt.select(from_json($"value", ArrayType(IntegerType))),
+      Row(Seq(1)) :: Row(Seq(2, 3)) :: Nil)
+
+    val dfString = Seq("""["hello", "world", ""]""").toDS()
+    checkAnswer(
+      dfString.select(from_json($"value", ArrayType(StringType))),
+      Row(Seq("hello", "world", "")) :: Nil)
+
+    val dfTimestamp = Seq("""["26/08/2015 18:00"]""").toDS()
+    val schema = ArrayType(TimestampType)
+    val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm")
+
+    checkAnswer(
+      dfTimestamp.select(from_json($"value", schema, options)),
+      Row(Seq(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0"))) :: Nil)
+
+    val dfEmpty = Seq("""[]""").toDS()
+    checkAnswer(
+      dfEmpty.select(from_json($"value", ArrayType(StringType))),
+      Row(Nil) :: Nil)
--- End diff --

And also add a case that can fail the parsing, like `Seq("[1]", "[2, 3]", "[string]").toDS()`?

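A sketch of what such a case might look like, following the style of the quoted test (the expectation that `from_json` yields `null` for unparseable input in the default permissive mode is an assumption):

```scala
// Hypothetical extra case: the third element cannot be parsed as an integer array,
// so from_json is expected to produce a null result for that row.
val dfMixed = Seq("[1]", "[2, 3]", """["string"]""").toDS()
checkAnswer(
  dfMixed.select(from_json($"value", ArrayType(IntegerType))),
  Row(Seq(1)) :: Row(Seq(2, 3)) :: Row(null) :: Nil)
```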

---




[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19317#discussion_r144757881
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -180,6 +180,56 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * as in scala.TraversableOnce. The former operation is used for merging values within a
    * partition, and the latter is used for merging values between partitions. To avoid memory
    * allocation, both of these functions are allowed to modify and return their first argument
+   * instead of creating a new U. This method is different from the ordinary "aggregateByKey"
+   * method, it directly returns a map to the driver, rather than a rdd. This will also perform
+   * the merging locally on each mapper before sending results to a reducer, similarly to a
--- End diff --

what's the difference between `aggregateByKeyLocally` and 
`aggregateByKey(...).toLocalIterator`?


---




[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18747#discussion_r144757619
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -23,21 +23,53 @@ import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.QueryPlan
 import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning}
-import org.apache.spark.sql.execution.LeafExecNode
-import org.apache.spark.sql.execution.metric.SQLMetrics
+import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode}
+import org.apache.spark.sql.execution.vectorized._
 import org.apache.spark.sql.types.UserDefinedType
 
 
 case class InMemoryTableScanExec(
     attributes: Seq[Attribute],
     predicates: Seq[Expression],
     @transient relation: InMemoryRelation)
-  extends LeafExecNode {
+  extends LeafExecNode with ColumnarBatchScan {
 
   override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren
 
-  override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+  override def vectorTypes: Option[Seq[String]] =
+    Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName))
+
+  override val columnIndexes =
+    attributes.map(a => relation.output.map(o => o.exprId).indexOf(a.exprId)).toArray
+
+  override val supportCodegen: Boolean = relation.useColumnarBatches
+
+  private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = {
+    val rowCount = cachedColumnarBatch.numRows
+    val schema = cachedColumnarBatch.schema
+    val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, schema)
+    val columnarBatch = new ColumnarBatch(
+      schema, columnVectors.asInstanceOf[Array[ColumnVector]], rowCount)
+    columnarBatch.setNumRows(rowCount)
+
+    for (i <- 0 until cachedColumnarBatch.buffers.length) {
+      ColumnAccessor.decompress(
+        cachedColumnarBatch.buffers(i), columnarBatch.column(i).asInstanceOf[WritableColumnVector],
+        schema.fields(i).dataType, rowCount)
+    }
+    return columnarBatch
+  }
+
+  override def inputRDDs(): Seq[RDD[InternalRow]] = {
+    if (supportCodegen) {
+      val buffers = relation.cachedColumnBuffers
+      // HACK ALERT: This is actually an RDD[ColumnarBatch].
+      // We're taking advantage of Scala's type erasure here to pass these batches along.
+      Seq(buffers.map(createAndDecompressColumn(_)).asInstanceOf[RDD[InternalRow]])
--- End diff --

I think we need to apply column pruning here, instead of adding 
`columnIndexes` and asking `ColumnarBatchScan` to do it.


---




[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19492#discussion_r144757530
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
@@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(Row(1, "haa")) :: Nil)
   }
 
+  test("SPARK-22228: from_json should support also arrays of primitive types") {
+    val dfInt = Seq("[1]", "[2, 3]").toDS()
--- End diff --

Add a `[]` to the data?


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144757242
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive.orc
 
 import java.io.File
 
+import org.apache.orc.OrcConf.COMPRESS
 import org.scalatest.BeforeAndAfterAll
 
--- End diff --

Yep!


---




[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19492#discussion_r144757046
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -343,6 +367,25 @@ class JacksonParser(
       record: T,
       createParser: (JsonFactory, T) => JsonParser,
       recordLiteral: T => UTF8String): Seq[InternalRow] = {
+    parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
+      case rows: Seq[InternalRow] => rows
+      case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
+        new RuntimeException("Conversion of array of primitive data is not yet supported here."))
--- End diff --

This exception looks a bit weird. How about `` `parse` is only used to 
parse the JSON input to a set of `InternalRow`s. Use 
`parseWithArrayOfPrimitiveSupport` when parsing an array of primitive data is 
needed``?


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144756868
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive.orc
 
 import java.io.File
 
+import org.apache.orc.OrcConf.COMPRESS
 import org.scalatest.BeforeAndAfterAll
 
--- End diff --

Just like to be consistent. Thanks. :)


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144756726
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -255,10 +256,7 @@ private[orc] class OrcOutputWriter(
   }
 }
 
-private[orc] object OrcRelation extends HiveInspectors {
-  // The references of Hive's classes will be minimized.
-  val ORC_COMPRESSION = "orc.compress"
--- End diff --

Oh, right, the `parquet`.


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144756632
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive.orc
 
 import java.io.File
 
+import org.apache.orc.OrcConf.COMPRESS
 import org.scalatest.BeforeAndAfterAll
 
--- End diff --

If you want, no problem at all. :)


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19502
  
**[Test build #82782 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82782/testReport)** for PR 19502 at commit [`2b0a092`](https://github.com/apache/spark/commit/2b0a09258a35c07905bd1a0bb4e3bf0dba6b5e66).


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144756503
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive.orc
 
 import java.io.File
 
+import org.apache.orc.OrcConf.COMPRESS
 import org.scalatest.BeforeAndAfterAll
 
--- End diff --

Oh, I was thinking something like `COMPRESS.getAttribute.toUpperCase`.
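For reference, a small sketch of that suggestion (assuming `OrcConf.COMPRESS.getAttribute` returns the literal key `orc.compress`):

```scala
import org.apache.orc.OrcConf.COMPRESS

// Derive the option key from the ORC library instead of hard-coding the string,
// then vary its case to exercise the case-sensitivity behavior under test.
val key: String = COMPRESS.getAttribute      // "orc.compress"
val upperKey: String = key.toUpperCase       // "ORC.COMPRESS"
```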


---




[GitHub] spark issue #18747: [SPARK-20822][SQL] Generate code to directly get value f...

2017-10-15 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/18747
  
ping @cloud-fan 


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144756453
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -255,10 +256,7 @@ private[orc] class OrcOutputWriter(
   }
 }
 
-private[orc] object OrcRelation extends HiveInspectors {
-  // The references of Hive's classes will be minimized.
-  val ORC_COMPRESSION = "orc.compress"
--- End diff --

Yep. I agree. I don't think Apache ORC will change this in the future since 
this is a primitive configuration.
BTW, thank you for pointing to this doc. I'll fix some typos here.


---




[GitHub] spark issue #19363: [SPARK-22224][SQL]Override toString of KeyValue/Relation...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19363
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19363: [SPARK-22224][SQL]Override toString of KeyValue/Relation...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19363
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82776/
Test PASSed.


---




[GitHub] spark issue #19363: [SPARK-22224][SQL]Override toString of KeyValue/Relation...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19363
  
**[Test build #82776 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82776/testReport)** for PR 19363 at commit [`9166640`](https://github.com/apache/spark/commit/91666403e287015d953c4b105df6bc35dae966c3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-15 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/18732
  
Grouped UDFs, or Grouped Vectorized UDFs.



---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19502
  
**[Test build #82781 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82781/testReport)** for PR 19502 at commit [`b7335ac`](https://github.com/apache/spark/commit/b7335acbf456d123c8af2196e3f923fecc92d5a2).


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-15 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18732
  
How should we name it? Group-by vectorized UDFs?


---




[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...

2017-10-15 Thread ConeyLiu
Github user ConeyLiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19317#discussion_r144755666
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -180,6 +180,56 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * as in scala.TraversableOnce. The former operation is used for merging values within a
    * partition, and the latter is used for merging values between partitions. To avoid memory
    * allocation, both of these functions are allowed to modify and return their first argument
+   * instead of creating a new U. This method is different from the ordinary "aggregateByKey"
+   * method, it directly returns a map to the driver, rather than a rdd. This will also perform
+   * the merging locally on each mapper before sending results to a reducer, similarly to a
--- End diff --

Yeah, it will. Here the 'difference' means it directly returns a map to the 
driver rather than an RDD.
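For context, a minimal sketch of the contrast being discussed; the `aggregateByKeyLocally` signature is an assumption based on this thread, not a committed API:

```scala
import org.apache.spark.rdd.RDD

def sumByKey(pairs: RDD[(String, Int)]): Unit = {
  // Existing API: the result stays distributed as an RDD[(K, U)].
  val distributed: RDD[(String, Int)] = pairs.aggregateByKey(0)(_ + _, _ + _)

  // Round trip through the existing API: aggregate, then pull results to the driver.
  val collected: Map[String, Int] =
    pairs.aggregateByKey(0)(_ + _, _ + _).collectAsMap().toMap

  // Proposed API (assumed shape): merge map-side and return a Map[K, U] directly
  // to the driver, without materializing an intermediate RDD.
  // val local: Map[String, Int] = pairs.aggregateByKeyLocally(0)(_ + _, _ + _)
}
```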


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144755349
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive.orc
 
 import java.io.File
 
+import org.apache.orc.OrcConf.COMPRESS
 import org.scalatest.BeforeAndAfterAll
 
--- End diff --

This is a case-sensitivity test case. We should not change this.


---




[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19492#discussion_r144755348
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -89,6 +95,24 @@ class JacksonParser(
 
   /**
    * Create a converter which converts the JSON documents held by the `JsonParser`
+   * to a value according to a desired schema. This is an overloaded method to the
+   * previous one which allows to handle array of primitive types.
+   */
+  private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
+    (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
+      case START_ARRAY =>
+        val array = convertArray(parser, makeConverter(at.elementType))
+        if (array.numElements() == 0) {
+          Nil
+        } else {
+          array.toArray(at.elementType).toSeq
+        }
+      case _ => Nil
--- End diff --

Should we return `Nil` when the input is not parsed to an array? The original 
`makeRootConverter` didn't do this.


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144755323
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -180,7 +181,7 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
     // Respect `orc.compress`.
--- End diff --

Sure. Thank you for the review, @viirya.


---




[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-10-15 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/18732
  
I'm +0 for now.
I'm just wondering whether we can support struct types in vectorized UDFs 
when needed in the future.

As for adding a pandas UDAF, I think we need another decorator or something 
to specify whether it supports partial aggregation, plus the related parameters 
if needed.


---




[GitHub] spark issue #19494: [SPARK-22249][SQL] isin with empty list throws exception...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19494
  
**[Test build #82780 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82780/testReport)** for PR 19494 at commit [`f5bc105`](https://github.com/apache/spark/commit/f5bc10582f60bf130af1fefed5b9dc9fb3e748d9).


---




[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19501
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19501
  
**[Test build #82774 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82774/testReport)** for PR 19501 at commit [`45899fd`](https://github.com/apache/spark/commit/45899fd4fd4ee1efe5b75028c8fd9d2453c90aa8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19492#discussion_r144754632
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
 /**
  * Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
  */
-class JacksonParser(
-    schema: StructType,
+private[sql] class JacksonParser(
+    schema: DataType,
     val options: JSONOptions) extends Logging {
 
   import JacksonUtils._
   import com.fasterxml.jackson.core.JsonToken._
 
+  def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
+  def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
+
   // A `ValueConverter` is responsible for converting a value from `JsonParser`
   // to a value in a field for `InternalRow`.
   private type ValueConverter = JsonParser => AnyRef
 
   // `ValueConverter`s for the root schema for all fields in the schema
-  private val rootConverter = makeRootConverter(schema)
+  private val rootConverter = schema match {
+    case s: StructType => makeRootConverter(s)
--- End diff --

It is kind of easy to get confused. Please add a comment to each case, like:
```scala
private val rootConverter = schema match {
  case s: StructType => makeRootConverter(s) // For struct or array of struct.
  case a: ArrayType => makeRootConverter(a)  // For array of primitive types.
}
```


---




[GitHub] spark issue #19494: [SPARK-22249][SQL] isin with empty list throws exception...

2017-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19494
  
ok to test


---




[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19492#discussion_r144754775
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -89,6 +95,24 @@ class JacksonParser(
 
   /**
    * Create a converter which converts the JSON documents held by the `JsonParser`
+   * to a value according to a desired schema. This is an overloaded method to the
+   * previous one which allows to handle array of primitive types.
+   */
+  private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
+    (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
+      case START_ARRAY =>
+        val array = convertArray(parser, makeConverter(at.elementType))
--- End diff --

Move `makeConverter` outside the inner function so we can reuse it.


---




[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19501
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82774/
Test PASSed.


---




[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19492#discussion_r144754134
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
 /**
  * Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
  */
-class JacksonParser(
-    schema: StructType,
+private[sql] class JacksonParser(
+    schema: DataType,
     val options: JSONOptions) extends Logging {
 
   import JacksonUtils._
   import com.fasterxml.jackson.core.JsonToken._
 
+  def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
+  def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
--- End diff --

Are those necessary?


---




[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19492#discussion_r144754004
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
 /**
  * Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
--- End diff --

After this change, it doesn't always return `InternalRow`.


---




[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19317#discussion_r144753719
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -180,6 +180,56 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * as in scala.TraversableOnce. The former operation is used for merging values within a
    * partition, and the latter is used for merging values between partitions. To avoid memory
    * allocation, both of these functions are allowed to modify and return their first argument
+   * instead of creating a new U. This method is different from the ordinary "aggregateByKey"
+   * method, it directly returns a map to the driver, rather than a rdd. This will also perform
+   * the merging locally on each mapper before sending results to a reducer, similarly to a
--- End diff --

doesn't `aggregateByKey` perform map-side combine?


---




[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19492#discussion_r144753618
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -536,26 +536,31 @@ case class JsonToStructs(
       timeZoneId = None)
 
   override def checkInputDataTypes(): TypeCheckResult = schema match {
-    case _: StructType | ArrayType(_: StructType, _) =>
+    case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
       super.checkInputDataTypes()
     case _ => TypeCheckResult.TypeCheckFailure(
-      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+      s"Input schema ${schema.simpleString} must be a struct or " +
+        s"an array of structs or primitive types.")
   }
 
   @transient
-  lazy val rowSchema = schema match {
+  lazy val rowSchema: DataType = schema match {
--- End diff --

It's not a schema for rows anymore. Maybe `dataSchema`?


---




[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19363#discussion_r144753514
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala ---
@@ -46,6 +36,20 @@ abstract class QueryTest extends PlanTest {
   Locale.setDefault(Locale.US)
 
   /**
+   * Makes sure the answer matches the expected result
+   */
+  def checkString(expected: String, actual: String): Unit = {
+    if (expected != actual) {
+      fail(
+        "The actual string gives wrong result:\n\n" + sideBySide(
--- End diff --

it's kind of overkill; the `toString` we are testing always returns a 
single line, so a plain `assert` is enough.


---




[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19492#discussion_r144753534
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -536,26 +536,31 @@ case class JsonToStructs(
       timeZoneId = None)
 
   override def checkInputDataTypes(): TypeCheckResult = schema match {
-    case _: StructType | ArrayType(_: StructType, _) =>
+    case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
       super.checkInputDataTypes()
     case _ => TypeCheckResult.TypeCheckFailure(
-      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+      s"Input schema ${schema.simpleString} must be a struct or " +
+        s"an array of structs or primitive types.")
   }
 
   @transient
-  lazy val rowSchema = schema match {
+  lazy val rowSchema: DataType = schema match {
     case st: StructType => st
     case ArrayType(st: StructType, _) => st
+    case ArrayType(at: AtomicType, _) => ArrayType(at)
   }
 
   // This converts parsed rows to the desired output by the given schema.
   @transient
-  lazy val converter = schema match {
-    case _: StructType =>
-      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
-    case ArrayType(_: StructType, _) =>
-      (rows: Seq[InternalRow]) => new GenericArrayData(rows)
-  }
+  lazy val converter = (rows: Seq[Any]) =>
--- End diff --

This brings extra matching cost at runtime. Can we move the matching outside the closure?
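A minimal sketch of that suggestion, reusing the names from the quoted diff (the exact handling of the array branch is an assumption):

```scala
// Resolve the schema match once when the converter is built,
// rather than on every sequence of parsed values.
lazy val converter: Seq[Any] => Any = schema match {
  case _: StructType =>
    (rows: Seq[Any]) => if (rows.length == 1) rows.head else null
  case _: ArrayType =>
    (rows: Seq[Any]) => new GenericArrayData(rows)
}
```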


---




[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19363#discussion_r144753446
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/RelationalGroupedDatasetSuite.scala ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.test.SharedSQLContext
+
+class RelationalGroupedDatasetSuite extends QueryTest with SharedSQLContext {
--- End diff --

we can put the tests in `DatasetSuite`, as `Dataset.toString` is also 
tested there.


---




[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19363#discussion_r144753415
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/KeyValueGroupedDatasetSuite.scala ---
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.test.SharedSQLContext
+
+class KeyValueGroupedDatasetSuite extends QueryTest with SharedSQLContext {
--- End diff --

we can put the tests in `DatasetSuite`, as `Dataset.toString` is also 
tested there.


---




[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19363#discussion_r144753249
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
@@ -564,4 +564,25 @@ class KeyValueGroupedDataset[K, V] private[sql](
       encoder: Encoder[R]): Dataset[R] = {
     cogroup(other)((key, left, right) => f.call(key, left.asJava, right.asJava).asScala)(encoder)
   }
+
+  override def toString: String = {
+    val builder = new StringBuilder
+    val kFields = kExprEnc.schema.map {
+      case f => s"${f.name}: ${f.dataType.simpleString(2)}"
+    }
+    val vFields = vExprEnc.schema.map {
+      case f => s"${f.name}: ${f.dataType.simpleString(2)}"
+    }
+    builder.append("KeyValueGroupedDataset: [key: [")
+    builder.append(kFields.take(2).mkString(", "))
+    if (kFields.length > 2) {
+      builder.append(" ... " + (kFields.length - 2) + " more field(s)")
+    }
+    builder.append("], value: [")
+    builder.append(vFields.take(2).mkString(", "))
+    if (vFields.length > 2) {
+      builder.append(" ... " + (vFields.length - 2) + " more field(s)")
+    }
+    builder.append("]]").toString()
--- End diff --

shall we include `df` here?


---




[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19363#discussion_r144753173
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -465,6 +466,19 @@ class RelationalGroupedDataset protected[sql](
 
     Dataset.ofRows(df.sparkSession, plan)
   }
+
+  override def toString: String = {
+    val builder = new StringBuilder
+    builder.append("RelationalGroupedDataset: [groupingBy: [")
+    val kFields = groupingExprs.map(_.asInstanceOf[NamedExpression]).map {
+      case f => s"${f.name}: ${f.dataType.simpleString(2)}"
+    }
+    builder.append(kFields.take(2).mkString(", "))
+    if (kFields.length > 2) {
+      builder.append(" ... " + (kFields.length - 2) + " more field(s)")
+    }
+    builder.append(s"], df: ${df.toString}, type: $groupType]").toString()
--- End diff --

why include `df` here?


---




[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19363#discussion_r144753122
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -465,6 +466,19 @@ class RelationalGroupedDataset protected[sql](
 
     Dataset.ofRows(df.sparkSession, plan)
   }
+
+  override def toString: String = {
+    val builder = new StringBuilder
+    builder.append("RelationalGroupedDataset: [groupingBy: [")
--- End diff --

`groupingBy` -> `group by`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19363#discussion_r144753076
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
@@ -18,13 +18,13 @@
 package org.apache.spark.sql
 
 import scala.collection.JavaConverters._
+import scala.util.control.NonFatal
--- End diff --

unnecessary import


---




[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19363#discussion_r144753090
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -21,6 +21,7 @@ import java.util.Locale
 
 import scala.collection.JavaConverters._
 import scala.language.implicitConversions
+import scala.util.control.NonFatal
--- End diff --

ditto


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144752376
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -255,10 +256,7 @@ private[orc] class OrcOutputWriter(
   }
 }
 
-private[orc] object OrcRelation extends HiveInspectors {
-  // The references of Hive's classes will be minimized.
-  val ORC_COMPRESSION = "orc.compress"
--- End diff --

We have documented `orc.compress` as an option explicitly in 
https://github.com/apache/spark/blob/d8f45408635d4fccac557cb1e877dfe9267fb326/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L523

Now we depend on the configuration name from an external library. But I 
think the configuration name should never change, so this looks fine.


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144752155
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive.orc
 
 import java.io.File
 
+import org.apache.orc.OrcConf.COMPRESS
 import org.scalatest.BeforeAndAfterAll
 
--- End diff --

Another test in `OrcSourceSuite`:


https://github.com/apache/spark/blob/d8f45408635d4fccac557cb1e877dfe9267fb326/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala#L153


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19502
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19502
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82775/
Test PASSed.


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144751920
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -180,7 +181,7 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
     // Respect `orc.compress`.
--- End diff --

Also the test name `Respect orc.compress option when compression is unset`.


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19502
  
**[Test build #82775 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82775/testReport)**
 for PR 19502 at commit 
[`3d18d75`](https://github.com/apache/spark/commit/3d18d751e82791e53cd9f17a528777aaf355390d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144751856
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -180,7 +181,7 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
 // Respect `orc.compress`.
--- End diff --

Shall we change this comment too?


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19218
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82772/
Test FAILed.


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19218
  
Build finished. Test FAILed.


---




[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19218
  
**[Test build #82772 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82772/testReport)**
 for PR 19218 at commit 
[`aa31261`](https://github.com/apache/spark/commit/aa31261860a1042fca471e4cec801353e8a7e2ac).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---




[GitHub] spark issue #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite to test...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19500
  
**[Test build #82779 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82779/testReport)**
 for PR 19500 at commit 
[`934da69`](https://github.com/apache/spark/commit/934da69d4900a2f5eb09c4d88dd9eb1b17cd568e).


---




[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...

2017-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/19459
  
LGTM too but let me leave it to @ueshin just in case.


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r144750565
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala ---
@@ -29,4 +32,19 @@ private[sql] object PythonSQLUtils {
   def listBuiltinFunctionInfos(): Array[ExpressionInfo] = {
     FunctionRegistry.functionSet.flatMap(f => FunctionRegistry.builtin.lookupFunction(f)).toArray
   }
+
+  /**
+   * Python Callable function to convert ArrowPayloads into a [[DataFrame]].
+   *
+   * @param payloadRDD A JavaRDD of ArrowPayloads.
+   * @param schemaString JSON Formatted Schema for ArrowPayloads.
+   * @param sqlContext The active [[SQLContext]].
+   * @return The converted [[DataFrame]].
+   */
+  def arrowPayloadToDataFrame(
+   payloadRDD: JavaRDD[Array[Byte]],
+   schemaString: String,
+   sqlContext: SQLContext): DataFrame = {
--- End diff --

I can't believe I caught this: the indentation looks like 5 spaces instead of 4.
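
For anyone skimming, Spark's Scala style indents multi-line parameter lists by 
4 spaces, so the fixed declaration would look roughly like this (body elided):

    def arrowPayloadToDataFrame(
        payloadRDD: JavaRDD[Array[Byte]],
        schemaString: String,
        sqlContext: SQLContext): DataFrame = {
      // build the DataFrame from the Arrow payloads ...
    }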


---




[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...

2017-10-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19459#discussion_r144750372
  
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -70,12 +70,12 @@ class DataFrame(object):
     .. versionadded:: 1.3
     """
 
-    def __init__(self, jdf, sql_ctx):
+    def __init__(self, jdf, sql_ctx, schema=None):
         self._jdf = jdf
         self.sql_ctx = sql_ctx
         self._sc = sql_ctx and sql_ctx._sc
         self.is_cached = False
-        self._schema = None  # initialized lazily
+        self._schema = schema  # initialized lazily if None
--- End diff --

@BryanCutler, what do you think about taking this back out? Maybe I am worrying 
too much, but I think we should avoid assigning it except in a few special 
cases ...


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19502
  
**[Test build #82778 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82778/testReport)**
 for PR 19502 at commit 
[`a554112`](https://github.com/apache/spark/commit/a554112daa60a130074709c77bcdffbd443a7c2f).


---




[GitHub] spark pull request #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite ...

2017-10-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19500#discussion_r144749492
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
   }
 
   test("test statistics of LogicalRelation converted from Hive serde tables") {
-    val parquetTable = "parquetTable"
-    val orcTable = "orcTable"
-    withTable(parquetTable, orcTable) {
-      sql(s"CREATE TABLE $parquetTable (key STRING, value STRING) STORED AS PARQUET")
-      sql(s"CREATE TABLE $orcTable (key STRING, value STRING) STORED AS ORC")
-      sql(s"INSERT INTO TABLE $parquetTable SELECT * FROM src")
-      sql(s"INSERT INTO TABLE $orcTable SELECT * FROM src")
-
-      // the default value for `spark.sql.hive.convertMetastoreParquet` is true, here we just set it
-      // for robustness
-      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "true") {
-        checkTableStats(parquetTable, hasSizeInBytes = false, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $parquetTable COMPUTE STATISTICS")
-        checkTableStats(parquetTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
-      }
-      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
-        // We still can get tableSize from Hive before Analyze
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $orcTable COMPUTE STATISTICS")
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
+    Seq("orc", "parquet").foreach { format =>
+      Seq("true", "false").foreach { isConverted =>
--- End diff --

Thank you for the review, @gatorsmile. Sure.


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144749283
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala ---
@@ -42,7 +44,7 @@ private[orc] class OrcOptions(
   val compressionCodec: String = {
     // `compression`, `orc.compress`, and `spark.sql.orc.compression.codec` are
     // in order of precedence from highest to lowest.
--- End diff --

I see.


---




[GitHub] spark pull request #19503: Create myspark

2017-10-15 Thread dahaian
GitHub user dahaian opened a pull request:

https://github.com/apache/spark/pull/19503

Create myspark

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dahaian/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19503.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19503


commit a816a24a032944559e1af594c35b9d32d11c83f0
Author: dahaian <735710...@qq.com>
Date:   2017-10-16T03:15:57Z

Create myspark




---




[GitHub] spark pull request #19503: Create myspark

2017-10-15 Thread dahaian
Github user dahaian closed the pull request at:

https://github.com/apache/spark/pull/19503


---




[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

2017-10-15 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19501
  
Thanks @HyukjinKwon @cloud-fan 


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19439
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82773/
Test PASSed.


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19439
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...

2017-10-15 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19501
  
LGTM


---




[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19439
  
**[Test build #82773 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82773/testReport)**
 for PR 19439 at commit 
[`52632cf`](https://github.com/apache/spark/commit/52632cf600303d6a3512efe3e7a1c159ee10f05f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19502
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82771/
Test PASSed.


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19502
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19502
  
**[Test build #82771 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82771/testReport)**
 for PR 19502 at commit 
[`7416e4a`](https://github.com/apache/spark/commit/7416e4ad262c01ac5e52ff0e0e4077c8ad98d7e1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144746184
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala ---
@@ -42,7 +44,7 @@ private[orc] class OrcOptions(
   val compressionCodec: String = {
     // `compression`, `orc.compress`, and `spark.sql.orc.compression.codec` are
     // in order of precedence from highest to lowest.
--- End diff --

`COMPRESS.getAttribute` is not in our code base. Please change it to
> // `compression`, `orc.compress` (i.e., `OrcConf.COMPRESS`), and `spark.sql.orc.compression.codec` are


---




[GitHub] spark pull request #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite ...

2017-10-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19500#discussion_r144745990
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto
   }
 
   test("test statistics of LogicalRelation converted from Hive serde tables") {
-    val parquetTable = "parquetTable"
-    val orcTable = "orcTable"
-    withTable(parquetTable, orcTable) {
-      sql(s"CREATE TABLE $parquetTable (key STRING, value STRING) STORED AS PARQUET")
-      sql(s"CREATE TABLE $orcTable (key STRING, value STRING) STORED AS ORC")
-      sql(s"INSERT INTO TABLE $parquetTable SELECT * FROM src")
-      sql(s"INSERT INTO TABLE $orcTable SELECT * FROM src")
-
-      // the default value for `spark.sql.hive.convertMetastoreParquet` is true, here we just set it
-      // for robustness
-      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "true") {
-        checkTableStats(parquetTable, hasSizeInBytes = false, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $parquetTable COMPUTE STATISTICS")
-        checkTableStats(parquetTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
-      }
-      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
-        // We still can get tableSize from Hive before Analyze
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $orcTable COMPUTE STATISTICS")
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
+    Seq("orc", "parquet").foreach { format =>
+      Seq("true", "false").foreach { isConverted =>
--- End diff --

We prefer using `Seq(true, false)`.
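
A minimal sketch of the suggested shape, reusing the `HiveUtils` keys and the 
`checkTableStats` helper already visible in this diff (table setup elided):

    Seq("orc", "parquet").foreach { format =>
      Seq(true, false).foreach { isConverted =>
        val key = format match {
          case "orc" => HiveUtils.CONVERT_METASTORE_ORC.key
          case "parquet" => HiveUtils.CONVERT_METASTORE_PARQUET.key
        }
        withSQLConf(key -> isConverted.toString) {
          // CREATE TABLE ... STORED AS $format, INSERT, ANALYZE TABLE, then
          // assert with checkTableStats(...) as in the original test body
        }
      }
    }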


---




[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19272
  
**[Test build #82777 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82777/testReport)**
 for PR 19272 at commit 
[`3f22efe`](https://github.com/apache/spark/commit/3f22efe65e5083722dead8f2f3729ed54b8dd530).


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144745832
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala ---
@@ -42,7 +44,7 @@ private[orc] class OrcOptions(
   val compressionCodec: String = {
     // `compression`, `orc.compress`, and `spark.sql.orc.compression.codec` are
     // in order of precedence from highest to lowest.
--- End diff --

Thank you for the review, @gatorsmile.
The name `orc.compress` and the precedence order are unchanged. Which part do 
you want to change?


---




[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode

2017-10-15 Thread ArtRand
Github user ArtRand commented on the issue:

https://github.com/apache/spark/pull/18784
  
@jiangxb1987 yes. I'll work to review this ASAP. 


---




[GitHub] spark issue #19363: [SPARK-22224][SQL]Override toString of KeyValue/Relation...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19363
  
**[Test build #82776 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82776/testReport)**
 for PR 19363 at commit 
[`9166640`](https://github.com/apache/spark/commit/91666403e287015d953c4b105df6bc35dae966c3).


---




[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...

2017-10-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19502#discussion_r144745102
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala ---
@@ -42,7 +44,7 @@ private[orc] class OrcOptions(
   val compressionCodec: String = {
     // `compression`, `orc.compress`, and `spark.sql.orc.compression.codec` are
     // in order of precedence from highest to lowest.
--- End diff --

Please update the comment. 


---




[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-15 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144745063
  
--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -270,6 +270,12 @@ package object config {
     .longConf
     .createWithDefault(4 * 1024 * 1024)
 
+  private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits")
--- End diff --

I'll send a follow-up PR to fix this.


---




[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...

2017-10-15 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19464#discussion_r144745022
  
--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -270,6 +270,12 @@ package object config {
     .longConf
     .createWithDefault(4 * 1024 * 1024)
 
+  private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits")
--- End diff --

This config should be made internal, and the name should be improved, because 
it's not about Spark files.
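
A hedged sketch of what the change could look like; `.internal()` is an 
existing `ConfigBuilder` method, but the new name and the default shown here 
are illustrative only:

    private[spark] val IGNORE_EMPTY_SPLITS =
      ConfigBuilder("spark.hadoopRDD.ignoreEmptySplits")  // illustrative name, not final
        .internal()
        .booleanConf
        .createWithDefault(false)  // default assumed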


---




[GitHub] spark issue #16964: [SPARK-19534][TESTS] Convert Java tests to use lambdas, ...

2017-10-15 Thread dahaian
Github user dahaian commented on the issue:

https://github.com/apache/spark/pull/16964
  
@zzcclp @srowen I have the same error. In JavaConsumerStrategySuite.java, the 
error is as follows:
The method mapValues(Function1) is ambiguous for the type Map
I use Java 8 and did a clean build, but the error still exists. What should I 
do? Any help is appreciated.


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19502
  
**[Test build #82775 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82775/testReport)**
 for PR 19502 at commit 
[`3d18d75`](https://github.com/apache/spark/commit/3d18d751e82791e53cd9f17a528777aaf355390d).


---




[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...

2017-10-15 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19502
  
Thank you, @HyukjinKwon !


---



