[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19497 To clarify, we can look at changing the behavior (if required) in the future - but that should be an explicit design choice informed by the Hadoop committer design. Until then, we should look to interoperate.
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19218 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82784/ Test FAILed.
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19218 **[Test build #82784 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82784/testReport)** for PR 19218 at commit [`dfb36d9`](https://github.com/apache/spark/commit/dfb36d9dd040e19babdb17c4b8396654ebf18787).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19218 Merged build finished. Test FAILed.
[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/19497 @HyukjinKwon My intention was to preserve the earlier behavior. Particularly for non-path-based committers, the `path` variable and its use/processing is not relevant; it makes more sense to ignore that codepath entirely.
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19218 **[Test build #82784 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82784/testReport)** for PR 19218 at commit [`dfb36d9`](https://github.com/apache/spark/commit/dfb36d9dd040e19babdb17c4b8396654ebf18787).
[GitHub] spark pull request #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19500#discussion_r144759023

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton
   }
 
   test("test statistics of LogicalRelation converted from Hive serde tables") {
-    val parquetTable = "parquetTable"
-    val orcTable = "orcTable"
-    withTable(parquetTable, orcTable) {
-      sql(s"CREATE TABLE $parquetTable (key STRING, value STRING) STORED AS PARQUET")
-      sql(s"CREATE TABLE $orcTable (key STRING, value STRING) STORED AS ORC")
-      sql(s"INSERT INTO TABLE $parquetTable SELECT * FROM src")
-      sql(s"INSERT INTO TABLE $orcTable SELECT * FROM src")
-
-      // the default value for `spark.sql.hive.convertMetastoreParquet` is true, here we just set it
-      // for robustness
-      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "true") {
-        checkTableStats(parquetTable, hasSizeInBytes = false, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $parquetTable COMPUTE STATISTICS")
-        checkTableStats(parquetTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
-      }
-      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
-        // We still can get tableSize from Hive before Analyze
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $orcTable COMPUTE STATISTICS")
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
+    Seq("orc", "parquet").foreach { format =>
--- End diff --

Maybe "orcTbl" and "parquetTbl" are better?
[GitHub] spark pull request #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19500#discussion_r144759077

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton
   }
 
   test("test statistics of LogicalRelation converted from Hive serde tables") {
-    val parquetTable = "parquetTable"
-    val orcTable = "orcTable"
-    withTable(parquetTable, orcTable) {
-      sql(s"CREATE TABLE $parquetTable (key STRING, value STRING) STORED AS PARQUET")
-      sql(s"CREATE TABLE $orcTable (key STRING, value STRING) STORED AS ORC")
-      sql(s"INSERT INTO TABLE $parquetTable SELECT * FROM src")
-      sql(s"INSERT INTO TABLE $orcTable SELECT * FROM src")
-
-      // the default value for `spark.sql.hive.convertMetastoreParquet` is true, here we just set it
-      // for robustness
-      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "true") {
-        checkTableStats(parquetTable, hasSizeInBytes = false, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $parquetTable COMPUTE STATISTICS")
-        checkTableStats(parquetTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
-      }
-      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
-        // We still can get tableSize from Hive before Analyze
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None)
--- End diff --

Could you explain why the orc table has a size before the analyze command while the parquet table does not?
[GitHub] spark pull request #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite ...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/19500#discussion_r144759141

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleton
   }
 
   test("test statistics of LogicalRelation converted from Hive serde tables") {
-    val parquetTable = "parquetTable"
-    val orcTable = "orcTable"
-    withTable(parquetTable, orcTable) {
-      sql(s"CREATE TABLE $parquetTable (key STRING, value STRING) STORED AS PARQUET")
-      sql(s"CREATE TABLE $orcTable (key STRING, value STRING) STORED AS ORC")
-      sql(s"INSERT INTO TABLE $parquetTable SELECT * FROM src")
-      sql(s"INSERT INTO TABLE $orcTable SELECT * FROM src")
-
-      // the default value for `spark.sql.hive.convertMetastoreParquet` is true, here we just set it
-      // for robustness
-      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "true") {
-        checkTableStats(parquetTable, hasSizeInBytes = false, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $parquetTable COMPUTE STATISTICS")
-        checkTableStats(parquetTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
-      }
-      withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") {
-        // We still can get tableSize from Hive before Analyze
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None)
-        sql(s"ANALYZE TABLE $orcTable COMPUTE STATISTICS")
-        checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = Some(500))
+    Seq("orc", "parquet").foreach { format =>
+      Seq(true, false).foreach { isConverted =>
+        withSQLConf(
+          HiveUtils.CONVERT_METASTORE_ORC.key -> s"$isConverted",
+          HiveUtils.CONVERT_METASTORE_PARQUET.key -> s"$isConverted") {
+          withTable(format) {
+            sql(s"CREATE TABLE $format (key STRING, value STRING) STORED AS $format")
+            sql(s"INSERT INTO TABLE $format SELECT * FROM src")
+
+            val hasHiveStats = !isConverted
--- End diff --

We can just inline this val, as sketched below.
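A sketch of the inlining wzhfy suggests; the actual use site of `hasHiveStats` is outside the quoted hunk, so the `checkTableStats` call below is an assumed example reusing names from the hunk:

```scala
// Rather than naming the flag first...
// val hasHiveStats = !isConverted
// ...pass the expression directly where it is consumed (assumed call site):
checkTableStats(format, hasSizeInBytes = !isConverted, expectedRowCounts = None)
```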
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19501 thanks, merging to master!
[GitHub] spark issue #19482: [SPARK-22264][DEPLOY] Add timeout for eventlog replaying...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19482 Could you help review this, @vanzin? Thanks.
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19502 Merged build finished. Test PASSed.
[GitHub] spark pull request #19501: [SPARK-22223][SQL] ObjectHashAggregate should not...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19501
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19502 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82778/ Test PASSed.
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19502 **[Test build #82778 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82778/testReport)** for PR 19502 at commit [`a554112`](https://github.com/apache/spark/commit/a554112daa60a130074709c77bcdffbd443a7c2f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19502 **[Test build #82783 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82783/testReport)** for PR 19502 at commit [`02792d9`](https://github.com/apache/spark/commit/02792d953872ed6d718aa9ceb247affee5004b66).
[GitHub] spark pull request #19458: [SPARK-22227][CORE] DiskBlockManager.getAllBlocks...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19458#discussion_r144758165

--- Diff: core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -100,7 +100,16 @@ private[spark] class DiskBlockManager(conf: SparkConf, deleteFilesOnStop: Boolean
   /** List all the blocks currently stored on disk by the disk manager. */
   def getAllBlocks(): Seq[BlockId] = {
-    getAllFiles().map(f => BlockId(f.getName))
+    getAllFiles().flatMap { f =>
+      val blockId = BlockId.guess(f.getName)
--- End diff --

I think we don't need to log here.
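For context, a minimal sketch of the log-free shape cloud-fan is asking for, assuming `BlockId.guess` returns an `Option[BlockId]` (its actual signature in the PR is outside the quoted hunk):

```scala
/** List all the blocks currently stored on disk by the disk manager. */
def getAllBlocks(): Seq[BlockId] = {
  // Files whose names do not parse as block IDs are skipped silently, with no logging.
  getAllFiles().flatMap(f => BlockId.guess(f.getName))
}
```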
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144757908

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
@@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(Row(1, "haa")) :: Nil)
   }
 
+  test("SPARK-22228: from_json should support also arrays of primitive types") {
+    val dfInt = Seq("[1]", "[2, 3]").toDS()
+    checkAnswer(
+      dfInt.select(from_json($"value", ArrayType(IntegerType))),
+      Row(Seq(1)) :: Row(Seq(2, 3)) :: Nil)
+
+    val dfString = Seq("""["hello", "world", ""]""").toDS()
+    checkAnswer(
+      dfString.select(from_json($"value", ArrayType(StringType))),
+      Row(Seq("hello", "world", "")) :: Nil)
+
+    val dfTimestamp = Seq("""["26/08/2015 18:00"]""").toDS()
+    val schema = ArrayType(TimestampType)
+    val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm")
+
+    checkAnswer(
+      dfTimestamp.select(from_json($"value", schema, options)),
+      Row(Seq(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0"))) :: Nil)
+
+    val dfEmpty = Seq("""[]""").toDS()
+    checkAnswer(
+      dfEmpty.select(from_json($"value", ArrayType(StringType))),
+      Row(Nil) :: Nil)
--- End diff --

And also add a case that can fail the parsing, like `Seq("[1]", "[2, 3]", "[string]").toDS()`?
[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19317#discussion_r144757881

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -180,6 +180,56 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * as in scala.TraversableOnce. The former operation is used for merging values within a
    * partition, and the latter is used for merging values between partitions. To avoid memory
    * allocation, both of these functions are allowed to modify and return their first argument
+   * instead of creating a new U. This method is different from the ordinary "aggregateByKey"
+   * method, it directly returns a map to the driver, rather than a rdd. This will also perform
+   * the merging locally on each mapper before sending results to a reducer, similarly to a
--- End diff --

what's the difference between `aggregateByKeyLocally` and `aggregateByKey(...).toLocalIterator`?
[GitHub] spark pull request #18747: [SPARK-20822][SQL] Generate code to directly get ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/18747#discussion_r144757619

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala ---
@@ -23,21 +23,53 @@ import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.plans.QueryPlan
 import org.apache.spark.sql.catalyst.plans.physical.{HashPartitioning, Partitioning}
-import org.apache.spark.sql.execution.LeafExecNode
-import org.apache.spark.sql.execution.metric.SQLMetrics
+import org.apache.spark.sql.execution.{ColumnarBatchScan, LeafExecNode}
+import org.apache.spark.sql.execution.vectorized._
 import org.apache.spark.sql.types.UserDefinedType
 
 case class InMemoryTableScanExec(
     attributes: Seq[Attribute],
     predicates: Seq[Expression],
     @transient relation: InMemoryRelation)
-  extends LeafExecNode {
+  extends LeafExecNode with ColumnarBatchScan {
 
   override protected def innerChildren: Seq[QueryPlan[_]] = Seq(relation) ++ super.innerChildren
 
-  override lazy val metrics = Map(
-    "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))
+  override def vectorTypes: Option[Seq[String]] =
+    Option(Seq.fill(attributes.length)(classOf[OnHeapColumnVector].getName))
+
+  override val columnIndexes =
+    attributes.map(a => relation.output.map(o => o.exprId).indexOf(a.exprId)).toArray
+
+  override val supportCodegen: Boolean = relation.useColumnarBatches
+
+  private def createAndDecompressColumn(cachedColumnarBatch: CachedBatch): ColumnarBatch = {
+    val rowCount = cachedColumnarBatch.numRows
+    val schema = cachedColumnarBatch.schema
+    val columnVectors = OnHeapColumnVector.allocateColumns(rowCount, schema)
+    val columnarBatch = new ColumnarBatch(
+      schema, columnVectors.asInstanceOf[Array[ColumnVector]], rowCount)
+    columnarBatch.setNumRows(rowCount)
+
+    for (i <- 0 until cachedColumnarBatch.buffers.length) {
+      ColumnAccessor.decompress(
+        cachedColumnarBatch.buffers(i), columnarBatch.column(i).asInstanceOf[WritableColumnVector],
+        schema.fields(i).dataType, rowCount)
+    }
+    return columnarBatch
+  }
+
+  override def inputRDDs(): Seq[RDD[InternalRow]] = {
+    if (supportCodegen) {
+      val buffers = relation.cachedColumnBuffers
+      // HACK ALERT: This is actually an RDD[ColumnarBatch].
+      // We're taking advantage of Scala's type erasure here to pass these batches along.
+      Seq(buffers.map(createAndDecompressColumn(_)).asInstanceOf[RDD[InternalRow]])
--- End diff --

I think we need to apply column pruning here, instead of adding `columnIndexes` and asking `ColumnarBatchScan` to do it.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144757530

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
@@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {
       Row(Row(1, "haa")) :: Nil)
   }
 
+  test("SPARK-22228: from_json should support also arrays of primitive types") {
+    val dfInt = Seq("[1]", "[2, 3]").toDS()
--- End diff --

Add a `[]` into the data?
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144757242

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive.orc
 
 import java.io.File
 
+import org.apache.orc.OrcConf.COMPRESS
 import org.scalatest.BeforeAndAfterAll
--- End diff --

Yep!
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144757046

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -343,6 +367,25 @@ class JacksonParser(
       record: T,
       createParser: (JsonFactory, T) => JsonParser,
       recordLiteral: T => UTF8String): Seq[InternalRow] = {
+    parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match {
+      case rows: Seq[InternalRow] => rows
+      case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None,
+        new RuntimeException("Conversion of array of primitive data is not yet supported here."))
--- End diff --

This exception looks a bit weird. How about `` `parse` is only used to parse the JSON input to the set of `InternalRow`s. Use `parseWithArrayOfPrimitiveSupport` when parsing array of primitive data is needed``?
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144756868

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive.orc
 
 import java.io.File
 
+import org.apache.orc.OrcConf.COMPRESS
 import org.scalatest.BeforeAndAfterAll
--- End diff --

I'd just like to be consistent. Thanks. :)
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144756726

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -255,10 +256,7 @@ private[orc] class OrcOutputWriter(
   }
 }
 
-private[orc] object OrcRelation extends HiveInspectors {
-  // The references of Hive's classes will be minimized.
-  val ORC_COMPRESSION = "orc.compress"
--- End diff --

Oh, right, the `parquet`.
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144756632

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive.orc
 
 import java.io.File
 
+import org.apache.orc.OrcConf.COMPRESS
 import org.scalatest.BeforeAndAfterAll
--- End diff --

If you want, no problem at all. :)
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19502 **[Test build #82782 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82782/testReport)** for PR 19502 at commit [`2b0a092`](https://github.com/apache/spark/commit/2b0a09258a35c07905bd1a0bb4e3bf0dba6b5e66).
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144756503

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive.orc
 
 import java.io.File
 
+import org.apache.orc.OrcConf.COMPRESS
 import org.scalatest.BeforeAndAfterAll
--- End diff --

Oh, I was thinking something like `COMPRESS.getAttribute.toUpperCase`.
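A minimal sketch of what viirya's suggestion evaluates to, assuming `OrcConf.COMPRESS.getAttribute` returns the key string `"orc.compress"` (its behavior in Apache ORC):

```scala
import org.apache.orc.OrcConf.COMPRESS

// Derive the upper-cased option key from the ORC constant instead of a string literal,
// which keeps the test tied to the library's own definition of the key.
val compressionKey: String = COMPRESS.getAttribute.toUpperCase  // "ORC.COMPRESS"
```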
[GitHub] spark issue #18747: [SPARK-20822][SQL] Generate code to directly get value f...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/18747 ping @cloud-fan
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144756453

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -255,10 +256,7 @@ private[orc] class OrcOutputWriter(
   }
 }
 
-private[orc] object OrcRelation extends HiveInspectors {
-  // The references of Hive's classes will be minimized.
-  val ORC_COMPRESSION = "orc.compress"
--- End diff --

Yep, I agree. I don't think Apache ORC will change this in the future, since it is a primitive configuration. BTW, thank you for pointing out this doc. I'll fix some typos here.
[GitHub] spark issue #19363: [SPARK-22224][SQL]Override toString of KeyValue/Relation...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19363 Merged build finished. Test PASSed.
[GitHub] spark issue #19363: [SPARK-22224][SQL]Override toString of KeyValue/Relation...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19363 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82776/ Test PASSed.
[GitHub] spark issue #19363: [SPARK-22224][SQL]Override toString of KeyValue/Relation...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19363 **[Test build #82776 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82776/testReport)** for PR 19363 at commit [`9166640`](https://github.com/apache/spark/commit/91666403e287015d953c4b105df6bc35dae966c3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18732 Grouped UDFs, or Grouped Vectorized UDFs.
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19502 **[Test build #82781 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82781/testReport)** for PR 19502 at commit [`b7335ac`](https://github.com/apache/spark/commit/b7335acbf456d123c8af2196e3f923fecc92d5a2).
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/18732 How should we name it? Group-By vectorized UDFs?
[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...
Github user ConeyLiu commented on a diff in the pull request: https://github.com/apache/spark/pull/19317#discussion_r144755666

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -180,6 +180,56 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * as in scala.TraversableOnce. The former operation is used for merging values within a
    * partition, and the latter is used for merging values between partitions. To avoid memory
    * allocation, both of these functions are allowed to modify and return their first argument
+   * instead of creating a new U. This method is different from the ordinary "aggregateByKey"
+   * method, it directly returns a map to the driver, rather than a rdd. This will also perform
+   * the merging locally on each mapper before sending results to a reducer, similarly to a
--- End diff --

Yeah, it will. Here the 'difference' means it directly returns a map to the driver rather than an RDD.
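To make the contrast concrete, a hedged sketch of the two call shapes under discussion; `aggregateByKeyLocally` is the method this PR proposes, and its exact signature here is an assumption based on the quoted scaladoc:

```scala
import org.apache.spark.rdd.RDD

def sumByKey(rdd: RDD[(String, Int)]): Unit = {
  // Existing API: distributed aggregation producing an RDD, then collected to the driver.
  val viaRdd: scala.collection.Map[String, Int] =
    rdd.aggregateByKey(0)(_ + _, _ + _).collectAsMap()

  // Proposed API (assumed signature): merge locally on each mapper, then combine the
  // per-key results directly into a driver-side map, skipping the shuffled RDD.
  // val viaLocal: Map[String, Int] = rdd.aggregateByKeyLocally(0)(_ + _, _ + _)
}
```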
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144755349

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive.orc
 
 import java.io.File
 
+import org.apache.orc.OrcConf.COMPRESS
 import org.scalatest.BeforeAndAfterAll
--- End diff --

This is a case-sensitivity test case. We should not change this.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144755348

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -89,6 +95,24 @@ class JacksonParser(
 
   /**
    * Create a converter which converts the JSON documents held by the `JsonParser`
+   * to a value according to a desired schema. This is an overloaded method to the
+   * previous one which allows to handle array of primitive types.
+   */
+  private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
+    (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
+      case START_ARRAY =>
+        val array = convertArray(parser, makeConverter(at.elementType))
+        if (array.numElements() == 0) {
+          Nil
+        } else {
+          array.toArray(at.elementType).toSeq
+        }
+      case _ => Nil
--- End diff --

Should we return `Nil` when the input is not parsed as an array? The original `makeRootConverter` didn't do this.
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144755323

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -180,7 +181,7 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
     // Respect `orc.compress`.
--- End diff --

Sure. Thank you for the review, @viirya.
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/18732 I'm +0 for now. I'm just wondering whether we can support struct types in vectorized UDFs when needed in the future. As for adding a pandas UDAF, I think we need another decorator or something to specify whether it supports partial aggregation, plus the related parameters if needed.
[GitHub] spark issue #19494: [SPARK-22249][SQL] isin with empty list throws exception...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19494 **[Test build #82780 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82780/testReport)** for PR 19494 at commit [`f5bc105`](https://github.com/apache/spark/commit/f5bc10582f60bf130af1fefed5b9dc9fb3e748d9).
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19501 Merged build finished. Test PASSed.
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19501 **[Test build #82774 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82774/testReport)** for PR 19501 at commit [`45899fd`](https://github.com/apache/spark/commit/45899fd4fd4ee1efe5b75028c8fd9d2453c90aa8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144754632

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
 
 /**
  * Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
  */
-class JacksonParser(
-    schema: StructType,
+private[sql] class JacksonParser(
+    schema: DataType,
     val options: JSONOptions) extends Logging {
 
   import JacksonUtils._
   import com.fasterxml.jackson.core.JsonToken._
 
+  def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
+  def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
+
   // A `ValueConverter` is responsible for converting a value from `JsonParser`
   // to a value in a field for `InternalRow`.
   private type ValueConverter = JsonParser => AnyRef
 
   // `ValueConverter`s for the root schema for all fields in the schema
-  private val rootConverter = makeRootConverter(schema)
+  private val rootConverter = schema match {
+    case s: StructType => makeRootConverter(s)
--- End diff --

It is kind of easy to get confused. Please add a comment to each case, like:

```scala
private val rootConverter = schema match {
  case s: StructType => makeRootConverter(s) // For struct or array of struct.
  case a: ArrayType => makeRootConverter(a) // For array of primitive types.
}
```
[GitHub] spark issue #19494: [SPARK-22249][SQL] isin with empty list throws exception...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19494 ok to test
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144754775

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -89,6 +95,24 @@ class JacksonParser(
 
   /**
    * Create a converter which converts the JSON documents held by the `JsonParser`
+   * to a value according to a desired schema. This is an overloaded method to the
+   * previous one which allows to handle array of primitive types.
+   */
+  private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
+    (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
+      case START_ARRAY =>
+        val array = convertArray(parser, makeConverter(at.elementType))
--- End diff --

Move `makeConverter` outside the inner function so we can reuse it.
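A minimal sketch of the hoisting viirya suggests, reusing names from the quoted hunk, so the element converter is built once per parser rather than once per record:

```scala
private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = {
  // Built once when the root converter is created, then reused for every record.
  val elementConverter = makeConverter(at.elementType)
  (parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) {
    case START_ARRAY =>
      val array = convertArray(parser, elementConverter)
      if (array.numElements() == 0) Nil else array.toArray(at.elementType).toSeq
    case _ => Nil
  }
}
```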
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19501 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82774/ Test PASSed.
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144754134

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
 
 /**
  * Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
 */
-class JacksonParser(
-    schema: StructType,
+private[sql] class JacksonParser(
+    schema: DataType,
     val options: JSONOptions) extends Logging {
 
   import JacksonUtils._
   import com.fasterxml.jackson.core.JsonToken._
 
+  def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options)
+  def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options)
--- End diff --

Are those necessary?
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144754004

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -35,19 +35,25 @@ import org.apache.spark.util.Utils
 
 /**
  * Constructs a parser for a given schema that translates a json string to an [[InternalRow]].
--- End diff --

After this change, it doesn't always return an `InternalRow`.
[GitHub] spark pull request #19317: [SPARK-22098][CORE] Add new method aggregateByKey...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19317#discussion_r144753719

--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -180,6 +180,56 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * as in scala.TraversableOnce. The former operation is used for merging values within a
    * partition, and the latter is used for merging values between partitions. To avoid memory
    * allocation, both of these functions are allowed to modify and return their first argument
+   * instead of creating a new U. This method is different from the ordinary "aggregateByKey"
+   * method, it directly returns a map to the driver, rather than a rdd. This will also perform
+   * the merging locally on each mapper before sending results to a reducer, similarly to a
--- End diff --

doesn't `aggregateByKey` perform map-side combine?
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144753618

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -536,26 +536,31 @@ case class JsonToStructs(
       timeZoneId = None)
 
   override def checkInputDataTypes(): TypeCheckResult = schema match {
-    case _: StructType | ArrayType(_: StructType, _) =>
+    case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
       super.checkInputDataTypes()
     case _ => TypeCheckResult.TypeCheckFailure(
-      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+      s"Input schema ${schema.simpleString} must be a struct or " +
+        s"an array of structs or primitive types.")
   }
 
   @transient
-  lazy val rowSchema = schema match {
+  lazy val rowSchema: DataType = schema match {
--- End diff --

It is not a schema for rows anymore. Maybe `dataSchema`?
[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19363#discussion_r144753514

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala ---
@@ -46,6 +36,20 @@ abstract class QueryTest extends PlanTest {
   Locale.setDefault(Locale.US)
 
   /**
+   * Makes sure the answer matches the expected result
+   */
+  def checkString(expected: String, actual: String): Unit = {
+    if (expected != actual) {
+      fail(
+        "The actual string gives wrong result:\n\n" + sideBySide(
--- End diff --

It's kind of overkill; the `toString` we are testing always returns a single line, so a plain `assert` is enough.
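A sketch of the plain assertion being suggested, assuming the compared values are single-line strings:

```scala
// For single-line strings a direct assertion with a readable message is enough;
// the sideBySide diff rendering only pays off for multi-line output.
assert(actual == expected, s"expected <$expected> but got <$actual>")
```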
[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144753534

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -536,26 +536,31 @@ case class JsonToStructs(
       timeZoneId = None)
 
   override def checkInputDataTypes(): TypeCheckResult = schema match {
-    case _: StructType | ArrayType(_: StructType, _) =>
+    case _: StructType | ArrayType(_: StructType | _: AtomicType, _) =>
       super.checkInputDataTypes()
     case _ => TypeCheckResult.TypeCheckFailure(
-      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+      s"Input schema ${schema.simpleString} must be a struct or " +
+        s"an array of structs or primitive types.")
  }
 
   @transient
-  lazy val rowSchema = schema match {
+  lazy val rowSchema: DataType = schema match {
     case st: StructType => st
     case ArrayType(st: StructType, _) => st
+    case ArrayType(at: AtomicType, _) => ArrayType(at)
   }
 
   // This converts parsed rows to the desired output by the given schema.
   @transient
-  lazy val converter = schema match {
-    case _: StructType =>
-      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
-    case ArrayType(_: StructType, _) =>
-      (rows: Seq[InternalRow]) => new GenericArrayData(rows)
-  }
+  lazy val converter = (rows: Seq[Any]) =>
--- End diff --

This brings extra matching cost at runtime. Can we move the matching outside?
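A hedged sketch of what moving the matching outside could look like for the quoted `converter`, reusing names from the hunk, so the schema is matched once at initialization instead of on every call (a reviewer suggestion, not the PR's final code):

```scala
@transient
lazy val converter: Seq[Any] => Any = schema match {
  case _: StructType =>
    // Matched once here; each invocation only unwraps the single parsed row.
    (rows: Seq[Any]) => if (rows.length == 1) rows.head else null
  case _: ArrayType =>
    // Each invocation only wraps the parsed values, with no per-call pattern match.
    (rows: Seq[Any]) => new GenericArrayData(rows)
}
```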
[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19363#discussion_r144753446

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/RelationalGroupedDatasetSuite.scala ---
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.test.SharedSQLContext
+
+class RelationalGroupedDatasetSuite extends QueryTest with SharedSQLContext {
--- End diff --

We can put the tests in `DatasetSuite`, as `Dataset.toString` is also tested there.
[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19363#discussion_r144753415

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/KeyValueGroupedDatasetSuite.scala ---
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.test.SharedSQLContext
+
+class KeyValueGroupedDatasetSuite extends QueryTest with SharedSQLContext {
--- End diff --

We can put the tests in `DatasetSuite`, as `Dataset.toString` is also tested there.
[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19363#discussion_r144753249

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
@@ -564,4 +564,25 @@ class KeyValueGroupedDataset[K, V] private[sql](
       encoder: Encoder[R]): Dataset[R] = {
     cogroup(other)((key, left, right) => f.call(key, left.asJava, right.asJava).asScala)(encoder)
   }
+
+  override def toString: String = {
+    val builder = new StringBuilder
+    val kFields = kExprEnc.schema.map {
+      case f => s"${f.name}: ${f.dataType.simpleString(2)}"
+    }
+    val vFields = vExprEnc.schema.map {
+      case f => s"${f.name}: ${f.dataType.simpleString(2)}"
+    }
+    builder.append("KeyValueGroupedDataset: [key: [")
+    builder.append(kFields.take(2).mkString(", "))
+    if (kFields.length > 2) {
+      builder.append(" ... " + (kFields.length - 2) + " more field(s)")
+    }
+    builder.append("], value: [")
+    builder.append(vFields.take(2).mkString(", "))
+    if (vFields.length > 2) {
+      builder.append(" ... " + (vFields.length - 2) + " more field(s)")
+    }
+    builder.append("]]").toString()
--- End diff --

shall we include `df` here?
[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19363#discussion_r144753173

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -465,6 +466,19 @@ class RelationalGroupedDataset protected[sql](
     Dataset.ofRows(df.sparkSession, plan)
   }
 
+
+  override def toString: String = {
+    val builder = new StringBuilder
+    builder.append("RelationalGroupedDataset: [groupingBy: [")
+    val kFields = groupingExprs.map(_.asInstanceOf[NamedExpression]).map {
+      case f => s"${f.name}: ${f.dataType.simpleString(2)}"
+    }
+    builder.append(kFields.take(2).mkString(", "))
+    if (kFields.length > 2) {
+      builder.append(" ... " + (kFields.length - 2) + " more field(s)")
+    }
+    builder.append(s"], df: ${df.toString}, type: $groupType]").toString()
--- End diff --

why include `df` here?
[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19363#discussion_r144753122

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -465,6 +466,19 @@ class RelationalGroupedDataset protected[sql](
     Dataset.ofRows(df.sparkSession, plan)
  }
 
+
+  override def toString: String = {
+    val builder = new StringBuilder
+    builder.append("RelationalGroupedDataset: [groupingBy: [")
--- End diff --

`groupingBy` -> `group by`
[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19363#discussion_r144753076

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
@@ -18,13 +18,13 @@ package org.apache.spark.sql
 
 import scala.collection.JavaConverters._
+import scala.util.control.NonFatal
--- End diff --

unnecessary import
[GitHub] spark pull request #19363: [SPARK-22224][SQL]Override toString of KeyValue/R...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19363#discussion_r144753090

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
@@ -21,6 +21,7 @@ import java.util.Locale
 
 import scala.collection.JavaConverters._
 import scala.language.implicitConversions
+import scala.util.control.NonFatal
--- End diff --

ditto
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144752376

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala ---
@@ -255,10 +256,7 @@ private[orc] class OrcOutputWriter(
   }
 }
 
-private[orc] object OrcRelation extends HiveInspectors {
-  // The references of Hive's classes will be minimized.
-  val ORC_COMPRESSION = "orc.compress"
--- End diff --

We have documented `orc.compress` as an option explicitly in https://github.com/apache/spark/blob/d8f45408635d4fccac557cb1e877dfe9267fb326/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L523. Now we depend on the configuration name from an external library, but the configuration name should not change at all, so this looks fine.
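For reference, the documented option in question is passed like this, assuming a live `SparkSession` named `spark` (the output path below is hypothetical; `orc.compress` is the option name cited from `DataFrameWriter`):

```scala
// Write ORC with an explicit codec via the documented `orc.compress` option.
spark.range(10).write
  .option("orc.compress", "SNAPPY")
  .orc("/tmp/example_orc")  // hypothetical output path
```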
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144752155

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala ---
@@ -19,6 +19,7 @@ package org.apache.spark.sql.hive.orc
 
 import java.io.File
 
+import org.apache.orc.OrcConf.COMPRESS
 import org.scalatest.BeforeAndAfterAll
--- End diff --

Another test in `OrcSourceSuite`: https://github.com/apache/spark/blob/d8f45408635d4fccac557cb1e877dfe9267fb326/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcSourceSuite.scala#L153
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19502 Merged build finished. Test PASSed.
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19502 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82775/ Test PASSed.
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144751920

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -180,7 +181,7 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
     // Respect `orc.compress`.
--- End diff --

Also the test name `Respect orc.compress option when compression is unset`.
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19502 **[Test build #82775 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82775/testReport)** for PR 19502 at commit [`3d18d75`](https://github.com/apache/spark/commit/3d18d751e82791e53cd9f17a528777aaf355390d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144751856

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -180,7 +181,7 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
     // Respect `orc.compress`.
--- End diff --

Shall we change this comment too?
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19218 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82772/ Test FAILed.
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19218 Build finished. Test FAILed.
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19218 **[Test build #82772 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82772/testReport)** for PR 19218 at commit [`aa31261`](https://github.com/apache/spark/commit/aa31261860a1042fca471e4cec801353e8a7e2ac).
 * This patch **fails Spark unit tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.
[GitHub] spark issue #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite to test...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19500 **[Test build #82779 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82779/testReport)** for PR 19500 at commit [`934da69`](https://github.com/apache/spark/commit/934da69d4900a2f5eb09c4d88dd9eb1b17cd568e).
[GitHub] spark issue #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark DataFra...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19459 LGTM too but let me leave it to @ueshin just in case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r144750565 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala --- @@ -29,4 +32,19 @@ private[sql] object PythonSQLUtils { def listBuiltinFunctionInfos(): Array[ExpressionInfo] = { FunctionRegistry.functionSet.flatMap(f => FunctionRegistry.builtin.lookupFunction(f)).toArray } + + /** + * Python Callable function to convert ArrowPayloads into a [[DataFrame]]. + * + * @param payloadRDD A JavaRDD of ArrowPayloads. + * @param schemaString JSON Formatted Schema for ArrowPayloads. + * @param sqlContext The active [[SQLContext]]. + * @return The converted [[DataFrame]]. + */ + def arrowPayloadToDataFrame( + payloadRDD: JavaRDD[Array[Byte]], + schemaString: String, + sqlContext: SQLContext): DataFrame = { --- End diff -- I can't believe I caught this: it looks like 5 spaces instead of 4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
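The signature in this diff is enough to sketch how a JVM-side caller would use the new method. The helper below is hypothetical (it only restates the call from the diff, it is not code from the PR), and since `PythonSQLUtils` is `private[sql]`, a real caller would have to live inside the `org.apache.spark.sql` package:

```scala
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.api.python.PythonSQLUtils

// Hypothetical wrapper around the call whose signature appears in the diff.
// In practice the payloads arrive from the Python side as Arrow-serialized
// record batches, and schemaJson is the JSON form of the schema
// (e.g. as produced by StructType.json).
def fromArrowPayloads(
    payloads: JavaRDD[Array[Byte]],
    schemaJson: String,
    sqlContext: SQLContext): DataFrame = {
  PythonSQLUtils.arrowPayloadToDataFrame(payloads, schemaJson, sqlContext)
}
```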
[GitHub] spark pull request #19459: [SPARK-20791][PYSPARK] Use Arrow to create Spark ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19459#discussion_r144750372 --- Diff: python/pyspark/sql/dataframe.py --- @@ -70,12 +70,12 @@ class DataFrame(object): .. versionadded:: 1.3 """ -def __init__(self, jdf, sql_ctx): +def __init__(self, jdf, sql_ctx, schema=None): self._jdf = jdf self.sql_ctx = sql_ctx self._sc = sql_ctx and sql_ctx._sc self.is_cached = False -self._schema = None # initialized lazily +self._schema = schema # initialized lazily if None --- End diff -- @BryanCutler, what do you think about taking this back out? Maybe I am worrying too much, but I think we should avoid assigning it except for a few special cases ... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19502 **[Test build #82778 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82778/testReport)** for PR 19502 at commit [`a554112`](https://github.com/apache/spark/commit/a554112daa60a130074709c77bcdffbd443a7c2f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite ...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19500#discussion_r144749492 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } test("test statistics of LogicalRelation converted from Hive serde tables") { -val parquetTable = "parquetTable" -val orcTable = "orcTable" -withTable(parquetTable, orcTable) { - sql(s"CREATE TABLE $parquetTable (key STRING, value STRING) STORED AS PARQUET") - sql(s"CREATE TABLE $orcTable (key STRING, value STRING) STORED AS ORC") - sql(s"INSERT INTO TABLE $parquetTable SELECT * FROM src") - sql(s"INSERT INTO TABLE $orcTable SELECT * FROM src") - - // the default value for `spark.sql.hive.convertMetastoreParquet` is true, here we just set it - // for robustness - withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "true") { -checkTableStats(parquetTable, hasSizeInBytes = false, expectedRowCounts = None) -sql(s"ANALYZE TABLE $parquetTable COMPUTE STATISTICS") -checkTableStats(parquetTable, hasSizeInBytes = true, expectedRowCounts = Some(500)) - } - withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") { -// We still can get tableSize from Hive before Analyze -checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None) -sql(s"ANALYZE TABLE $orcTable COMPUTE STATISTICS") -checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = Some(500)) +Seq("orc", "parquet").foreach { format => + Seq("true", "false").foreach { isConverted => --- End diff -- Thank you for the review, @gatorsmile. Sure. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144749283 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala --- @@ -42,7 +44,7 @@ private[orc] class OrcOptions( val compressionCodec: String = { // `compression`, `orc.compress`, and `spark.sql.orc.compression.codec` are // in order of precedence from highest to lowest. --- End diff -- I see. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19503: Create myspark
GitHub user dahaian opened a pull request: https://github.com/apache/spark/pull/19503 Create myspark ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dahaian/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19503.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19503 commit a816a24a032944559e1af594c35b9d32d11c83f0 Author: dahaian <735710...@qq.com> Date: 2017-10-16T03:15:57Z Create myspark --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19503: Create myspark
Github user dahaian closed the pull request at: https://github.com/apache/spark/pull/19503 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19501 Thanks @HyukjinKwon @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19439 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82773/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19439 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19501: [SPARK-22223][SQL] ObjectHashAggregate should not introd...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19501 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19439: [SPARK-21866][ML][PySpark] Adding spark image reader
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19439 **[Test build #82773 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82773/testReport)** for PR 19439 at commit [`52632cf`](https://github.com/apache/spark/commit/52632cf600303d6a3512efe3e7a1c159ee10f05f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19502 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82771/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19502 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19502 **[Test build #82771 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82771/testReport)** for PR 19502 at commit [`7416e4a`](https://github.com/apache/spark/commit/7416e4ad262c01ac5e52ff0e0e4077c8ad98d7e1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144746184 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala --- @@ -42,7 +44,7 @@ private[orc] class OrcOptions( val compressionCodec: String = { // `compression`, `orc.compress`, and `spark.sql.orc.compression.codec` are // in order of precedence from highest to lowest. --- End diff -- `COMPRESS.getAttribute` is not in our code base. Please change it to > // `compression`, `orc.compress` (i.e., `OrcConf.COMPRESS`), and `spark.sql.orc.compression.codec` are --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
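To make the documented precedence concrete, here is a minimal sketch of the resolution order the comment describes. It assumes `parameters` is the data source option map and that the session default is read through `SQLConf.getConfString`; the PR's actual code may differ:

```scala
import org.apache.orc.OrcConf

// Sketch only: resolve the codec from highest to lowest precedence.
// OrcConf.COMPRESS.getAttribute resolves to the string "orc.compress".
val codecName: String = parameters
  .get("compression")
  .orElse(parameters.get(OrcConf.COMPRESS.getAttribute))
  .getOrElse(sqlConf.getConfString("spark.sql.orc.compression.codec"))
```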
[GitHub] spark pull request #19500: [SPARK-22280][SQL][TEST] Improve StatisticsSuite ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19500#discussion_r144745990 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala --- @@ -937,26 +937,22 @@ class StatisticsSuite extends StatisticsCollectionTestBase with TestHiveSingleto } test("test statistics of LogicalRelation converted from Hive serde tables") { -val parquetTable = "parquetTable" -val orcTable = "orcTable" -withTable(parquetTable, orcTable) { - sql(s"CREATE TABLE $parquetTable (key STRING, value STRING) STORED AS PARQUET") - sql(s"CREATE TABLE $orcTable (key STRING, value STRING) STORED AS ORC") - sql(s"INSERT INTO TABLE $parquetTable SELECT * FROM src") - sql(s"INSERT INTO TABLE $orcTable SELECT * FROM src") - - // the default value for `spark.sql.hive.convertMetastoreParquet` is true, here we just set it - // for robustness - withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "true") { -checkTableStats(parquetTable, hasSizeInBytes = false, expectedRowCounts = None) -sql(s"ANALYZE TABLE $parquetTable COMPUTE STATISTICS") -checkTableStats(parquetTable, hasSizeInBytes = true, expectedRowCounts = Some(500)) - } - withSQLConf(HiveUtils.CONVERT_METASTORE_ORC.key -> "true") { -// We still can get tableSize from Hive before Analyze -checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = None) -sql(s"ANALYZE TABLE $orcTable COMPUTE STATISTICS") -checkTableStats(orcTable, hasSizeInBytes = true, expectedRowCounts = Some(500)) +Seq("orc", "parquet").foreach { format => + Seq("true", "false").foreach { isConverted => --- End diff -- We prefer using `Seq(true, false)`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
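Applied to the loop in the diff, the suggestion would look roughly like this. `convertMetastoreKey` is a hypothetical helper standing in for the choice between `CONVERT_METASTORE_ORC` and `CONVERT_METASTORE_PARQUET`; the table setup and assertions are elided:

```scala
Seq("orc", "parquet").foreach { format =>
  Seq(true, false).foreach { isConverted =>  // booleans instead of "true"/"false"
    withSQLConf(convertMetastoreKey(format) -> isConverted.toString) {
      // ... CREATE TABLE ... STORED AS $format, INSERT from src, ANALYZE,
      // and the checkTableStats assertions go here ...
    }
  }
}
```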
[GitHub] spark issue #19272: [Spark-21842][Mesos] Support Kerberos ticket renewal and...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19272 **[Test build #82777 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82777/testReport)** for PR 19272 at commit [`3f22efe`](https://github.com/apache/spark/commit/3f22efe65e5083722dead8f2f3729ed54b8dd530). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144745832 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala --- @@ -42,7 +44,7 @@ private[orc] class OrcOptions( val compressionCodec: String = { // `compression`, `orc.compress`, and `spark.sql.orc.compression.codec` are // in order of precedence from highest to lowest. --- End diff -- Thank you for the review, @gatorsmile. The name `orc.compress` and the precedence order are not changed. Which part would you like me to change? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18784: [SPARK-21559][Mesos] remove mesos fine-grained mode
Github user ArtRand commented on the issue: https://github.com/apache/spark/pull/18784 @jiangxb1987 yes. I'll work to review this ASAP. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19363: [SPARK-22224][SQL]Override toString of KeyValue/Relation...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19363 **[Test build #82776 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82776/testReport)** for PR 19363 at commit [`9166640`](https://github.com/apache/spark/commit/91666403e287015d953c4b105df6bc35dae966c3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileF...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19502#discussion_r144745102 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala --- @@ -42,7 +44,7 @@ private[orc] class OrcOptions( val compressionCodec: String = { // `compression`, `orc.compress`, and `spark.sql.orc.compression.codec` are // in order of precedence from highest to lowest. --- End diff -- Please update the comment. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19464#discussion_r144745063 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -270,6 +270,12 @@ package object config { .longConf .createWithDefault(4 * 1024 * 1024) + private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits") --- End diff -- I'll send a follow-up PR to fix this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19464: [SPARK-22233] [core] Allow user to filter out emp...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19464#discussion_r144745022 --- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala --- @@ -270,6 +270,12 @@ package object config { .longConf .createWithDefault(4 * 1024 * 1024) + private[spark] val IGNORE_EMPTY_SPLITS = ConfigBuilder("spark.files.ignoreEmptySplits") --- End diff -- This config should be made internal, and the name should be improved because it's not about spark files. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
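For reference, here is a sketch of what "made internal" means for a Spark config entry: `ConfigBuilder` exposes an `.internal()` marker. The renamed key below is hypothetical, chosen only to illustrate moving the config out of the `spark.files` namespace; the actual follow-up PR may pick a different name:

```scala
private[spark] val IGNORE_EMPTY_SPLITS =
  ConfigBuilder("spark.hadoopRDD.ignoreEmptySplits")  // hypothetical new name
    .internal()  // excluded from user-facing documentation
    .booleanConf
    .createWithDefault(false)
```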
[GitHub] spark issue #16964: [SPARK-19534][TESTS] Convert Java tests to use lambdas, ...
Github user dahaian commented on the issue: https://github.com/apache/spark/pull/16964 @zzcclp @srowen I have the same error. In JavaConsumerStrategySuite.java, the error is as follows: `The method mapValues(Function1) is ambiguous for the type Map`. I use Java 8 and did a clean build, but the error still exists. What should I do? Any help is appreciated. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19502 **[Test build #82775 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82775/testReport)** for PR 19502 at commit [`3d18d75`](https://github.com/apache/spark/commit/3d18d751e82791e53cd9f17a528777aaf355390d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19502: [SPARK-22282][SQL] Rename OrcRelation to OrcFileFormat a...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19502 Thank you, @HyukjinKwon ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org