[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21451 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21739: [SPARK-22187][SS] Update unsaferow format for saved stat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21739 Merged build finished. Test PASSed.
[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21451 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93255/ Test PASSed.
[GitHub] spark issue #21739: [SPARK-22187][SS] Update unsaferow format for saved stat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21739 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93258/ Test PASSed.
[GitHub] spark issue #21451: [SPARK-24296][CORE][WIP] Replicate large blocks as a str...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21451 **[Test build #93255 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93255/testReport)** for PR 21451 at commit [`335e26d`](https://github.com/apache/spark/commit/335e26d168dc99e7317175da8732ff691ff512f2). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class UploadBlockStream extends BlockTransferMessage `
[GitHub] spark issue #21739: [SPARK-22187][SS] Update unsaferow format for saved stat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21739 **[Test build #93258 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93258/testReport)** for PR 21739 at commit [`c262e87`](https://github.com/apache/spark/commit/c262e87afe8736febcb546827f0af22da14a02d9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21774: [SPARK-24811][SQL]Avro: add new function from_avro and t...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21774 Since Spark doesn't have a persistent UDF API like Hive UDF, I think this is the best we can do now. In the future we should migrate this to UDF API so that we can register it with a name and use it in SQL.
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203607394 --- Diff: external/avro/src/test/scala/org/apache/spark/sql/avro/AvroCatalystDataConversionSuite.scala --- @@ -0,0 +1,175 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.avro + +import org.apache.avro.Schema + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.{AvroDataToCatalyst, CatalystDataToAvro, RandomDataGenerator} +import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow} +import org.apache.spark.sql.catalyst.expressions.{ExpressionEvalHelper, GenericInternalRow, Literal} +import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData, MapData} +import org.apache.spark.sql.types._ +import org.apache.spark.unsafe.types.UTF8String + +class AvroCatalystDataConversionSuite extends SparkFunSuite with ExpressionEvalHelper { + + private def roundTripTest(data: Literal): Unit = { +val avroType = SchemaConverters.toAvroType(data.dataType, data.nullable) +checkResult(data, avroType, data.eval()) + } + + private def checkResult(data: Literal, avroType: Schema, expected: Any): Unit = { +checkEvaluation( + AvroDataToCatalyst(CatalystDataToAvro(data), new SerializableSchema(avroType)), + prepareExpectedResult(expected)) + } + + private def assertFail(data: Literal, avroType: Schema): Unit = { +intercept[java.io.EOFException] { + AvroDataToCatalyst(CatalystDataToAvro(data), new SerializableSchema(avroType)).eval() +} + } + + private val testingTypes = Seq( +BooleanType, +ByteType, +ShortType, +IntegerType, +LongType, +FloatType, +DoubleType, +DecimalType(8, 0), // 32 bits decimal without fraction +DecimalType(8, 4), // 32 bits decimal +DecimalType(16, 0), // 64 bits decimal without fraction +DecimalType(16, 11), // 64 bits decimal +DecimalType(38, 0), +DecimalType(38, 38), +StringType, +BinaryType) + + protected def prepareExpectedResult(expected: Any): Any = expected match { +// Spark decimal is converted to avro string= +case d: Decimal => UTF8String.fromString(d.toString) +// Spark byte and short both map to avro int +case b: Byte => b.toInt +case s: Short => s.toInt +case row: GenericInternalRow => 
InternalRow.fromSeq(row.values.map(prepareExpectedResult)) +case array: GenericArrayData => new GenericArrayData(array.array.map(prepareExpectedResult)) +case map: MapData => + val keys = new GenericArrayData( + map.keyArray().asInstanceOf[GenericArrayData].array.map(prepareExpectedResult)) + val values = new GenericArrayData( + map.valueArray().asInstanceOf[GenericArrayData].array.map(prepareExpectedResult)) + new ArrayBasedMapData(keys, values) +case other => other + } + + testingTypes.foreach { dt => +val seed = scala.util.Random.nextLong() +test(s"single $dt with seed $seed") { + val rand = new scala.util.Random(seed) + val data = RandomDataGenerator.forType(dt, rand = rand).get.apply() + val converter = CatalystTypeConverters.createToCatalystConverter(dt) + val input = Literal.create(converter(data), dt) + roundTripTest(input) +} + } + + for (_ <- 1 to 5) { +val seed = scala.util.Random.nextLong() +val rand = new scala.util.Random(seed) +val schema = RandomDataGenerator.randomSchema(rand, 5, testingTypes) +test(s"flat schema ${schema.catalogString} with seed $seed") { + val data = RandomDataGenerator.randomRow(rand, schema) + val converter = CatalystTypeConverters.createToCatalystConverter(schema) + val input = Literal.create(converter(data), schema) +
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203606947 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/package.scala --- @@ -36,4 +40,27 @@ package object avro { @scala.annotation.varargs def avro(sources: String*): DataFrame = reader.format("avro").load(sources: _*) } + --- End diff -- because avro data source is an external package like kafka data source. It's not available in `org.apache.spark.sql.functions`
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203606818 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/CatalystDataToAvro.scala --- @@ -0,0 +1,62 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import java.io.ByteArrayOutputStream + +import org.apache.avro.generic.GenericDatumWriter +import org.apache.avro.io.{BinaryEncoder, EncoderFactory} + +import org.apache.spark.sql.avro.{AvroSerializer, SchemaConverters} +import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression} +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback +import org.apache.spark.sql.types.{BinaryType, DataType} + +case class CatalystDataToAvro(child: Expression) extends UnaryExpression with CodegenFallback { + + override lazy val dataType: DataType = BinaryType --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203606783 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.avro.generic.GenericDatumReader +import org.apache.avro.io.{BinaryDecoder, DecoderFactory} + +import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema} +import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback +import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType} + +case class AvroDataToCatalyst(child: Expression, avroType: SerializableSchema) + extends UnaryExpression with CodegenFallback with ExpectsInputTypes { + + override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType) + + override lazy val dataType: DataType = --- End diff -- the `dataType` is needed in executor side to build `AvroDeserializer`, it's better to serialize it instead of recomputing it at executor side. 
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203606677 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.avro.generic.GenericDatumReader +import org.apache.avro.io.{BinaryDecoder, DecoderFactory} + +import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema} +import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback +import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType} + +case class AvroDataToCatalyst(child: Expression, avroType: SerializableSchema) + extends UnaryExpression with CodegenFallback with ExpectsInputTypes { --- End diff -- good point. Since the implementation is short, I think it should be easy to codegen it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21774: [SPARK-24811][SQL]Avro: add new function from_avr...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21774#discussion_r203606284 --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala --- @@ -0,0 +1,58 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.avro.generic.GenericDatumReader +import org.apache.avro.io.{BinaryDecoder, DecoderFactory} + +import org.apache.spark.sql.avro.{AvroDeserializer, SchemaConverters, SerializableSchema} +import org.apache.spark.sql.catalyst.expressions.{ExpectsInputTypes, Expression, UnaryExpression} +import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback +import org.apache.spark.sql.types.{AbstractDataType, BinaryType, DataType} + --- End diff -- This is not a function expression like the ones in SQL core, so `ExpressionDescription` can't apply here. I think we can leave it for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20838 Merged build finished. Test PASSed.
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20838 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93253/ Test PASSed.
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20838 **[Test build #93253 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93253/testReport)** for PR 20838 at commit [`2c4f15c`](https://github.com/apache/spark/commit/2c4f15c13efa8b181c8c53bd9a90f4f578a40169). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21700: [SPARK-24717][SS] Split out max retain version of state ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21700 Merged build finished. Test PASSed.
[GitHub] spark issue #21700: [SPARK-24717][SS] Split out max retain version of state ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21700 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93256/ Test PASSed.
[GitHub] spark issue #21700: [SPARK-24717][SS] Split out max retain version of state ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21700 **[Test build #93256 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93256/testReport)** for PR 21700 at commit [`cf78a2a`](https://github.com/apache/spark/commit/cf78a2a25791a683c0ee36b08bdc79edd54f212a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21782: [SPARK-24816][SQL] SQL interface support repartit...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21782#discussion_r203604973 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -394,6 +394,41 @@ class FilterPushdownBenchmark extends SparkFunSuite with BenchmarkBeforeAndAfter } } } + + ignore("Pushdown benchmark for RANGE PARTITION BY/DISTRIBUTE BY") { --- End diff -- how is this related to pushdown?
[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21469 Merged build finished. Test PASSed.
[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21469 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93257/ Test PASSed.
[GitHub] spark issue #21469: [SPARK-24441][SS] Expose total estimated size of states ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21469 **[Test build #93257 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93257/testReport)** for PR 21469 at commit [`32d0418`](https://github.com/apache/spark/commit/32d041878b7dcd20794250853063dab4bac09118). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21813: [SPARK 24424][SQL] Support ANSI-SQL compliant syntax for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21813 **[Test build #93260 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93260/testReport)** for PR 21813 at commit [`b5ada3f`](https://github.com/apache/spark/commit/b5ada3feb7d243859714c04ec4fb8c225c1781e0).
[GitHub] spark issue #21813: [SPARK 24424][SQL] Support ANSI-SQL compliant syntax for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21813 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified// Test PASSed.
[GitHub] spark issue #21813: [SPARK 24424][SQL] Support ANSI-SQL compliant syntax for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21813 Merged build finished. Test PASSed.
[GitHub] spark pull request #21813: [SPARK 24424] Support ANSI-SQL compliant syntax f...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/21813

[SPARK 24424] Support ANSI-SQL compliant syntax for GROUPING SET

## What changes were proposed in this pull request?

Enhances the parser and analyzer to support ANSI compliant syntax for GROUPING SET. As part of this change we derive the grouping expressions from user supplied groupings in the grouping sets clause.

```SQL
SELECT c1, c2, max(c3)
FROM t1
GROUP BY GROUPING SETS ((c1), (c1, c2))
```

## How was this patch tested?

Added tests in SQLQueryTestSuite and ResolveGroupingAnalyticsSuite.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dilipbiswal/spark spark-24424

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21813.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21813

commit b5ada3feb7d243859714c04ec4fb8c225c1781e0
Author: Dilip Biswal
Date: 2018-07-19T05:12:33Z

[SPARK-24424] Support ANSI-SQL compliant syntax for GROUPING SET
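The derivation described in this pull request can be illustrated outside Spark. The following is a minimal sketch in plain Scala with no Spark dependency. It assumes rows are modelled by a hypothetical `Row` case class and that the grouping expressions are simply the distinct union of the columns named in the grouping sets; it illustrates the query semantics only and is not the parser or analyzer change itself.

```scala
object GroupingSetsSketch {
  // Hypothetical stand-in for a row of table t1(c1, c2, c3).
  final case class Row(c1: String, c2: String, c3: Int)

  // Derive the grouping expressions from the user-supplied grouping sets:
  // the distinct union of all referenced columns, in first-seen order.
  def deriveGroupingExprs(groupingSets: Seq[Seq[String]]): Seq[String] =
    groupingSets.flatten.distinct

  // Evaluate SELECT c1, c2, max(c3) ... GROUP BY GROUPING SETS (groupingSets).
  // A column that is absent from a given grouping set is reported as None (NULL).
  def maxC3(
      rows: Seq[Row],
      groupingSets: Seq[Seq[String]]): Seq[(Option[String], Option[String], Int)] =
    groupingSets.flatMap { set =>
      rows
        .groupBy { r =>
          (if (set.contains("c1")) Some(r.c1) else None,
           if (set.contains("c2")) Some(r.c2) else None)
        }
        .toSeq
        .map { case ((k1, k2), group) => (k1, k2, group.map(_.c3).max) }
    }
}
```

For the grouping sets `((c1), (c1, c2))` from the example query, `deriveGroupingExprs` yields `c1, c2`, and `maxC3` produces one block of results per grouping set, with `c2` empty in the rows that come from the `(c1)` set.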
[GitHub] spark issue #21732: [SPARK-24762][SQL] Aggregator should be able to use Opti...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21732 ping @cloud-fan
[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/21698

> > checkpoint can not guarantee that you shall always get the same output ...
>
> IIRC we can checkpoint to HDFS? Then it becomes reliable.

Sure, thanks for clarifying that.
[GitHub] spark issue #21758: [SPARK-24795][CORE] Implement barrier execution mode
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/21758 @mridulm Sorry, I missed that message. I've now updated the comment, so we can continue the discussion on that thread.
[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21698 > checkpoint can not guarantee that you shall always get the same output ... IIRC we can checkpoint to HDFS? Then it becomes reliable.
[GitHub] spark issue #20856: [SPARK-23731][SQL] FileSourceScanExec throws NullPointer...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20856 @HyukjinKwon good analysis! Currently Spark is a little messy about what gets serialized and sent to executors. Sometimes we send an entire query tree but only read a few of its properties. It seems to me it would be better to always do codegen on the driver side, to avoid complex expression/plan operations on the executor side (not sure if that's possible, cc @viirya @rednaxelafx @kiszk). For this particular problem, I think we can just change these `val`s to `lazy val` or `def` in `FileSourceScanExec` and add a unit test.
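The `val` versus `lazy val` suggestion can be illustrated with a minimal sketch in plain Scala. The names `Relation`, `EagerScan` and `LazyScan` are hypothetical and far simpler than `FileSourceScanExec`; the sketch assumes the failure mode is a field initializer dereferencing a reference that happens to be null when the object is constructed.

```scala
import scala.util.Try

object LazyValSketch {
  // Hypothetical stand-in for state that may be missing (null) on the executor side.
  final case class Relation(name: String)

  // Eager: the initializer runs inside the constructor, so a null
  // `relation` fails immediately, even if `description` is never read.
  class EagerScan(relation: Relation) {
    val description: String = s"scan of ${relation.name}"
  }

  // Lazy: the initializer runs on first access, so constructing the
  // object with a null `relation` is harmless as long as nobody reads it.
  class LazyScan(relation: Relation) {
    lazy val description: String = s"scan of ${relation.name}"
  }

  def eagerConstructionFails: Boolean = Try(new EagerScan(null)).isFailure
  def lazyConstructionSucceeds: Boolean = Try(new LazyScan(null)).isSuccess
}
```

A `def` behaves like the lazy variant with respect to construction, at the cost of recomputing the value on every call.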
[GitHub] spark issue #21698: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/21698

> I see some discussion about making shuffles deterministic, but it proved to be very difficult. Is there a prior discussion on this you can point me to? Is it that even if you used fetch-to-disk and had the shuffle-fetch side read the map-output in a set order, you'd still have random variations in spills?

Related discussion can be found at https://github.com/apache/spark/pull/20414 . Also, let me list some of the scenarios that might generate non-deterministic row ordering:

**Random Shuffle Blocks Fetch**
We randomize the ordering of block fetches on purpose, to avoid all the nodes fetching from the same node at the same time. That means we send out multiple concurrent fetch requests, and the fetched blocks are processed in FIFO order. Therefore, the row ordering of the shuffle output is non-deterministic.

**Shuffle Merge With Spills**
A shuffle that uses an Aggregator (for instance, combineByKey) uses ExternalAppendOnlyMap to combine the values. ExternalAppendOnlyMap claims that it keeps the row order, but it actually uses the hash to compare the elements (i.e., HashComparator). Even though the sort algorithm is stable, the map sizes can differ when spilling happens, and the requests for additional memory might arrive in different orders. The spilling can therefore be non-deterministic, and so the resulting order can still be non-deterministic.

**Read From External Data Source**
Some external data sources might return rows in a different order on different read requests.

> since we only need to do this sort on RDDs post shuffle

IIUC this is not the case in RDD.repartition(); see https://github.com/apache/spark/blob/94c67a76ec1fda908a671a47a2a1fa63b3ab1b06/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L453~L461 . It requires that the input rows be ordered and then performs a round-robin style data transformation, so I don't see what we can do if the input data type is not sortable.

> on a fetch-failure in repartition, fail the entire job

Currently I can't think of a case where a customer would vote for this behavior change, especially since FetchFailure tends to occur more often on long-running jobs over big datasets than on interactive queries.

> We could add logic to detect whether even an order-dependent operation was safe to retry -- eg. repartition just after reading from hdfs or after a checkpoint can be done as it is now. Each stage would need to know this based on extra properties of all the RDDs it was defined from.

This is something I'm also trying to figure out: we should enable users to tell Spark that an RDD will generate deterministic output, so that you don't have to worry about data correctness issues with these RDDs. Please also note that checkpoint can not guarantee that you shall always get the same output on each read operation, because you may hit executorLost and then have to recompute the partitions, and thus may fetch different data.

> Honestly I consider this bug so serious I'd consider loudly warning from every api which suffers from this if we can't fix -- make them deprecated and log a warning.

We should definitely update the comments, but should we deprecate the APIs? I can't say whether I agree or disagree. I'm still trying to extend the current approach to ensure data correctness, and the code changes will be guarded by a flag. Maybe we can revisit the proposal to deprecate the APIs after that?
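To make the round-robin concern in the comment above concrete, here is a minimal sketch in plain Scala. It is not Spark's implementation: `roundRobin` and `sortedRoundRobin` are hypothetical helpers that deal rows to partitions purely by position, which is enough to show that the same set of rows arriving in a different order lands in different partitions unless the rows are sorted first.

```scala
object RoundRobinSketch {
  // Deal rows to `numPartitions` buckets purely by their position in the input.
  def roundRobin[T](rows: Seq[T], numPartitions: Int): Vector[Vector[T]] = {
    val indexed = rows.zipWithIndex
    Vector.tabulate(numPartitions) { p =>
      indexed.collect { case (row, i) if i % numPartitions == p => row }.toVector
    }
  }

  // Sorting first removes the dependence on arrival order,
  // but it is only possible when an Ordering exists for the row type.
  def sortedRoundRobin[T: Ordering](rows: Seq[T], numPartitions: Int): Vector[Vector[T]] =
    roundRobin(rows.sorted, numPartitions)
}
```

With the rows `1, 2, 3, 4` on a first attempt and `2, 1, 4, 3` on a retry, `roundRobin` assigns them to different partitions, while `sortedRoundRobin` gives the same assignment both times. The `Ordering` context bound mirrors the limitation raised in the comment: the approach does not apply when the input data type is not sortable.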
[GitHub] spark pull request #21795: [SPARK-24840][SQL] do not use dummy filter to swi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21795
[GitHub] spark issue #21795: [SPARK-24840][SQL] do not use dummy filter to switch cod...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21795 thanks, merging to master!
[GitHub] spark pull request #21635: [SPARK-24594][YARN] Introducing metrics for YARN
Github user attilapiros commented on a diff in the pull request: https://github.com/apache/spark/pull/21635#discussion_r203594956 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMasterSource.scala --- @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.deploy.yarn + +import com.codahale.metrics.{Gauge, MetricRegistry} + +import org.apache.spark.metrics.source.Source + +private[spark] class ApplicationMasterSource(yarnAllocator: YarnAllocator) extends Source { + + override val sourceName: String = "applicationMaster" --- End diff -- I like the idea to make the metric names more app specific. So I will prepend the app ID to the sourcename. And rerun my test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21589: [SPARK-24591][CORE] Number of cores and executors in the...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/21589 Thank you, @HyukjinKwon. There are a significant number of Spark users who use the Job Scheduler model with a SparkContext shared across many users and many Jobs. Promoting tools and patterns based upon the number of cores or executors that a SparkContext has access to, and encouraging users to create Jobs that try to use all of the available cores, very much leads those users in the wrong direction. As much as possible, the public API should target policy that addresses real user problems (all users, not just a subset), and avoid targeting the particulars of Spark's internal implementation. A `repartition` that is extended to support policy or goal declarations (things along the lines of `repartition(availableCores)`, `repartition(availableDataNodes)`, `repartition(availableExecutors)`, `repartition(unreservedCores)`, etc.), relying upon Spark's internals (with its complete knowledge of the total number of cores and executors, scheduling pool shares, number of reserved Task nodes sought in barrier scheduling, number of active Jobs, Stages, Tasks and Sessions, etc.) may be something that I can get behind. Exposing a couple of current Spark scheduler implementation details in the expectation that some subset of users in some subset of use cases will be able to make correct use of them is not. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21638: [SPARK-22357][CORE] SparkContext.binaryFiles igno...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/21638#discussion_r203589083 --- Diff: core/src/main/scala/org/apache/spark/input/PortableDataStream.scala --- @@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T] def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) { val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES) val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES) -val defaultParallelism = sc.defaultParallelism +val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions) --- End diff -- I mentioned `BinaryFileRDD`, not this method; you can check the code to see how it handles the default value. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
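The one-line change in the diff above matters because `defaultParallelism` is the divisor that turns the total input size into a per-core byte budget. The following is a minimal sketch of that arithmetic, not the code under review: the diff shows only the first three lines of `setMinPartitions`, so the way the remaining values are combined here is an assumption, and `SplitSizeSketch` and `maxSplitSize` are hypothetical names.

```scala
object SplitSizeSketch {
  /** Hypothetical stand-alone version of the split-size arithmetic. */
  def maxSplitSize(
      fileLengths: Seq[Long],
      defaultMaxSplitBytes: Long,
      openCostInBytes: Long,
      defaultParallelism: Int,
      minPartitions: Int): Long = {
    // The change under review: never plan for fewer slots than minPartitions.
    val parallelism = math.max(defaultParallelism, minPartitions)
    // Assumption: each file is padded by the open cost before summing.
    val totalBytes = fileLengths.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / parallelism
    // Assumption: the budget is clamped between the open cost and the configured maximum.
    math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
  }
}
```

Under these assumptions, a larger `minPartitions` lowers the per-core budget and therefore yields smaller splits, which is how the requested minimum would take effect.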
[GitHub] spark issue #21533: [SPARK-24195][Core] Bug fix for local:/ path in SparkCon...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/21533 Please also update the title and PR description, because the proposed solution changed midway through the review. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21455: [SPARK-24093][DStream][Minor]Make some fields of ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21455 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21812: SPARK UI K8S : this parameter's illustration(spar...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21812 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21777: [WIP][SPARK-24498][SQL] Add JDK compiler for runtime cod...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21777 By the way, it seems this PR exceeds the current timeout. Is there any way to temporarily make the timeout longer? Do we always need to configure the timeout on the Jenkins side, as in https://github.com/apache/spark/pull/20222#issuecomment-357004091? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18477: [SPARK-21261][DOCS]SQL Regex document fix
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18477 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21781: [INFRA] Close stale PR
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21781 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21787: [SPARK-24568] Code refactoring for DataType equal...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21787 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21095: [SPARK-23529][K8s] Support mounting hostPath volu...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21095 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19233: [Spark-22008][Streaming]Spark Streaming Dynamic A...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19233 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21240: [SPARK-21274][SQL] Add a new generator function r...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21240 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #12904: [SPARK-15125][SQL] Changing CSV data source mappi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/12904 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20100: [SPARK-22913][SQL] Improved Hive Partition Prunin...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20100 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16910: [SPARK-19575][SQL]Reading from or writing to a hi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16910 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21731: Update example to work locally
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21731 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21453: Test branch to see how Scala 2.11.12 performs
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21453 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #12951: [SPARK-15176][Core] Add maxShares setting to Pool...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/12951 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18918: [SPARK-21707][SQL]Improvement a special case for ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18918 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13143: [SPARK-15359] [Mesos] Mesos dispatcher should han...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13143 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21726: Branch 2.3
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21726 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17422: [SPARK-20087] Attach accumulators / metrics to 'T...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17422 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19510: [SPARK-22292][Mesos] Added spark.mem.max support ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19510 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18268: [SPARK-21054] [SQL] Reset Command support reset s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18268 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20437: [SPARK-23270][Streaming][WEB-UI]FileInputDStream ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20437 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18034: [SPARK-20797][MLLIB]fix LocalLDAModel.save() bug.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18034 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20319: [SPARK-22884][ML][TESTS] ML test for StructuredSt...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20319 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20543: [SPARK-23357][CORE] 'SHOW TABLE EXTENDED LIKE pat...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20543 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19758: [SPARK-3162][MLlib] Local Tree Training Pt 1: Ref...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19758 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19274: [SPARK-22056][Streaming] Add subconcurrency for K...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19274 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17973: [SPARK-20731][SQL] Add ability to change or omit ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17973 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18229: [SPARK-20691][CORE] Difference between Storage Me...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18229 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20304: [SPARK-23139]Read eventLog file with mixed encodi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20304 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21261: [SPARK-24203][core] Add spark.executor.bindAddres...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21261 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20177: [SPARK-22954][SQL] Fix the exception thrown by An...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20177 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17894: [WIP][SPARK-17134][ML] Use level 2 BLAS operation...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17894 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19420: [SPARK-22191] [SQL] Add hive serde example with s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19420 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18125: [SPARK-20891][SQL] Reduce duplicate code typedagg...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18125 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19456: [SPARK] [Scheduler] Configurable default scheduli...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19456 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17092: [SPARK-18450][ML] Scala API Change for LSH AND-am...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17092 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17619: [SPARK-19755][Mesos] Blacklist is always active f...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17619 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14653: [SPARK-10931][PYSPARK][ML] PySpark ML Models shou...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14653 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20090: [SPARK-22907]Clean broadcast garbage when IOExcep...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20090 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21781: [INFRA] Close stale PR
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21781 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21812: SPARK UI K8S : this parameter's illustration(spark.kuber...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21812 @hehuiyuan, please ask a question via a mailing list. See also https://spark.apache.org/community.html --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11005: [SPARK-12506][SPARK-12126][SQL]use CatalystScan for JDBC...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/11005 BTW, data source v2 is also in progress and will allow more pushdowns (see [SPARK-22386](https://issues.apache.org/jira/browse/SPARK-22386)). You might want to take a look. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #11005: [SPARK-12506][SPARK-12126][SQL]use CatalystScan for JDBC...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/11005 Technical reason: It's kind of risky to rely on `CatalystScan` and completely replace the interface. I think some tests were already disabled here. Also, there appear to be potentially better suggestions above. Practical reason: there are too many pending PRs, as you can see. If the author is not responsive and the PR is inactive on review comments, we'd better leave it closed for now; it already seems stuck for a few technical reasons. The author is welcome to reopen and other contributors are welcome to take over. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21804: [SPARK-24268][SQL] Use datatype.catalogString in ...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21804#discussion_r203585703 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala --- @@ -145,7 +145,7 @@ abstract class NumericType extends AtomicType { } -private[sql] object NumericType extends AbstractDataType { +private[spark] object NumericType extends AbstractDataType { --- End diff -- (This is just a question...) Is it OK for some types to be `private[spark]` and others to be `private[sql]`? That policy feels a little inconsistent to me. Since the other components (e.g., `ml` and `mllib`) depend on the `sql` type system, would it be bad to make the modifiers on all of these types `private[spark]` consistently? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21782: [SPARK-24816][SQL] SQL interface support repartitionByRa...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21782 cc @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21638: [SPARK-22357][CORE] SparkContext.binaryFiles ignore minP...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21638 Yea, it's internal to Spark. Might be good to keep it but that concern should be secondary IMHO. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20146 Meanwhile, I will take another look at reducing the time, see whether we can split the test, or, as a last resort, request another time limit increase. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20146: [SPARK-11215][ML] Add multiple columns support to String...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20146 Yeah, that looks like it's due to the time limit. I sometimes see builds hit it. The problem is that it's difficult to increase the build time (see also https://github.com/appveyor/ci/issues/517). I already had it increased from 1 to 1.5 hours, and they seem to encourage splitting the tests when the limit is hit, which is hard in our case because most of the time is spent in the build itself. It probably wouldn't happen often in master branch builds because they allow caching, but the cache does not work in the PR builder. I have usually tried to reduce the time the SparkR tests take (for instance, https://github.com/apache/spark/pull/19816). Judging from the AppVeyor build history (https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/history), I thought we had 20ish minutes left. So my rough guess is that the time limit issue is a temporary problem on the AppVeyor side. Can you close and reopen this PR to retrigger the AppVeyor build? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21799: [SPARK-24852][ML] Update spark.ml to use Instrumentation...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21799 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21799: [SPARK-24852][ML] Update spark.ml to use Instrumentation...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21799 **[Test build #93254 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93254/testReport)** for PR 21799 at commit [`dddccf6`](https://github.com/apache/spark/commit/dddccf6090413867c1be5e8714acd5e463d0970a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21799: [SPARK-24852][ML] Update spark.ml to use Instrumentation...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21799 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93254/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21533: [SPARK-24195][Core] Bug fix for local:/ path in SparkCon...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21533 **[Test build #93259 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93259/testReport)** for PR 21533 at commit [`eb46ccf`](https://github.com/apache/spark/commit/eb46ccfec084c2439a26eee38015381f091fe164). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21533: [SPARK-24195][Core] Bug fix for local:/ path in SparkCon...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21533 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21533: [SPARK-24195][Core] Bug fix for local:/ path in SparkCon...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21533 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1110/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21734: [SPARK-24149][YARN][FOLLOW-UP] Add a config to co...
Github user LantaoJin commented on a diff in the pull request: https://github.com/apache/spark/pull/21734#discussion_r203584220 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala --- @@ -193,8 +193,7 @@ object YarnSparkHadoopUtil { sparkConf: SparkConf, hadoopConf: Configuration): Set[FileSystem] = { val filesystemsToAccess = sparkConf.get(FILESYSTEMS_TO_ACCESS) - .map(new Path(_).getFileSystem(hadoopConf)) - .toSet +val isRequestAllDelegationTokens = filesystemsToAccess.isEmpty --- End diff -- @wangyum spark.yarn.access.hadoopFileSystems could be set with HA. For example: `--conf spark.yarn.access.namenodes=hdfs://cluster1-ha,hdfs://cluster2-ha`, with the following properties in hdfs-site.xml: `dfs.nameservices` set to `cluster1-ha,cluster2-ha`; `dfs.ha.namenodes.cluster1-ha` set to `nn1,nn2`; `dfs.ha.namenodes.cluster2-ha` set to `nn1,nn2`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21812: SPARK UI K8S : this parameter's illustration(spark.kuber...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21812 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21803: [SPARK-24849][SQL] Converting a value of StructTy...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21803#discussion_r203584039 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala --- @@ -436,6 +436,14 @@ object StructType extends AbstractDataType { */ def fromDDL(ddl: String): StructType = CatalystSqlParser.parseTableSchema(ddl) + /** + * Converts a value of StructType to a string in DDL format. For example: + * `StructType(Seq(StructField("a", IntegerType)))` should be converted to `a int` + */ + def toDDL(struct: StructType): String = { +struct.map(field => s"${quoteIdentifier(field.name)} ${field.dataType.sql}").mkString(",") --- End diff -- Can this also handle special characters ('\n', '\t', '\', ...) that need to be escaped? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
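To make the escaping question concrete, here is a minimal self-contained sketch of backtick quoting applied to field names. It is an illustration under assumptions, not the patch itself: `DdlSketch` and `Field` are hypothetical stand-ins for the real types, and the local `quoteIdentifier` assumes the helper wraps the name in backticks and doubles any embedded backtick.

```scala
object DdlSketch {
  // Hypothetical stand-in for a StructField: a column name plus its SQL type name.
  final case class Field(name: String, sqlType: String)

  // Assumed behaviour: wrap the name in backticks and double any embedded backtick.
  def quoteIdentifier(name: String): String =
    "`" + name.replace("`", "``") + "`"

  // Mirrors the shape of the toDDL expression in the diff: "<quoted name> <type>", comma-separated.
  def toDDL(fields: Seq[Field]): String =
    fields.map(f => s"${quoteIdentifier(f.name)} ${f.sqlType}").mkString(",")
}
```

With quoting of this kind, a backtick inside a name is escaped, but a newline or tab is passed through unchanged inside the backticks. Whether the DDL parser accepts such a name on the way back in is exactly what the question above asks, and this sketch does not answer it.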
[GitHub] spark issue #21533: [SPARK-24195][Core] Bug fix for local:/ path in SparkCon...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21533 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org