[GitHub] spark pull request #16979: [SPARK-19617][SS]Fix the race condition when star...
Github user zsxwing closed the pull request at: https://github.com/apache/spark/pull/16979 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16979: [SPARK-19617][SS]Fix the race condition when starting an...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/16979 Thanks. Merging to 2.1.
[GitHub] spark pull request #17023: [SPARK-19695][SQL] Throw an exception if a `colum...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/17023

[SPARK-19695][SQL] Throw an exception if a `columnNameOfCorruptRecord` field violates requirements

## What changes were proposed in this pull request?
This PR follows from #16928 and fixes the JSON behaviour along with the CSV one.

## How was this patch tested?
Added tests in `JsonSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark SPARK-19695

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17023.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17023
[GitHub] spark pull request #17022: Aqp 271
GitHub user ahshahid opened a pull request: https://github.com/apache/spark/pull/17022

Aqp 271

In Spark 2.0, the optimization that represents repeated aggregates by a single aggregate appears to be broken, because `resultId: ExprId` is passed in the constructor of `AggregateExpression`. For a query of the form

    select avg(x), y from tab group by y having avg(x) > 0

ideally only one aggregate should be evaluated. But because the `resultId` passed in the analysis phase differs between the two occurrences, taking `distinct` on the aggregate expressions does not reduce them to one aggregate.

The change implements the `equals`/`hashCode` methods of the aggregate expression explicitly, instead of relying on the methods Scala generates for case classes, and ensures that any dependency on the aggregate that gets removed is correctly rewritten. The test for this bug is present in `CommonBugTest` of aqp.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/SnappyDataInc/spark AQP-271

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17022.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17022
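The effect described in this pull request can be reproduced outside Spark. The sketch below is only an illustration under stated assumptions: `ExprId`, `NaiveAgg` and `SemanticAgg` are hypothetical stand-ins, not Spark's real classes, and the actual patch may differ. It shows why `distinct` keeps two copies of `avg(x)` when a per-occurrence id takes part in case-class equality, and one copy when `equals`/`hashCode` ignore that id.

```scala
// Hypothetical stand-ins for illustration; these are NOT Spark's classes.
final case class ExprId(id: Long)

// Plain case class: the generated equals/hashCode include resultId.
final case class NaiveAgg(func: String, column: String, resultId: ExprId)

// Same fields, but equality deliberately ignores resultId.
final class SemanticAgg(val func: String, val column: String, val resultId: ExprId) {
  override def equals(other: Any): Boolean = other match {
    case that: SemanticAgg => func == that.func && column == that.column
    case _                 => false
  }
  override def hashCode(): Int = (func, column).hashCode()
}

object AggregateDedupSketch {
  // avg(x) occurs twice in the query: once in SELECT, once in HAVING.
  val naive: Seq[NaiveAgg] =
    Seq(NaiveAgg("avg", "x", ExprId(1)), NaiveAgg("avg", "x", ExprId(2)))
  val semantic: Seq[SemanticAgg] =
    Seq(new SemanticAgg("avg", "x", ExprId(1)), new SemanticAgg("avg", "x", ExprId(2)))

  def main(args: Array[String]): Unit = {
    println(naive.distinct.size)    // 2: the differing ids defeat distinct
    println(semantic.distinct.size) // 1: both occurrences collapse
  }
}
```

Once two occurrences collapse into one, anything that referred to the dropped occurrence has to be pointed at the surviving one, which is the rewriting step the description mentions.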
[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16970#discussion_r102380326
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/DeduplicationSuite.scala ---
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import org.scalatest.BeforeAndAfterAll
+
+import org.apache.spark.sql.catalyst.streaming.InternalOutputModes._
+import org.apache.spark.sql.execution.streaming.MemoryStream
+import org.apache.spark.sql.execution.streaming.state.StateStore
+import org.apache.spark.sql.functions._
+
+class DeduplicationSuite extends StateStoreMetricsTest with BeforeAndAfterAll {
+
+  import testImplicits._
+
+  override def afterAll(): Unit = {
+    super.afterAll()
+    StateStore.stop()
+  }
+
+  test("deduplication with all columns") {
+    val inputData = MemoryStream[String]
+    val result = inputData.toDS().dropDuplicates()
+
+    testStream(result, Append)(
+      AddData(inputData, "a"),
+      CheckLastBatch("a"),
+      assertNumStateRows(total = 1, updated = 1),
+      AddData(inputData, "a"),
+      CheckLastBatch(),
+      assertNumStateRows(total = 1, updated = 0),
+      AddData(inputData, "b"),
+      CheckLastBatch("b"),
+      assertNumStateRows(total = 2, updated = 1)
+    )
+  }
+
+  test("deduplication with some columns") {
+    val inputData = MemoryStream[(String, Int)]
+    val result = inputData.toDS().dropDuplicates("_1")
+
+    testStream(result, Append)(
+      AddData(inputData, "a" -> 1),
+      CheckLastBatch("a" -> 1),
+      assertNumStateRows(total = 1, updated = 1),
+      AddData(inputData, "a" -> 2), // Dropped
+      CheckLastBatch(),
+      assertNumStateRows(total = 1, updated = 0),
+      AddData(inputData, "b" -> 1),
+      CheckLastBatch("b" -> 1),
+      assertNumStateRows(total = 2, updated = 1)
+    )
+  }
+
+  test("multiple deduplications") {
+    val inputData = MemoryStream[(String, Int)]
+    val result = inputData.toDS().dropDuplicates().dropDuplicates("_1")
+
+    testStream(result, Append)(
+      AddData(inputData, "a" -> 1),
+      CheckLastBatch("a" -> 1),
+      assertNumStateRows(total = Seq(1L, 1L), updated = Seq(1L, 1L)),
+
+      AddData(inputData, "a" -> 2), // Dropped from the second `dropDuplicates`
+      CheckLastBatch(),
+      assertNumStateRows(total = Seq(1L, 2L), updated = Seq(0L, 1L)),
+
+      AddData(inputData, "b" -> 1),
+      CheckLastBatch("b" -> 1),
+      assertNumStateRows(total = Seq(2L, 3L), updated = Seq(1L, 1L))
+    )
+  }
+
+  test("deduplication with watermark") {
+    val inputData = MemoryStream[Int]
+    val result = inputData.toDS()
+      .withColumn("eventTime", $"value".cast("timestamp"))
+      .withWatermark("eventTime", "10 seconds")
+      .dropDuplicates()
+      .select($"eventTime".cast("long").as[Long])
+
+    testStream(result, Append)(
+      AddData(inputData, (1 to 5).flatMap(_ => (10 to 15)): _*),
+      CheckLastBatch(10 to 15: _*),
+      assertNumStateRows(total = 6, updated = 6),
+
+      AddData(inputData, 25), // Advance watermark to 15 seconds
+      CheckLastBatch(25),
+      assertNumStateRows(total = 7, updated = 1),
+
+      AddData(inputData, 25), // Drop states less than watermark
+      CheckLastBatch(),
+      assertNumStateRows(total = 1, updated = 0),
+
+      AddData(inputData, 10), // Should not emit anything as data less than watermark
+      CheckLastBatch(),
+      assertNumStateRows(total = 1, updated = 0),
+
+      AddData(inputData, 45), // Advance watermark to 35 seconds
+      CheckLastBatch(45),
+      assertNumStateRows(total = 2, updated = 1),
+
+      AddData(inputData, 45), // Drop states less than watermark
[GitHub] spark issue #17013: [SPARK-19666][SQL] Skip a property without getter in Jav...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17013 **[Test build #73258 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73258/testReport)** for PR 17013 at commit [`ed686fa`](https://github.com/apache/spark/commit/ed686fae82fcd8984817615955ff5b2caf24ea08).
[GitHub] spark issue #17013: [SPARK-19666][SQL] Skip a property without getter in Jav...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17013 **[Test build #73257 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73257/testReport)** for PR 17013 at commit [`5808d71`](https://github.com/apache/spark/commit/5808d71c5284ce9fcaaaffb7ada6d91e34e0b29e).
[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16970#discussion_r102379272
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala ---
@@ -321,3 +327,66 @@ case class MapGroupsWithStateExec(
     }
   }
 }
+
+
+/** Physical operator for executing streaming Deduplication. */
+case class DeduplicationExec(
+    keyExpressions: Seq[Attribute],
+    child: SparkPlan,
+    stateId: Option[OperatorStateId] = None,
+    eventTimeWatermark: Option[Long] = None)
+  extends UnaryExecNode with StateStoreWriter with WatermarkSupport {
+
+  /** Distribute by grouping attributes */
+  override def requiredChildDistribution: Seq[Distribution] =
+    ClusteredDistribution(keyExpressions) :: Nil
+
+  override protected def doExecute(): RDD[InternalRow] = {
+    metrics // force lazy init at driver
+
+    child.execute().mapPartitionsWithStateStore(
+      getStateId.checkpointLocation,
+      operatorId = getStateId.operatorId,
+      storeVersion = getStateId.batchId,
+      keyExpressions.toStructType,
+      child.output.toStructType,
+      sqlContext.sessionState,
+      Some(sqlContext.streams.stateStoreCoordinator)) { (store, iter) =>
+      val getKey = GenerateUnsafeProjection.generate(keyExpressions, child.output)
+      val numOutputRows = longMetric("numOutputRows")
+      val numTotalStateRows = longMetric("numTotalStateRows")
+      val numUpdatedStateRows = longMetric("numUpdatedStateRows")
+
+
+      val baseIterator = watermarkPredicate match {
+        case Some(predicate) => iter.filter((row: InternalRow) => !predicate.eval(row))
+        case None => iter
+      }
+
+      while (baseIterator.hasNext) {
+        val row = baseIterator.next().asInstanceOf[UnsafeRow]
+        val key = getKey(row)
+        val value = store.get(key)
+        if (value.isEmpty) {
+          store.put(key.copy(), row.copy())
--- End diff --

How about `UnsafeRow.createFromByteArray(1, 1)` (1 byte array) rather than copying it completely?
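The loop under review keeps a row only when its key is absent from the state store, and the review comment points out that the stored value can be a tiny placeholder because only the key's presence matters. A minimal sketch of that idea in plain Scala follows. It is an assumption-laden illustration, not Spark's implementation: `dedupBatch` and the `mutable.Map[K, Unit]` standing in for the state store are hypothetical names.

```scala
import scala.collection.mutable

object StreamingDedupSketch {
  // Keeps a row only when its key is not yet in `state`. The stored value is a
  // Unit placeholder: for deduplication only the key's presence matters.
  // `state` stands in for the state store and survives across batches.
  def dedupBatch[R, K](state: mutable.Map[K, Unit], batch: Seq[R])(keyOf: R => K): Seq[R] =
    batch.filter { row =>
      val key = keyOf(row)
      val isNew = !state.contains(key)
      if (isNew) state.update(key, ())
      isNew
    }

  def main(args: Array[String]): Unit = {
    val state = mutable.Map.empty[String, Unit]
    println(dedupBatch(state, Seq("a" -> 1))(_._1)) // emitted: key "a" is new
    println(dedupBatch(state, Seq("a" -> 2))(_._1)) // dropped: key "a" already seen
    println(dedupBatch(state, Seq("b" -> 1))(_._1)) // emitted: key "b" is new
    println(state.size)                             // two keys held in state
  }
}
```

The three batches mirror the "deduplication with some columns" test in this pull request, where the second `"a"` row is dropped and the state ends with two rows.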
[GitHub] spark issue #17013: [SPARK-19666][SQL] Improve error message for JavaBean wi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17013 **[Test build #73256 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73256/testReport)** for PR 17013 at commit [`91cee26`](https://github.com/apache/spark/commit/91cee264e936421e33ecfb0cd5d3c3b4a474d4f2).
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102378915
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
               false
             }
           } else {
-            memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes)
+            val memoryMode = level.memoryMode
+            memoryStore.putBytes(blockId, size, memoryMode, () => {
+              if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

But I am still wondering whether we need to copy like this here. Right, it is defensive, but `BlockManager` is private to Spark internals; if none of its callers modify or release the byte buffer passed in, doesn't this defensive copy only cause a performance regression?
[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16970#discussion_r102378877
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -1996,7 +1996,7 @@ class Dataset[T] private[sql](
   def dropDuplicates(colNames: Seq[String]): Dataset[T] = withTypedPlan {
     val resolver = sparkSession.sessionState.analyzer.resolver
     val allColumns = queryExecution.analyzed.output
-    val groupCols = colNames.flatMap { colName =>
+    val groupCols = colNames.toSet.toSeq.flatMap { (colName: String) =>
--- End diff --

Was this a bug with batch queries as well? And what would the result be without this fix?
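To make the changed line concrete, here is a small plain-Scala sketch of what removing repeated names does to the resolved grouping columns. It does not answer the reviewer's question about Spark's behaviour without the fix; `resolve` is a hypothetical helper that only mimics the shape of the `flatMap` in the diff, and the column names are made up.

```scala
object ColumnNameSketch {
  // Hypothetical resolver: maps each requested name to the matching columns.
  def resolve(allColumns: Seq[String], colNames: Seq[String]): Seq[String] =
    colNames.flatMap(name => allColumns.filter(_.equalsIgnoreCase(name)))

  def main(args: Array[String]): Unit = {
    val allColumns = Seq("_1", "_2")
    val requested  = Seq("_1", "_1") // the same name passed twice

    println(resolve(allColumns, requested))              // "_1" is resolved twice
    println(resolve(allColumns, requested.toSet.toSeq))  // "_1" is resolved once
  }
}
```

One design note, offered as an observation rather than a review verdict: `toSet.toSeq` does not promise to keep the caller's ordering for larger inputs, whereas `distinct` on a `Seq` does.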
[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16970#discussion_r102378594
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala ---
@@ -129,6 +156,33 @@ class UnsupportedOperationsSuite extends SparkFunSuite {
     outputMode = Complete,
     expectedMsgs = Seq("(map/flatMap)GroupsWithState"))

+  assertSupportedInStreamingPlan(
+    "mapGroupsWithState - mapGroupsWithState on batch relation inside streaming relation",
+    MapGroupsWithState(null, att, att, Seq(att), Seq(att), att, att, Seq(att), batchRelation),
+    outputMode = Append
+  )
+
+  // Deduplication: Not supported after a streaming aggregation
+  assertSupportedInStreamingPlan(
+    "Deduplication - Deduplication on streaming relation before aggregation",
+    Aggregate(
+      Seq(attributeWithWatermark),
+      aggExprs("c"),
+      Deduplication(Seq(att), streamRelation, streaming = true)),
+    outputMode = Append)
+
+  assertNotSupportedInStreamingPlan(
+    "Deduplication - Deduplication on streaming relation after aggregation",
+    Deduplication(Seq(att), Aggregate(Nil, aggExprs("c"), streamRelation), streaming = true),
+    outputMode = Complete,
+    expectedMsgs = Seq("dropDuplicates"))
+
+  assertSupportedInStreamingPlan(
+    "Deduplication - Deduplication on batch relation inside streaming relation",
--- End diff --

nit: inside a streaming query. sounds weird otherwise.
[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16970#discussion_r102378522
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala ---
@@ -129,6 +156,33 @@ class UnsupportedOperationsSuite extends SparkFunSuite {
     outputMode = Complete,
     expectedMsgs = Seq("(map/flatMap)GroupsWithState"))

+  assertSupportedInStreamingPlan(
+    "mapGroupsWithState - mapGroupsWithState on batch relation inside streaming relation",
+    MapGroupsWithState(null, att, att, Seq(att), Seq(att), att, att, Seq(att), batchRelation),
+    outputMode = Append
+  )
+
+  // Deduplication: Not supported after a streaming aggregation
--- End diff --

nit: Change this comment to just `// Deduplication` to reflect the whole subsection
[GitHub] spark issue #17021: [SPARK-19694][ML] Add missing 'setTopicDistributionCol' ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17021 **[Test build #73255 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73255/testReport)** for PR 17021 at commit [`367a681`](https://github.com/apache/spark/commit/367a681ffa457f906a1f54cced4b8f8219e2f888).
[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16970#discussion_r102378336
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1143,6 +1144,24 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] {
 }

 /**
+ * Replaces logical [[Deduplication]] operator with an [[Aggregate]] operator.
+ */
+object ReplaceDeduplicationWithAggregate extends Rule[LogicalPlan] {
--- End diff --

`ReplaceDeduplicateWithAggregate`. see comment below.
[GitHub] spark pull request #16970: [SPARK-19497][SS]Implement streaming deduplicatio...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/16970#discussion_r102378293
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
@@ -869,3 +869,12 @@ case object OneRowRelation extends LeafNode {
   override def output: Seq[Attribute] = Nil
   override def computeStats(conf: CatalystConf): Statistics = Statistics(sizeInBytes = 1)
 }
+
+/** A logical plan for `dropDuplicates`. */
+case class Deduplication(
--- End diff --

Most names are like "verbs" - aggregate, project, intersect. I think it's best to name this "Deduplicate".
[GitHub] spark issue #16979: [SPARK-19617][SS]Fix the race condition when starting an...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/16979 LGTM.
[GitHub] spark pull request #17021: [SPARK-19694][ML] Add missing 'setTopicDistributi...
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/17021

[SPARK-19694][ML] Add missing 'setTopicDistributionCol' for LDAModel

## What changes were proposed in this pull request?
Add missing 'setTopicDistributionCol' for LDAModel

## How was this patch tested?
existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark lda_outputCol

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17021.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17021

commit 367a681ffa457f906a1f54cced4b8f8219e2f888
Author: Zheng RuiFeng
Date: 2017-02-22T03:39:10Z

    create pr
[GitHub] spark pull request #17004: [SPARK-19670] [SQL] [TEST] Enable Bucketed Table ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17004
[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16928 Merged build finished. Test FAILed.
[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16928 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73250/ Test FAILed.
[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16928 **[Test build #73250 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73250/testReport)** for PR 16928 at commit [`448e6fe`](https://github.com/apache/spark/commit/448e6fe9f20f11c1171bcaeebf27620fd2f93ac3).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17004: [SPARK-19670] [SQL] [TEST] Enable Bucketed Table Reading...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/17004 Thanks! Merging to master.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102377114
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
               false
             }
           } else {
-            memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes)
+            val memoryMode = level.memoryMode
+            memoryStore.putBytes(blockId, size, memoryMode, () => {
+              if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

Oh. nvm. That `duplicate` doesn't actually copy buffer content...
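The point made in this comment is a property of `java.nio.ByteBuffer` itself and can be checked without Spark. The sketch below is a standalone illustration; the helper names `sharedAfterDuplicate` and `deepCopy` are hypothetical and are not taken from the pull request. It shows that a duplicate shares content with the original while keeping its own position, and what a real content copy looks like by contrast.

```scala
import java.nio.ByteBuffer

object DuplicateSketch {
  // Writes through a duplicate and moves the duplicate's position, then
  // reports what the original buffer sees: (byte at index 0, position).
  def sharedAfterDuplicate(): (Byte, Int) = {
    val original = ByteBuffer.allocate(4)
    val dup = original.duplicate()
    dup.put(0, 42.toByte) // absolute write through the duplicate
    dup.position(2)       // affects only the duplicate's position
    (original.get(0), original.position())
  }

  // An actual content copy: later writes to `src` do not affect the result.
  def deepCopy(src: ByteBuffer): ByteBuffer = {
    val copy = ByteBuffer.allocate(src.remaining())
    copy.put(src.duplicate()) // read via a duplicate so src's position is untouched
    copy.flip()
    copy
  }

  def main(args: Array[String]): Unit = {
    println(sharedAfterDuplicate()) // (42,0): content shared, position independent

    val src = ByteBuffer.wrap(Array[Byte](1, 2, 3))
    val copied = deepCopy(src)
    src.put(0, 9.toByte)
    println(copied.get(0))          // 1: the copy kept the old content
  }
}
```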
[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16928 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73249/ Test FAILed.
[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16928 Merged build finished. Test FAILed.
[GitHub] spark issue #17020: [SPARK-19693][SQL] Make the SET mapreduce.job.reduces au...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17020 **[Test build #73254 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73254/testReport)** for PR 17020 at commit [`7948466`](https://github.com/apache/spark/commit/79484664490cb24ae3cf51667758902edfb6b896).
[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16928 **[Test build #73249 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73249/testReport)** for PR 16928 at commit [`8e83522`](https://github.com/apache/spark/commit/8e8352259d9eac1eb4ac973811379af7dfd6ed83).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16499#discussion_r102376897

    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
    @@ -813,7 +813,14 @@ private[spark] class BlockManager(
                 false
               }
             } else {
    -          memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes)
    +          val memoryMode = level.memoryMode
    +          memoryStore.putBytes(blockId, size, memoryMode, () => {
    +            if (memoryMode == MemoryMode.OFF_HEAP) {
    --- End diff --

    No. I meant: do we actually need to make a defensive copy here? All usages of `putBytes` across Spark duplicate the byte buffer before passing it in. Is there a missing case where we should make this defensive copy?
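To make the distinction behind this question concrete, here is a minimal sketch that uses only `java.nio.ByteBuffer`. The object `BufferCopySketch` and its methods are hypothetical names for illustration and are not part of Spark; the sketch only shows that `duplicate()` shares the underlying bytes, while a copy into a newly allocated direct buffer does not.

```scala
import java.nio.ByteBuffer

// Hypothetical helper, not Spark code: contrasts sharing a buffer with copying it.
object BufferCopySketch {
  // Copies the remaining bytes of `src` into a fresh direct (off-heap) buffer.
  // Reading through a duplicate leaves the position and limit of `src` untouched.
  def defensiveCopy(src: ByteBuffer): ByteBuffer = {
    val view = src.duplicate() // independent position/limit, shared content
    val copy = ByteBuffer.allocateDirect(view.remaining())
    copy.put(view)
    copy.flip() // make the copy readable from position 0
    copy
  }

  // Mutates the source after duplicating and copying it, then returns the first
  // byte seen through the duplicate and through the copy, in that order.
  def demo(): (Byte, Byte) = {
    val original = ByteBuffer.wrap(Array[Byte](1, 2, 3))
    val shared = original.duplicate()
    val copied = defensiveCopy(original)
    original.put(0, 9.toByte)
    (shared.get(0), copied.get(0))
  }
}
```

If callers already hand over a buffer that nobody else mutates or disposes, the extra copy only costs memory and time, which seems to be the concern raised in the comment.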
[GitHub] spark pull request #17020: [SPARK-19693][SQL] Make the SET mapreduce.job.red...
GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/17020

    [SPARK-19693][SQL] Make the SET mapreduce.job.reduces automatically converted to spark.sql.shuffle.partitions

    ## What changes were proposed in this pull request?
    Make `SET mapreduce.job.reduces` automatically convert to `spark.sql.shuffle.partitions`, similar to how `SET mapred.reduce.tasks` is handled.

    ## How was this patch tested?
    Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-19693

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17020.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17020

commit c29ca8a5158a5caab72a866fb2043d88683fb44a
Author: Yuming Wang
Date:   2017-02-20T10:57:05Z

    SET mapreduce.job.reduces automatically converted to spark.sql.shuffle.partitions

commit 79484664490cb24ae3cf51667758902edfb6b896
Author: Yuming Wang
Date:   2017-02-22T03:23:12Z

    Make test case for better readability
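The key conversion described in this pull request can be pictured with a small sketch. This is not Spark's actual `SET` command handling; `ShuffleKeyAliasSketch`, `normalizeKey`, and `set` are hypothetical names, and the sketch assumes that both Hadoop-style keys named in the description are simply treated as aliases of `spark.sql.shuffle.partitions`.

```scala
// Hypothetical sketch, not Spark's SET command implementation.
object ShuffleKeyAliasSketch {
  val ShufflePartitionsKey = "spark.sql.shuffle.partitions"

  // Assumption: both Hadoop-style keys are aliases of the Spark key.
  private val aliases = Set("mapred.reduce.tasks", "mapreduce.job.reduces")

  // Returns the key that a `SET key=value` command should actually write to.
  def normalizeKey(key: String): String =
    if (aliases.contains(key)) ShufflePartitionsKey else key

  // Applies a SET to an immutable settings map after normalizing the key.
  def set(conf: Map[String, String], key: String, value: String): Map[String, String] =
    conf + (normalizeKey(key) -> value)
}
```

Under these assumptions, `ShuffleKeyAliasSketch.set(Map.empty, "mapreduce.job.reduces", "10")` stores the value under `spark.sql.shuffle.partitions`, and unrelated keys pass through unchanged.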
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16499#discussion_r102376170

    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
    @@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
               try {
                 replicate(blockId, bytesToReplicate, level, remoteClassTag)
               } finally {
    -            bytesToReplicate.dispose()
    +            if (!level.useOffHeap) {
    --- End diff --

    Sounds good.
[GitHub] spark issue #17014: [SPARK-18608][ML][WIP] Fix double-caching in ML algorith...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17014 **[Test build #73253 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73253/testReport)** for PR 17014 at commit [`a3f3bb6`](https://github.com/apache/spark/commit/a3f3bb66acbbd670369eb9939f73a2968ecc7649).
[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16970 Merged build finished. Test PASSed.
[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16970 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73247/ Test PASSed.
[GitHub] spark issue #16970: [SPARK-19497][SS]Implement streaming deduplication
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16970

**[Test build #73247 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73247/testReport)** for PR 16970 at commit [`78dfdfe`](https://github.com/apache/spark/commit/78dfdfe20b6c7f788e5d289ecc63c325679ccd44).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class Deduplication(`
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16744 Merged build finished. Test PASSed.
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16744 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73244/ Test PASSed.
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16744

**[Test build #73244 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73244/testReport)** for PR 16744 at commit [`d15affb`](https://github.com/apache/spark/commit/d15affb9d263c6201370f40b463b16ffd63c9ca2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17014: [SPARK-18608][ML][WIP] Fix double-caching in ML algorith...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17014 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73252/ Test FAILed.
[GitHub] spark issue #17014: [SPARK-18608][ML][WIP] Fix double-caching in ML algorith...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17014

**[Test build #73252 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73252/testReport)** for PR 17014 at commit [`b81eeb7`](https://github.com/apache/spark/commit/b81eeb7c955fbba941237efa7760e41e190ef9ea).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #17014: [SPARK-18608][ML][WIP] Fix double-caching in ML algorith...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17014 Merged build finished. Test FAILed.
[GitHub] spark pull request #17013: [SPARK-19666][SQL] Improve error message for Java...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17013#discussion_r102374176

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala ---
    @@ -123,7 +123,11 @@ object JavaTypeInference {
             val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
             val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
    --- End diff --

    Ah, sure. Then let me ignore a property without a getter in `JavaTypeInference.inferDataType`, and allow an empty property in `JavaTypeInference.serializerFor`/`deserializerFor`.
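As a rough illustration of "ignore a property without a getter", the sketch below filters bean properties with the standard `java.beans.Introspector` API. `SampleBean` and `BeanPropertySketch.readableProperties` are hypothetical names; this is not the code in `JavaTypeInference`, only an assumption about what such a filter could look like.

```scala
import java.beans.Introspector

// Hypothetical example bean: `name` has a getter, `secret` has only a setter.
class SampleBean {
  private var name: String = ""
  private var secret: String = ""
  def getName: String = name
  def setName(v: String): Unit = { name = v }
  def setSecret(v: String): Unit = { secret = v }
}

object BeanPropertySketch {
  // Names of bean properties that have a getter, excluding the "class"
  // property that every object exposes through getClass.
  def readableProperties(cls: Class[_]): Seq[String] =
    Introspector.getBeanInfo(cls).getPropertyDescriptors.toSeq
      .filterNot(_.getName == "class")
      .filter(_.getReadMethod != null)
      .map(_.getName)
}
```

With this filter, the write-only `secret` property is skipped rather than triggering an error, which matches the behaviour proposed in the comment.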
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16744 Merged build finished. Test FAILed.
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16744 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73245/ Test FAILed.
[GitHub] spark issue #17014: [SPARK-18608][ML][WIP] Fix double-caching in ML algorith...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17014 **[Test build #73252 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73252/testReport)** for PR 17014 at commit [`b81eeb7`](https://github.com/apache/spark/commit/b81eeb7c955fbba941237efa7760e41e190ef9ea).
[GitHub] spark issue #16946: [SPARK-19554][UI,YARN] Allow SHS URL to be used for trac...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16946 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73242/ Test PASSed.
[GitHub] spark issue #16946: [SPARK-19554][UI,YARN] Allow SHS URL to be used for trac...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16946 Merged build finished. Test PASSed.
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16744

**[Test build #73245 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73245/testReport)** for PR 16744 at commit [`b4bf3a8`](https://github.com/apache/spark/commit/b4bf3a8a74eaeb67aee60c73e24c8a69b145006a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #16946: [SPARK-19554][UI,YARN] Allow SHS URL to be used for trac...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16946

**[Test build #73242 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73242/testReport)** for PR 16946 at commit [`5aef8eb`](https://github.com/apache/spark/commit/5aef8eb569a5422be0c3c18906a04e173782c2ba).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #16290: [SPARK-18817] [SPARKR] [SQL] Set default warehous...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16290#discussion_r102373053

    --- Diff: R/pkg/R/sparkR.R ---
    @@ -376,6 +377,12 @@ sparkR.session <- function(
         overrideEnvs(sparkConfigMap, paramMap)
       }
    +  # NOTE(shivaram): Set default warehouse dir to tmpdir to meet CRAN requirements
    +  # See SPARK-18817 for more details
    +  if (!exists("spark.sql.default.warehouse.dir", envir = sparkConfigMap)) {
    --- End diff --

    Actually we can: `SessionState.conf.settings` contains all the user-set entries.
[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16928#discussion_r102372757

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala ---
    @@ -45,24 +46,41 @@ private[csv] class UnivocityParser(
       // A `ValueConverter` is responsible for converting the given value to a desired type.
       private type ValueConverter = String => Any
    +  private val corruptFieldIndex = schema.getFieldIndex(options.columnNameOfCorruptRecord)
    +  corruptFieldIndex.foreach { corrFieldIndex =>
    +    require(schema(corrFieldIndex).dataType == StringType,
    +      "A field for corrupt records must have a string type")
    +    require(schema(corrFieldIndex).nullable, "A field for corrupt must be nullable")
    +  }
    +
    +  private val inputSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord))
    +  CSVUtils.verifySchema(inputSchema)
    +
    +
       private val valueConverters =
    -    schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
    +    inputSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray
       private val parser = new CsvParser(options.asParserSettings)
       private var numMalformedRecords = 0
       private val row = new GenericInternalRow(requiredSchema.length)
    -  private val indexArr: Array[Int] = {
    +  private val indexArr: Array[(Int, Int)] = {
    --- End diff --

    @HyukjinKwon Sorry, my bad. We use the second index here: https://github.com/apache/spark/pull/16928/files#diff-d19881aceddcaa5c60620fdcda99b4c4R202. I added a test for this case in the latest commit.
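The checks in this diff (the corrupt-record column must be a nullable string, and it is excluded from the schema used for parsing) can be sketched without a Spark dependency. `Field` is a hypothetical stand-in for Spark's `StructField`, with the data type reduced to a plain string, and `CorruptFieldSketch.inputSchema` is an illustrative name rather than anything in `UnivocityParser`.

```scala
// Hypothetical stand-in for Spark's StructField so the sketch runs without Spark.
final case class Field(name: String, dataType: String, nullable: Boolean)

object CorruptFieldSketch {
  // Validates the corrupt-record column, if present, and returns the schema
  // without it. That reduced schema is what the value converters would be built from.
  def inputSchema(schema: Seq[Field], corruptColumn: String): Seq[Field] = {
    schema.find(_.name == corruptColumn).foreach { f =>
      require(f.dataType == "string", "A field for corrupt records must have a string type")
      require(f.nullable, "A field for corrupt records must be nullable")
    }
    schema.filter(_.name != corruptColumn)
  }
}
```

A schema whose corrupt-record column is, say, an integer fails fast with an `IllegalArgumentException` from `require`, instead of surfacing later as a parsing error.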
[GitHub] spark issue #16988: [SPARK-19658] [SQL] Set NumPartitions of RepartitionByEx...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16988 How about we set the `numPartitions` when we build `RepartitionByExpression`? The parser can also access the SQLConf.
[GitHub] spark pull request #17013: [SPARK-19666][SQL] Improve error message for Java...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17013#discussion_r102372119

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala ---
    @@ -123,7 +123,11 @@ object JavaTypeInference {
             val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
             val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
    --- End diff --

    I mean `JavaTypeInference`. Why do we throw an exception for a bean property without a getter, instead of simply not treating it as a property?
[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16928 **[Test build #73251 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73251/testReport)** for PR 16928 at commit [`619094a`](https://github.com/apache/spark/commit/619094a4dbb0e400daac0d94905b40df07b650b6).
[GitHub] spark pull request #16979: [SPARK-19617][SS]Fix the race condition when star...
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16979#discussion_r102371334

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala ---
    @@ -63,8 +63,34 @@ class HDFSMetadataLog[T <: AnyRef : ClassTag](sparkSession: SparkSession, path:
       val metadataPath = new Path(path)
       protected val fileManager = createFileManager()
    -  if (!fileManager.exists(metadataPath)) {
    -    fileManager.mkdirs(metadataPath)
    +  runUninterruptiblyIfLocal {
    +    if (!fileManager.exists(metadataPath)) {
    +      fileManager.mkdirs(metadataPath)
    +    }
    +  }
    +
    +  private def runUninterruptiblyIfLocal[T](body: => T): T = {
    +    if (fileManager.isLocalFileSystem && Thread.currentThread.isInstanceOf[UninterruptibleThread]) {
    --- End diff --

    So we are changing this to a best-effort attempt, rather than the try-and-explicitly-fail attempt, in the case of a local file system... right?
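The "best-effort" behaviour asked about here can be sketched as follows. The diff above is cut off after the condition, so the body of the helper is an assumption; `Uninterruptible` is a hypothetical stand-in for Spark's `UninterruptibleThread`, and the current thread is passed in as a parameter only to keep the sketch easy to exercise.

```scala
// Hypothetical stand-in for Spark's UninterruptibleThread; only the shape matters.
trait Uninterruptible {
  def runUninterruptibly[T](body: => T): T
}

object BestEffortSketch {
  // Uses the uninterruptible hook only when the file system is local AND the
  // caller supports it. Otherwise it runs `body` directly instead of failing.
  def runUninterruptiblyIfLocal[T](isLocalFileSystem: Boolean, current: AnyRef)(body: => T): T =
    current match {
      case u: Uninterruptible if isLocalFileSystem => u.runUninterruptibly(body)
      case _ => body
    }
}
```

The alternative "try and explicitly fail" design would throw in the fallback case when the file system is local but the thread cannot run uninterruptibly; the sketch deliberately does not.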
[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16944#discussion_r102371290

    --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala ---
    @@ -690,10 +696,10 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat
                   "different from the schema when this table was created by Spark SQL" +
                   s"(${schemaFromTableProps.simpleString}). We have to fall back to the table schema " +
                   "from Hive metastore which is not case preserving.")
    -            hiveTable
    +            hiveTable.copy(schemaPreservesCase = false)
    --- End diff --

    This is not only about case preservation; maybe we should leave it unchanged.
[GitHub] spark issue #16989: [SPARK-19659] Fetch big blocks to disk when shuffle-read...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/16989 @squito Thanks a lot for your comments :) Yes, there should be a design doc for discussion. I will prepare one and post a PDF to JIRA.
[GitHub] spark issue #16949: [SPARK-16122][CORE] Add rest api for job environment
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/16949 cc @srowen and @vanzin also.
[GitHub] spark pull request #17011: [SPARK-19676][CORE] Flaky test: FsHistoryProvider...
Github user uncleGen closed the pull request at: https://github.com/apache/spark/pull/17011
[GitHub] spark pull request #17013: [SPARK-19666][SQL] Improve error message for Java...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17013#discussion_r102370203

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala ---
    @@ -123,7 +123,11 @@ object JavaTypeInference {
             val beanInfo = Introspector.getBeanInfo(typeToken.getRawType)
             val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class")
    --- End diff --

    I am sorry for the long comment, which may have been a bit confusing. There are two code paths:

    - `Encoders.bean(...)`: non-empty required OK & getter/setter required
    - `JavaTypeInference.inferDataType`: empty one OK & getter only OK

    Could I please ask which case you mean, just to double check?
[GitHub] spark issue #16946: [SPARK-19554][UI,YARN] Allow SHS URL to be used for trac...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/16946 On vacation; back next Monday and will review.
[GitHub] spark issue #17011: [SPARK-19676][CORE] Flaky test: FsHistoryProviderSuite.S...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/17011 Don't run things as root. Please close this PR.
[GitHub] spark issue #17019: [SPARK-19652][UI] Do auth checks for REST API access (br...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17019 Merged build finished. Test FAILed.
[GitHub] spark issue #17019: [SPARK-19652][UI] Do auth checks for REST API access (br...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17019 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73243/ Test FAILed.
[GitHub] spark issue #17019: [SPARK-19652][UI] Do auth checks for REST API access (br...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17019 **[Test build #73243 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73243/testReport)** for PR 17019 at commit [`21006ff`](https://github.com/apache/spark/commit/21006ff4a9d01c6da02b48d6fadac9b3d9a0273a). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16928 **[Test build #73250 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73250/testReport)** for PR 16928 at commit [`448e6fe`](https://github.com/apache/spark/commit/448e6fe9f20f11c1171bcaeebf27620fd2f93ac3).
[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16944#discussion_r102369921 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -181,7 +186,8 @@ case class CatalogTable( viewText: Option[String] = None, comment: Option[String] = None, unsupportedFeatures: Seq[String] = Seq.empty, -tracksPartitionsInCatalog: Boolean = false) { +tracksPartitionsInCatalog: Boolean = false, +schemaPreservesCase: Boolean = true) { --- End diff -- and when we fix the schema and try to write it back, remember to remove this property first.
[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16944#discussion_r102369793 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -181,7 +186,8 @@ case class CatalogTable( viewText: Option[String] = None, comment: Option[String] = None, unsupportedFeatures: Seq[String] = Seq.empty, -tracksPartitionsInCatalog: Boolean = false) { +tracksPartitionsInCatalog: Boolean = false, +schemaPreservesCase: Boolean = true) { --- End diff -- shall we create a special table property? `object CatalogTable` defines some specific properties for views and we can follow that. If we keep adding more parameters, we may blow up `CatalogTable` one day...
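The table-property suggestion in the comment above could look roughly like the following sketch. It uses a stand-in `TableMeta` case class rather than Spark's `CatalogTable`; the key name `spark.sql.schemaPreservesCase` and both helper methods are hypothetical and chosen only for illustration.

```scala
// Stand-in for a catalog table; NOT Spark's CatalogTable.
final case class TableMeta(name: String, properties: Map[String, String] = Map.empty)

object TableMeta {
  // Hypothetical reserved key, chosen only for this illustration.
  val SchemaPreservesCaseKey = "spark.sql.schemaPreservesCase"

  // A missing property is read as `true`, mirroring the default in the diff above.
  def schemaPreservesCase(t: TableMeta): Boolean =
    t.properties.get(SchemaPreservesCaseKey).forall(_.toBoolean)

  // Drop the marker before writing a corrected schema back.
  def withoutMarker(t: TableMeta): TableMeta =
    t.copy(properties = t.properties - SchemaPreservesCaseKey)
}
```

Keeping the flag in the property map leaves the case class signature unchanged, at the cost of string-typed access.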
[GitHub] spark issue #17015: [SPARK-19678][SQL] remove MetastoreRelation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17015 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73246/ Test FAILed.
[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/16928#discussion_r102369500 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -96,31 +96,44 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister { filters: Seq[Filter], options: Map[String, String], hadoopConf: Configuration): (PartitionedFile) => Iterator[InternalRow] = { -val csvOptions = new CSVOptions(options, sparkSession.sessionState.conf.sessionLocalTimeZone) - +CSVUtils.verifySchema(dataSchema) val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) +val parsedOptions = new CSVOptions( + options, + sparkSession.sessionState.conf.sessionLocalTimeZone, + sparkSession.sessionState.conf.columnNameOfCorruptRecord) + +// Check a field requirement for corrupt records here to throw an exception in a driver side +dataSchema.getFieldIndex(parsedOptions.columnNameOfCorruptRecord).map { corruptFieldIndex => + val f = dataSchema(corruptFieldIndex) + if (f.dataType != StringType || !f.nullable) { --- End diff -- I remove the entry in `UnivocityParser`.
[GitHub] spark issue #17015: [SPARK-19678][SQL] remove MetastoreRelation
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17015 Merged build finished. Test FAILed.
[GitHub] spark issue #17015: [SPARK-19678][SQL] remove MetastoreRelation
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17015 **[Test build #73246 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73246/testReport)** for PR 17015 at commit [`b61910e`](https://github.com/apache/spark/commit/b61910e8767a20cfe046bbb0fbf27305fceb7f32). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16928: [SPARK-18699][SQL] Put malformed tokens into a new field...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16928 **[Test build #73249 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73249/testReport)** for PR 16928 at commit [`8e83522`](https://github.com/apache/spark/commit/8e8352259d9eac1eb4ac973811379af7dfd6ed83).
[GitHub] spark issue #17011: [SPARK-19676][CORE] Flaky test: FsHistoryProviderSuite.S...
Github user uncleGen commented on the issue: https://github.com/apache/spark/pull/17011 @srowen @vanzin I think the root cause is that I ran the test as the root user, so the file is always readable no matter what the access permissions are. IMHO, it is OK to add one extra access permission check, as the code scope is catching the AccessControlException. Maybe I need to make the comment clearer. What's your opinion?
[GitHub] spark pull request #16944: [SPARK-19611][SQL] Introduce configurable table s...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16944#discussion_r102369195 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -296,6 +296,21 @@ object SQLConf { .longConf .createWithDefault(250 * 1024 * 1024) + object HiveCaseSensitiveInferenceMode extends Enumeration { --- End diff -- we can follow `PARQUET_COMPRESSION` and write the string literal directly.
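As a rough illustration of the "string literal" approach suggested above, the sketch below validates a string-valued setting against a fixed set of allowed values instead of declaring an `Enumeration`. `StringConf` is a stand-in written for this example, not Spark's config builder, and the key and mode names (`example.caseSensitiveInferenceMode`, `MODE_A`, `MODE_B`) are hypothetical.

```scala
import java.util.Locale

// Stand-in for a validated string config entry; NOT Spark's ConfigBuilder.
final case class StringConf(key: String, allowed: Set[String], default: String) {
  require(allowed.contains(default), s"default '$default' is not an allowed value")

  // Normalise case and whitespace, then reject anything outside the allowed set.
  def parse(raw: Option[String]): String = {
    val value = raw.map(_.trim.toUpperCase(Locale.ROOT)).getOrElse(default)
    require(allowed.contains(value),
      s"$key must be one of ${allowed.mkString(", ")}, but got '$value'")
    value
  }
}

object ConfSketch {
  // Hypothetical key and mode names, used only for illustration.
  val inferenceMode: StringConf = StringConf(
    key = "example.caseSensitiveInferenceMode",
    allowed = Set("MODE_A", "MODE_B"),
    default = "MODE_A")
}
```

Callers then compare against plain strings, so no extra enumeration type has to be exported alongside the config entry.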
[GitHub] spark issue #17014: [SPARK-18608][ML][WIP] Fix double-caching in ML algorith...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/17014 @srowen @hhbyyh You are right. I will update this without breaking `train`. Thanks for pointing it out!
[GitHub] spark pull request #17013: [SPARK-19666][SQL] Improve error message for Java...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17013#discussion_r102368765 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala --- @@ -123,7 +123,11 @@ object JavaTypeInference { val beanInfo = Introspector.getBeanInfo(typeToken.getRawType) val properties = beanInfo.getPropertyDescriptors.filterNot(_.getName == "class") --- End diff -- ideally a property without a getter is not a bean property, let's fix the empty property problem.
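To make the "a property without a getter is not a bean property" point concrete, here is a small sketch that uses only the JDK's `java.beans.Introspector`. `SampleBean` and `BeanProps.readableProperties` are hypothetical names for this example; this is not the change made in the pull request, only one way such a filter might be written.

```scala
import java.beans.Introspector

// A bean with one readable property (`name`) and one write-only property (`secret`).
class SampleBean {
  private var name: String = ""
  private var secret: String = ""
  def getName: String = name
  def setName(value: String): Unit = { name = value }
  def setSecret(value: String): Unit = { secret = value }
}

object BeanProps {
  // Names of the properties that have a getter, ignoring the synthetic `class` property.
  def readableProperties(cls: Class[_]): Set[String] =
    Introspector.getBeanInfo(cls).getPropertyDescriptors
      .filterNot(_.getName == "class")
      .filter(_.getReadMethod != null)
      .map(_.getName)
      .toSet
}
```

`BeanProps.readableProperties(classOf[SampleBean])` keeps `name` and drops the write-only `secret`, because `getReadMethod` returns `null` for a property that only has a setter.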
[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/16928#discussion_r102367846 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -147,8 +165,6 @@ private[csv] class UnivocityParser( case udt: UserDefinedType[_] => (datum: String) => makeConverter(name, udt.sqlType, nullable, options) - -case _ => throw new RuntimeException(s"Unsupported type: ${dataType.typeName}") --- End diff -- okay
[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16928#discussion_r102367607 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -189,9 +205,10 @@ private[csv] class UnivocityParser( } } - private def convertWithParseMode( - tokens: Array[String])(convert: Array[String] => InternalRow): Option[InternalRow] = { -if (options.dropMalformed && schema.length != tokens.length) { + private def parseWithMode(input: String)(convert: Array[String] => InternalRow) +: Option[InternalRow] = { --- End diff -- Oh, one more: it seems this fits within the 100-character line length limit? ```scala private def convertWithParseMode( input: String)(convert: Array[String] => InternalRow): Option[InternalRow] = { ```
[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16928#discussion_r102367601 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -147,8 +165,6 @@ private[csv] class UnivocityParser( case udt: UserDefinedType[_] => (datum: String) => makeConverter(name, udt.sqlType, nullable, options) - -case _ => throw new RuntimeException(s"Unsupported type: ${dataType.typeName}") --- End diff -- I think we should keep it, although theoretically we won't hit it.
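The effect of dropping the catch-all case can be seen in a small stand-alone sketch. The `FieldType` hierarchy and the `Converters` object below are stand-ins invented for this example, not Spark's `DataType` or `UnivocityParser`.

```scala
// Stand-in type hierarchy; NOT Spark's DataType.
trait FieldType { def typeName: String }
case object IntField extends FieldType { val typeName = "integer" }
case object StrField extends FieldType { val typeName = "string" }
case object MapField extends FieldType { val typeName = "map" }

object Converters {
  // Without a default case, an unhandled type fails with scala.MatchError.
  def withoutDefault(t: FieldType): String => Any = t match {
    case IntField => (s: String) => s.toInt
    case StrField => (s: String) => s
  }

  // With a default case, the failure carries a descriptive message.
  def withDefault(t: FieldType): String => Any = t match {
    case IntField => (s: String) => s.toInt
    case StrField => (s: String) => s
    case _ => throw new RuntimeException(s"Unsupported type: ${t.typeName}")
  }
}
```

Both variants fail on `MapField`; the difference is only whether the caller sees a bare `MatchError` or an explicit "Unsupported type" message.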
[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16928#discussion_r102366629 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -45,24 +46,41 @@ private[csv] class UnivocityParser( // A `ValueConverter` is responsible for converting the given value to a desired type. private type ValueConverter = String => Any + private val corruptFieldIndex = schema.getFieldIndex(options.columnNameOfCorruptRecord) + corruptFieldIndex.foreach { corrFieldIndex => +require(schema(corrFieldIndex).dataType == StringType, + "A field for corrupt records must have a string type") +require(schema(corrFieldIndex).nullable, "A field for corrupt must be nullable") + } + + private val inputSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord)) + CSVUtils.verifySchema(inputSchema) + private val valueConverters = -schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray +inputSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray private val parser = new CsvParser(options.asParserSettings) private var numMalformedRecords = 0 private val row = new GenericInternalRow(requiredSchema.length) - private val indexArr: Array[Int] = { + private val indexArr: Array[(Int, Int)] = { --- End diff -- Actually, it seems we don't need both? It seems the second one is not used.
[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16928#discussion_r102366771 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -45,24 +46,41 @@ private[csv] class UnivocityParser( // A `ValueConverter` is responsible for converting the given value to a desired type. private type ValueConverter = String => Any + private val corruptFieldIndex = schema.getFieldIndex(options.columnNameOfCorruptRecord) + corruptFieldIndex.foreach { corrFieldIndex => +require(schema(corrFieldIndex).dataType == StringType, + "A field for corrupt records must have a string type") +require(schema(corrFieldIndex).nullable, "A field for corrupt must be nullable") + } + + private val inputSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord)) + CSVUtils.verifySchema(inputSchema) --- End diff -- It seems `verifySchema` is being called twice (https://github.com/apache/spark/pull/16928/files/4eed4a4bea927c6365dac2e2734c895a0a1a0026#diff-a549ac2e19ee7486911e2e6403444d9dR99)
[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16928#discussion_r102366908 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala --- @@ -96,31 +96,44 @@ class CSVFileFormat extends TextBasedFileFormat with DataSourceRegister { filters: Seq[Filter], options: Map[String, String], hadoopConf: Configuration): (PartitionedFile) => Iterator[InternalRow] = { -val csvOptions = new CSVOptions(options, sparkSession.sessionState.conf.sessionLocalTimeZone) - +CSVUtils.verifySchema(dataSchema) val broadcastedHadoopConf = sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) +val parsedOptions = new CSVOptions( + options, + sparkSession.sessionState.conf.sessionLocalTimeZone, + sparkSession.sessionState.conf.columnNameOfCorruptRecord) + +// Check a field requirement for corrupt records here to throw an exception in a driver side +dataSchema.getFieldIndex(parsedOptions.columnNameOfCorruptRecord).map { corruptFieldIndex => + val f = dataSchema(corruptFieldIndex) + if (f.dataType != StringType || !f.nullable) { --- End diff -- This check seems to duplicate https://github.com/apache/spark/pull/16928/files/4eed4a4bea927c6365dac2e2734c895a0a1a0026#diff-d19881aceddcaa5c60620fdcda99b4c4R51
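One way to avoid spelling out the same requirement in two places is to keep it in a single helper that the driver-side code calls once. The sketch below uses a stand-in `Field` case class rather than Spark's `StructField`; `CorruptFieldCheck.verify` is a hypothetical helper, and the column name `_corrupt_record` in the usage note is used only for illustration.

```scala
// Stand-in for a schema field; NOT Spark's StructField.
final case class Field(name: String, dataType: String, nullable: Boolean)

object CorruptFieldCheck {
  // Fails fast if the corrupt-record column exists but is not a nullable string.
  // A schema without that column passes unchanged.
  def verify(schema: Seq[Field], corruptColumn: String): Unit =
    schema.find(_.name == corruptColumn).foreach { f =>
      require(f.dataType == "string" && f.nullable,
        s"The field for corrupt records must be string type and nullable, but got: $f")
    }
}
```

For example, `CorruptFieldCheck.verify(Seq(Field("_corrupt_record", "int", nullable = true)), "_corrupt_record")` throws an `IllegalArgumentException`, while a schema that lacks the column is accepted.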
[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/16928#discussion_r102366449 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -147,8 +165,6 @@ private[csv] class UnivocityParser( case udt: UserDefinedType[_] => (datum: String) => makeConverter(name, udt.sqlType, nullable, options) - -case _ => throw new RuntimeException(s"Unsupported type: ${dataType.typeName}") --- End diff -- Hm, wouldn't removing it cause `MatchError`?
[GitHub] spark issue #17001: [SPARK-19667][SQL]create table with hiveenabled in defau...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/17001 I'd like to treat this as a workaround; the location of the default database is still invalid in cluster-B. We can make this logic clearer and more consistent: the default database should not have a location, and when we try to get the location of the default DB, we should use the warehouse path.
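The rule proposed above (the default database never carries its own location and always resolves to the warehouse path) might be sketched as follows. `Database` and `DbLocation.resolve` are stand-ins for this example, not Spark's catalog API, and the `<warehouse>/<name>.db` fallback for other databases is an assumption borrowed from the usual Hive layout.

```scala
// Stand-in for a catalog database entry; NOT Spark's CatalogDatabase.
final case class Database(name: String, location: Option[String])

object DbLocation {
  // The default database ignores any stored location and resolves to the warehouse path.
  // Other databases keep their stored location; the ".db" fallback is an assumption.
  def resolve(db: Database, warehousePath: String): String =
    if (db.name == "default") warehousePath
    else db.location.getOrElse(s"$warehousePath/${db.name}.db")
}
```

With this rule, a `default` entry that still records a path from another cluster resolves to the current warehouse path instead of the stale one.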
[GitHub] spark issue #16744: [SPARK-19405][STREAMING] Support for cross-account Kines...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16744 **[Test build #73248 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73248/testReport)** for PR 16744 at commit [`da18da0`](https://github.com/apache/spark/commit/da18da0d98d1f4e433480de8df6f6c34b1e0fb39).
[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...
Github user Gauravshah commented on a diff in the pull request: https://github.com/apache/spark/pull/16842#discussion_r102366702 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala --- @@ -204,10 +208,11 @@ class KinesisSequenceRangeIterator( * to get records from Kinesis), and get the next shard iterator for next consumption. */ private def getRecordsAndNextKinesisIterator( - shardIterator: String): (Iterator[Record], String) = { + shardIterator: String, recordCount: Int): (Iterator[Record], String) = { val getRecordsRequest = new GetRecordsRequest getRecordsRequest.setRequestCredentials(credentials) getRecordsRequest.setShardIterator(shardIterator) +getRecordsRequest.setLimit(recordCount) --- End diff -- ð --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16842: [SPARK-19304] [Streaming] [Kinesis] fix kinesis s...
Github user Gauravshah commented on a diff in the pull request: https://github.com/apache/spark/pull/16842#discussion_r102366680 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisBackedBlockRDD.scala --- @@ -36,7 +36,8 @@ import org.apache.spark.util.NextIterator /** Class representing a range of Kinesis sequence numbers. Both sequence numbers are inclusive. */ private[kinesis] case class SequenceNumberRange( -streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String) +streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String, +recordCount: Int) --- End diff -- Not sure about upgrading, since for a code upgrade we need to delete the checkpoint directory and start afresh. I did run this patch and was able to serialize the limit into the checkpoint (not a Scala pro, though).
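A quick way to check that the extra field survives serialization is a plain Java-serialization round trip, sketched below. The case class repeats the shape shown in the diff above without the `private[kinesis]` modifier; `RoundTrip` is a hypothetical helper, and the sketch assumes checkpointing relies on Java serialization. It says nothing about reading checkpoints written by the old class shape.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Same fields as the case class in the diff above, minus the package-private modifier.
final case class SequenceNumberRange(
    streamName: String, shardId: String, fromSeqNumber: String, toSeqNumber: String,
    recordCount: Int)

object RoundTrip {
  // Serialize with plain Java serialization and read the value back.
  def apply[T <: java.io.Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

Calling `RoundTrip(SequenceNumberRange("stream", "shard-0", "1", "9", 100))` should give back an equal value with `recordCount` intact.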
[GitHub] spark pull request #16744: [SPARK-19405][STREAMING] Support for cross-accoun...
Github user budde commented on a diff in the pull request: https://github.com/apache/spark/pull/16744#discussion_r102366429 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SerializableCredentialsProvider.scala --- @@ -0,0 +1,85 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.streaming.kinesis + +import scala.collection.JavaConverters._ + +import com.amazonaws.auth._ + +import org.apache.spark.internal.Logging + +/** + * Serializable interface providing a method executors can call to obtain an + * AWSCredentialsProvider instance for authenticating to AWS services. + */ +private[kinesis] sealed trait SerializableCredentialsProvider extends Serializable { + /** + * Return an AWSCredentialProvider instance that can be used by the Kinesis Client + * Library to authenticate to AWS services (Kinesis, CloudWatch and DynamoDB). + */ + def provider: AWSCredentialsProvider +} + +/** Returns DefaultAWSCredentialsProviderChain for authentication. */ +private[kinesis] final case object DefaultCredentialsProvider + extends SerializableCredentialsProvider { + + def provider: AWSCredentialsProvider = new DefaultAWSCredentialsProviderChain +} + +/** + * Returns AWSStaticCredentialsProvider constructed using basic AWS keypair. Falls back to using + * DefaultAWSCredentialsProviderChain if unable to construct a AWSCredentialsProviderChain + * instance with the provided arguments (e.g. if they are null). + */ +private[kinesis] final case class BasicCredentialsProvider( +awsAccessKeyId: String, +awsSecretKey: String) extends SerializableCredentialsProvider with Logging { + + def provider: AWSCredentialsProvider = try { +new AWSStaticCredentialsProvider(new BasicAWSCredentials(awsAccessKeyId, awsSecretKey)) + } catch { +case e: IllegalArgumentException => + logWarning("Unable to construct AWSStaticCredentialsProvider with provided keypair; " + +s"falling back to DefaultAWSCredentialsProviderChain: $e") --- End diff -- I went ahead and updated it.
[GitHub] spark issue #15821: [SPARK-13534][WIP][PySpark] Using Apache Arrow to increa...
Github user wesm commented on the issue: https://github.com/apache/spark/pull/15821 The 0.2 Maven artifacts have been posted. I'll try to update the conda-forge packages this week -- if anyone can help with conda-forge maintenance that would be a big help. Thanks!
[GitHub] spark pull request #16744: [SPARK-19405][STREAMING] Support for cross-accoun...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/16744#discussion_r102366089 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SerializableCredentialsProvider.scala --- @@ -0,0 +1,85 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.streaming.kinesis + +import scala.collection.JavaConverters._ + +import com.amazonaws.auth._ + +import org.apache.spark.internal.Logging + +/** + * Serializable interface providing a method executors can call to obtain an + * AWSCredentialsProvider instance for authenticating to AWS services. + */ +private[kinesis] sealed trait SerializableCredentialsProvider extends Serializable { + /** + * Return an AWSCredentialProvider instance that can be used by the Kinesis Client + * Library to authenticate to AWS services (Kinesis, CloudWatch and DynamoDB). + */ + def provider: AWSCredentialsProvider +} + +/** Returns DefaultAWSCredentialsProviderChain for authentication. 
*/ +private[kinesis] final case object DefaultCredentialsProvider + extends SerializableCredentialsProvider { + + def provider: AWSCredentialsProvider = new DefaultAWSCredentialsProviderChain +} + +/** + * Returns AWSStaticCredentialsProvider constructed using basic AWS keypair. Falls back to using + * DefaultAWSCredentialsProviderChain if unable to construct a AWSCredentialsProviderChain + * instance with the provided arguments (e.g. if they are null). + */ +private[kinesis] final case class BasicCredentialsProvider( +awsAccessKeyId: String, +awsSecretKey: String) extends SerializableCredentialsProvider with Logging { + + def provider: AWSCredentialsProvider = try { +new AWSStaticCredentialsProvider(new BasicAWSCredentials(awsAccessKeyId, awsSecretKey)) + } catch { +case e: IllegalArgumentException => + logWarning("Unable to construct AWSStaticCredentialsProvider with provided keypair; " + +s"falling back to DefaultAWSCredentialsProviderChain: $e") --- End diff -- can do it in a separate PR if you like
[GitHub] spark pull request #16744: [SPARK-19405][STREAMING] Support for cross-accoun...
Github user brkyvz commented on a diff in the pull request: https://github.com/apache/spark/pull/16744#discussion_r102366055 --- Diff: external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/SerializableCredentialsProvider.scala --- @@ -0,0 +1,85 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.streaming.kinesis + +import scala.collection.JavaConverters._ + +import com.amazonaws.auth._ + +import org.apache.spark.internal.Logging + +/** + * Serializable interface providing a method executors can call to obtain an + * AWSCredentialsProvider instance for authenticating to AWS services. + */ +private[kinesis] sealed trait SerializableCredentialsProvider extends Serializable { + /** + * Return an AWSCredentialProvider instance that can be used by the Kinesis Client + * Library to authenticate to AWS services (Kinesis, CloudWatch and DynamoDB). + */ + def provider: AWSCredentialsProvider +} + +/** Returns DefaultAWSCredentialsProviderChain for authentication. 
*/ +private[kinesis] final case object DefaultCredentialsProvider + extends SerializableCredentialsProvider { + + def provider: AWSCredentialsProvider = new DefaultAWSCredentialsProviderChain +} + +/** + * Returns AWSStaticCredentialsProvider constructed using basic AWS keypair. Falls back to using + * DefaultAWSCredentialsProviderChain if unable to construct a AWSCredentialsProviderChain + * instance with the provided arguments (e.g. if they are null). + */ +private[kinesis] final case class BasicCredentialsProvider( +awsAccessKeyId: String, +awsSecretKey: String) extends SerializableCredentialsProvider with Logging { + + def provider: AWSCredentialsProvider = try { +new AWSStaticCredentialsProvider(new BasicAWSCredentials(awsAccessKeyId, awsSecretKey)) + } catch { +case e: IllegalArgumentException => + logWarning("Unable to construct AWSStaticCredentialsProvider with provided keypair; " + +s"falling back to DefaultAWSCredentialsProviderChain: $e") --- End diff -- Use `"falling back to DefaultAWSCredentialsProviderChain.", e)` instead.
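The review suggestion above amounts to passing the exception as a separate argument instead of interpolating it into the message, so the logger can keep the full throwable. A minimal sketch of that pattern follows; `SketchLogger` and `buildProvider` are hypothetical stand-ins written for illustration, not Spark's `Logging` trait or the AWS SDK classes from the diff.

```scala
object LoggingFallbackSketch {
  // Hypothetical stand-in for a logger: it records entries so they can be inspected.
  final class SketchLogger {
    var entries: Vector[(String, Option[Throwable])] = Vector.empty

    def logWarning(msg: => String): Unit =
      entries = entries :+ ((msg, Option.empty[Throwable]))

    // Overload that keeps the throwable alongside the message.
    def logWarning(msg: => String, t: Throwable): Unit =
      entries = entries :+ ((msg, Option(t)))
  }

  // Hypothetical helper: build a "provider" label from a key pair and fall back
  // to a default label when the arguments are rejected.
  def buildProvider(accessKey: String, secretKey: String, log: SketchLogger): String =
    try {
      // require throws IllegalArgumentException when the condition is false.
      require(accessKey != null && secretKey != null, "keys must not be null")
      s"static:$accessKey"
    } catch {
      case e: IllegalArgumentException =>
        // The exception is passed as its own argument, not formatted into the string.
        log.logWarning("Unable to construct static provider with provided keypair; " +
          "falling back to default provider chain.", e)
        "default-chain"
    }
}
```

With this shape the message stays a fixed string and the exception travels with it, which is what the reviewer's two-argument call asks for.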
[GitHub] spark pull request #16819: [SPARK-16441][YARN] Set maxNumExecutor depends on...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/16819#discussion_r102365927 --- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala --- @@ -1193,6 +1189,37 @@ private[spark] class Client( } } + def init(): Unit = { +launcherBackend.connect() +// Setup the credentials before doing anything else, +// so we have don't have issues at any point. +setupCredentials() +yarnClient.init(yarnConf) +yarnClient.start() + +setMaxNumExecutors() + } + + /** + * If using dynamic allocation and user doesn't set spark.dynamicAllocation.maxExecutors + * then set the max number of executors depends on yarn cluster VCores Total. + * If not using dynamic allocation don't set it. + */ + private def setMaxNumExecutors(): Unit = { +if (Utils.isDynamicAllocationEnabled(sparkConf)) { + + val defaultMaxNumExecutors = DYN_ALLOCATION_MAX_EXECUTORS.defaultValue.get + if (defaultMaxNumExecutors == sparkConf.get(DYN_ALLOCATION_MAX_EXECUTORS)) { +val executorCores = sparkConf.getInt("spark.executor.cores", 1) +val maxNumExecutors = yarnClient.getNodeReports().asScala. --- End diff -- Good suggestion. I will try the API first. Pseudo code:
```
import scala.collection.JavaConverters._

import org.apache.hadoop.yarn.api.protocolrecords._
import org.apache.hadoop.yarn.api.records._
import org.apache.hadoop.yarn.client.api.{YarnClient, YarnClientApplication}
import org.apache.hadoop.yarn.conf.YarnConfiguration

// Create and start a client from the local YARN configuration.
val yarnConf = new YarnConfiguration()
val yarnClient = YarnClient.createYarnClient
yarnClient.init(yarnConf)
yarnClient.start()

// Fetch information about the root queues.
yarnClient.getRootQueueInfos
```
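The quoted diff is cut off right where the maximum is computed, so the exact rule is not visible here. As a rough sketch of the idea in its doc comment (derive the cap from the cluster's total vcores), the following assumes the cap is total vcores divided by `spark.executor.cores`, rounded down; `NodeCapacity` is a hypothetical stand-in for a YARN node report and is not part of the Hadoop API.

```scala
object MaxExecutorsSketch {
  // Hypothetical stand-in for a YARN node report: only the vcore capacity is modelled.
  final case class NodeCapacity(host: String, vcores: Int)

  // Assumed rule: total cluster vcores divided by cores per executor, rounded down.
  def maxNumExecutors(nodes: Seq[NodeCapacity], executorCores: Int): Int = {
    require(executorCores > 0, "executorCores must be positive")
    nodes.map(_.vcores).sum / executorCores
  }
}
```

For example, two nodes with 8 and 6 vcores and 4 cores per executor would give a cap of 3 under this assumption.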
[GitHub] spark pull request #16785: [SPARK-19443][SQL] The function to generate const...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16785#discussion_r102364604 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala --- @@ -314,7 +314,17 @@ abstract class UnaryNode extends LogicalPlan { * expressions with the corresponding alias */ protected def getAliasedConstraints(projectList: Seq[NamedExpression]): Set[Expression] = { -var allConstraints = child.constraints.asInstanceOf[Set[Expression]] +val relativeReferences = AttributeSet(projectList.collect { + case a: Alias => a +}.flatMap(_.references)) ++ outputSet + +// We only care about the constraints which refer to attributes in output and aliases. +// For example, for a constraint 'a > b', if 'a' is aliased to 'c', we need to get aliased +// constraint 'c > b' only if 'b' is in output. +var allConstraints = child.constraints.filter { constraint => + constraint.references.subsetOf(relativeReferences) --- End diff -- Yes. You can see the benchmark in the PR description. With this pruning, the running time is cut in half. If I understand your comment correctly, pruning them later in `QueryPlan` means we prune constraints which don't refer to attributes in `outputSet`. But the pruning here happens before the pruning you pointed out: we need to reduce the set of constraints that go through alias substitution in order to lower the computation cost.
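The filter in the diff keeps a constraint only when every attribute it refers to is either in the output or referenced by an alias. A simplified sketch of that subset check follows; `Constraint` and `prune` are hypothetical names over plain string sets, not Catalyst's `Expression` or `AttributeSet`.

```scala
object ConstraintPruningSketch {
  // Hypothetical, simplified constraint: a label plus the attribute names it refers to.
  final case class Constraint(text: String, references: Set[String])

  // Keep only constraints whose references are all in the output or used by an alias.
  def prune(
      constraints: Set[Constraint],
      aliasReferences: Set[String],
      output: Set[String]): Set[Constraint] = {
    val relevant = aliasReferences ++ output
    constraints.filter(_.references.subsetOf(relevant))
  }
}
```

For a projection such as `a AS c, b`, the alias references are `{a}` and the output is `{c, b}`, so a constraint on `a` and `b` survives while a constraint on an unrelated attribute `d` is dropped before any alias substitution is attempted.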
[GitHub] spark issue #16938: [SPARK-19583][SQL]CTAS for data source table with a crea...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16938

> CREATE TABLE ... (PARTITIONED BY ...) LOCATION path

I think Hive's behavior makes more sense. Users may want to insert data into this table and put the data in a specified location, even if it doesn't exist at the beginning.

> CREATE TABLE ...(PARTITIONED BY ...) LOCATION path AS SELECT ...

The same reasoning applies here too.

> CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...

When users don't specify the location, they would mostly expect this to be a fresh table, so the table path should not exist.
[GitHub] spark issue #16990: [SPARK-19660][CORE][SQL] Replace the configuration prope...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/16990 @srowen @felixcheung The file names are derived from the SQL query, see: https://github.com/apache/spark/blob/v2.1.0/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala#L314 e.g. the MD5 of `set mapred.reduce.tasks=31` is `83c59d378571a6e487aa20217bd87817`, and the MD5 of `set mapreduce.job.reduces=31` is `be2c0b32a02a1154bfdee1a52530f387`. So I changed the following file names:
```
mv input12_hadoop20-0-db1cd54a4cb36de2087605f32e41824f input12_hadoop20-0-2b9ccaa793eae0e73bf76335d3d6880
mv join14_hadoop20-1-db1cd54a4cb36de2087605f32e41824f join14_hadoop20-1-2b9ccaa793eae0e73bf76335d3d6880
mv auto_join14_hadoop20-2-db1cd54a4cb36de2087605f32e41824f auto_join14_hadoop20-2-2b9ccaa793eae0e73bf76335d3d6880
mv groupby4_noskew-2-83c59d378571a6e487aa20217bd87817 groupby4_noskew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby4_map-2-83c59d378571a6e487aa20217bd87817 groupby4_map-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby4_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby4_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby7_map-3-83c59d378571a6e487aa20217bd87817 groupby7_map-3-be2c0b32a02a1154bfdee1a52530f387
mv groupby2_limit-0-83c59d378571a6e487aa20217bd87817 groupby2_limit-0-be2c0b32a02a1154bfdee1a52530f387
mv groupby6_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby6_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby5_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby5_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby5_noskew-2-83c59d378571a6e487aa20217bd87817 groupby5_noskew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby2_map-2-83c59d378571a6e487aa20217bd87817 groupby2_map-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby7_noskew-3-83c59d378571a6e487aa20217bd87817 groupby7_noskew-3-be2c0b32a02a1154bfdee1a52530f387
mv groupby1_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby1_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby8_map-2-83c59d378571a6e487aa20217bd87817 groupby8_map-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby6_noskew-2-83c59d378571a6e487aa20217bd87817 groupby6_noskew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby7_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby7_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby7_noskew_multi_single_reducer-2-83c59d378571a6e487aa20217bd87817 groupby7_noskew_multi_single_reducer-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby_map_ppr-2-83c59d378571a6e487aa20217bd87817 groupby_map_ppr-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby8_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby8_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby6_map-2-83c59d378571a6e487aa20217bd87817 groupby6_map-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby1_noskew-2-83c59d378571a6e487aa20217bd87817 groupby1_noskew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby1_limit-0-83c59d378571a6e487aa20217bd87817 groupby1_limit-0-be2c0b32a02a1154bfdee1a52530f387
mv groupby5_map-2-83c59d378571a6e487aa20217bd87817 groupby5_map-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby7_map_multi_single_reducer-2-83c59d378571a6e487aa20217bd87817 groupby7_map_multi_single_reducer-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby1_map-2-83c59d378571a6e487aa20217bd87817 groupby1_map-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby2_noskew-2-83c59d378571a6e487aa20217bd87817 groupby2_noskew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby8_noskew-2-83c59d378571a6e487aa20217bd87817 groupby8_noskew-2-be2c0b32a02a1154bfdee1a52530f387
mv groupby2_map_skew-2-83c59d378571a6e487aa20217bd87817 groupby2_map_skew-2-be2c0b32a02a1154bfdee1a52530f387
mv
```
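The renames in this message follow from the golden-file name containing a hash of the command text: change the command and the suffix changes with it. A minimal sketch of that relationship is below. `md5Hex` and `goldenFileName` are hypothetical helpers, the `<testCase>-<index>-<md5>` layout is inferred from the file names in the message, and the zero-padded hex encoding is an assumption; the helper in `HiveComparisonTest` may normalize or format the digest differently.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object QueryMd5Sketch {
  // Hex-encoded MD5 of a string's UTF-8 bytes, zero-padded to 32 characters.
  def md5Hex(text: String): String =
    MessageDigest.getInstance("MD5")
      .digest(text.getBytes(StandardCharsets.UTF_8))
      .map(b => "%02x".format(b & 0xff))
      .mkString

  // Assumed naming scheme: <testCase>-<index>-<md5 of the command>.
  def goldenFileName(testCase: String, index: Int, command: String): String =
    s"$testCase-$index-${md5Hex(command)}"
}
```

Calling `QueryMd5Sketch.goldenFileName("groupby4_map", 2, "set mapreduce.job.reduces=31")` would produce a name of the same shape as the targets above.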
[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16998 @sameeragarwal That's correct.

> By the way, as an aside we should probably allow constraint inference/propagation to be turned off via a conf flag to provide a quick workaround against these kinds of problems.

Since we use constraints in optimization, if we turn off constraint inference/propagation, wouldn't we miss optimization opportunities for query plans?
[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16928#discussion_r102364205 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -45,24 +46,41 @@ private[csv] class UnivocityParser( // A `ValueConverter` is responsible for converting the given value to a desired type. private type ValueConverter = String => Any + private val corruptFieldIndex = schema.getFieldIndex(options.columnNameOfCorruptRecord) + corruptFieldIndex.foreach { corrFieldIndex => +require(schema(corrFieldIndex).dataType == StringType, + "A field for corrupt records must have a string type") +require(schema(corrFieldIndex).nullable, "A field for corrupt must be nullable") + } + + private val inputSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord)) + CSVUtils.verifySchema(inputSchema) + private val valueConverters = -schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray +inputSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray private val parser = new CsvParser(options.asParserSettings) private var numMalformedRecords = 0 private val row = new GenericInternalRow(requiredSchema.length) - private val indexArr: Array[Int] = { + private val indexArr: Array[(Int, Int)] = { --- End diff -- add comment to explain these 2 ints.
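The first part of the quoted diff checks the corrupt-record column (string type, nullable) and then builds an input schema without it. A simplified sketch of those two steps follows; `Field` and `inputSchema` are hypothetical names that model data types as plain strings, and Spark's `StructType`/`StructField` are deliberately not used.

```scala
object CorruptFieldCheckSketch {
  // Hypothetical, simplified schema field; data types are modelled as plain strings.
  final case class Field(name: String, dataType: String, nullable: Boolean)

  // Validate the corrupt-record column if present and return the schema without it.
  def inputSchema(schema: Seq[Field], corruptColumn: String): Seq[Field] = {
    schema.find(_.name == corruptColumn).foreach { f =>
      // require throws IllegalArgumentException when a condition does not hold.
      require(f.dataType == "string", "A field for corrupt records must have a string type")
      require(f.nullable, "A field for corrupt records must be nullable")
    }
    schema.filter(_.name != corruptColumn)
  }
}
```

A schema without the corrupt column passes through unchanged, which mirrors the `foreach` over an optional index in the diff.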
[GitHub] spark pull request #16928: [SPARK-18699][SQL] Put malformed tokens into a ne...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/16928#discussion_r102364069 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala --- @@ -45,24 +46,41 @@ private[csv] class UnivocityParser( // A `ValueConverter` is responsible for converting the given value to a desired type. private type ValueConverter = String => Any + private val corruptFieldIndex = schema.getFieldIndex(options.columnNameOfCorruptRecord) + corruptFieldIndex.foreach { corrFieldIndex => +require(schema(corrFieldIndex).dataType == StringType, + "A field for corrupt records must have a string type") +require(schema(corrFieldIndex).nullable, "A field for corrupt must be nullable") + } + + private val inputSchema = StructType(schema.filter(_.name != options.columnNameOfCorruptRecord)) + CSVUtils.verifySchema(inputSchema) + private val valueConverters = -schema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray +inputSchema.map(f => makeConverter(f.name, f.dataType, f.nullable, options)).toArray private val parser = new CsvParser(options.asParserSettings) private var numMalformedRecords = 0 private val row = new GenericInternalRow(requiredSchema.length) - private val indexArr: Array[Int] = { + private val indexArr: Array[(Int, Int)] = { val fields = if (options.dropMalformed) { // If `dropMalformed` is enabled, then it needs to parse all the values // so that we can decide which row is malformed. requiredSchema ++ schema.filterNot(requiredSchema.contains(_)) } else { requiredSchema } -fields.map(schema.indexOf(_: StructField)).toArray +val fieldsWithIndexes = fields.zipWithIndex --- End diff -- isn't it just `fields.map(inputSchema.indexOf(_: StructField)).toArray`?
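The reviewer's alternative is a single lookup of each required field's position in the input schema. A small sketch of what that expression computes, using field names as plain strings; `indexArr` here is a hypothetical helper, not the member in `UnivocityParser`. One property worth noting is that `indexOf` returns -1 for an element that is not present, which is what a required corrupt-record column would get once it has been filtered out of the input schema.

```scala
object FieldIndexSketch {
  // Position of each required field name within the input schema; -1 when absent.
  def indexArr(required: Seq[String], inputSchema: Seq[String]): Array[Int] =
    required.map(name => inputSchema.indexOf(name)).toArray
}
```

Whether a plain `Array[Int]` is enough therefore depends on how the caller treats the -1 entries, which this sketch does not decide.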
[GitHub] spark issue #16998: [SPARK-19665][SQL][WIP] Improve constraint propagation
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/16998 By the way, as an aside we should probably allow constraint inference/propagation to be turned off via a conf flag to provide a quick workaround against these kinds of problems.