[jira] [Commented] (SPARK-21962) Distributed Tracing in Spark
[ https://issues.apache.org/jira/browse/SPARK-21962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541265#comment-16541265 ] Andrew Ash commented on SPARK-21962: Note that HTrace is now being removed from Hadoop – HADOOP-15566 > Distributed Tracing in Spark > > > Key: SPARK-21962 > URL: https://issues.apache.org/jira/browse/SPARK-21962 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Major > > Spark should support distributed tracing, which is the mechanism, widely > popularized by Google in the [Dapper > Paper|https://research.google.com/pubs/pub36356.html], where network requests > have additional metadata used for tracing requests between services. > This would be useful for me since I have OpenZipkin style tracing in my > distributed application up to the Spark driver, and from the executors out to > my other services, but the link is broken in Spark between driver and > executor since the Span IDs aren't propagated across that link. > An initial implementation could instrument the most important network calls > with trace ids (like launching and finishing tasks), and incrementally add > more tracing to other calls (torrent block distribution, external shuffle > service, etc) as the feature matures. > Search keywords: Dapper, Brave, OpenZipkin, HTrace -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column
[ https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347480#comment-16347480 ] Andrew Ash commented on SPARK-23274: Many thanks for the fast fix [~smilegator]! > ReplaceExceptWithFilter fails on dataframes filtered on same column > --- > > Key: SPARK-23274 > URL: https://issues.apache.org/jira/browse/SPARK-23274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Onur Satici >Assignee: Xiao Li >Priority: Blocker > Fix For: 2.3.0 > > > Currently affects: > {code:java} > $ git tag --contains 01f6ba0e7a > v2.3.0-rc1 > v2.3.0-rc2 > {code} > Steps to reproduce: > {code:java} > $ cat test.csv > a,b > 1,2 > 1,3 > 2,2 > 2,4 > {code} > {code:java} > val df = spark.read.format("csv").option("header", "true").load("test.csv") > val df1 = df.filter($"a" === 1) > val df2 = df.filter($"a" === 2) > df1.select("b").except(df2.select("b")).show > {code} > results in: > {code:java} > java.util.NoSuchElementException: key not found: a > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60) > at > 
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.tree
[jira] [Commented] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column
[ https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345539#comment-16345539 ] Andrew Ash commented on SPARK-23274: Suspect this regression was introduced by [https://github.com/apache/spark/commit/01f6ba0e7a] > ReplaceExceptWithFilter fails on dataframes filtered on same column > --- > > Key: SPARK-23274 > URL: https://issues.apache.org/jira/browse/SPARK-23274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Onur Satici >Priority: Major > > Currently affects: > {code:java} > $ git tag --contains 01f6ba0e7a > v2.3.0-rc1 > v2.3.0-rc2 > {code} > Steps to reproduce: > {code:java} > $ cat test.csv > a,b > 1,2 > 1,3 > 2,2 > 2,4 > {code} > {code:java} > val df = spark.read.format("csv").option("header", "true").load("test.csv") > val df1 = df.filter($"a" === 1) > val df2 = df.filter($"a" === 2) > df1.select("b").except(df2.select("b")).show > {code} > results in: > {code:java} > java.util.NoSuchElementException: key not found: a > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60) > at > 
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$
[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330943#comment-16330943 ] Andrew Ash commented on SPARK-22982: [~joshrosen] do you have some example stacktraces of what this bug can cause? Several of our clusters hit what I think is this problem earlier this month, see below for details. For a few days in January (4th through 12th) on our AWS infra, we observed massively degraded disk read throughput (down to 33% of previous peaks). During this time, we also began observing intermittent exceptions coming from Spark at read time of parquet files that a previous Spark job had written. When the read throughput recovered on the 12th, we stopped observing the exceptions and haven't seen them since. At first we observed this stacktrace when reading .snappy.parquet files: {noformat} java.lang.RuntimeException: java.io.IOException: could not read page Page [bytes.size=1048641, valueCount=29945, uncompressedSize=1048641] in col [my_column] BINARY at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:493) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:486) at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:96) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:486) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:157) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:229) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:398) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: could not read page Page [bytes.size=1048641, valueCount=29945, uncompressedSize=1048641] in col [my_column] BINARY at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV1(VectorizedColumnReader.java:562) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$000(VectorizedColumnReader.java:47) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:490) ... 31 more Caused by: java.io
[jira] [Commented] (SPARK-22725) df.select on a Stream is broken, vs a List
[ https://issues.apache.org/jira/browse/SPARK-22725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281354#comment-16281354 ] Andrew Ash commented on SPARK-22725: Demonstration of difference between {{.map}} on List vs Stream: {noformat} scala> Stream(1,2,3).map { println(_) } 1 res1: scala.collection.immutable.Stream[Unit] = Stream((), ?) scala> List(1,2,3).map { println(_) } 1 2 3 res2: List[Unit] = List((), (), ()) scala> {noformat} > df.select on a Stream is broken, vs a List > -- > > Key: SPARK-22725 > URL: https://issues.apache.org/jira/browse/SPARK-22725 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Andrew Ash > > See failing test at https://github.com/apache/spark/pull/19917 > Failing: > {noformat} > test("SPARK-ABC123: support select with a splatted stream") { > val df = spark.createDataFrame(sparkContext.emptyRDD[Row], > StructType(List("bar", "foo").map { > StructField(_, StringType, false) > })) > val allColumns = Stream(df.col("bar"), col("foo")) > val result = df.select(allColumns : _*) > } > {noformat} > Succeeds: > {noformat} > test("SPARK-ABC123: support select with a splatted stream") { > val df = spark.createDataFrame(sparkContext.emptyRDD[Row], > StructType(List("bar", "foo").map { > StructField(_, StringType, false) > })) > val allColumns = Seq(df.col("bar"), col("foo")) > val result = df.select(allColumns : _*) > } > {noformat} > After stepping through in a debugger, the difference manifests at > https://github.com/apache/spark/blob/8ae004b4602266d1f210e4c1564246d590412c06/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L120 > Changing {{seq.map}} to {{seq.toList.map}} causes the test to pass. > I think there's a very subtle bug here where the {{Seq}} of column names > passed into {{select}} is expected to eagerly evaluate when {{.map}} is > called on it, even though that's not part of the {{Seq}} contract. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
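A sketch of a possible user-side workaround, assuming a spark-shell session (so {{spark}} is in scope): force the lazy Stream into a strict List before splatting it into {{select}}, the same idea as the {{seq.toList.map}} change mentioned in the description.

{noformat}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row],
  StructType(List("bar", "foo").map { StructField(_, StringType, false) }))

// Stream is lazy; materialize it before handing it to the varargs API.
val allColumns = Stream(df.col("bar"), col("foo"))
val result = df.select(allColumns.toList: _*)
{noformat}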
[jira] [Created] (SPARK-22725) df.select on a Stream is broken, vs a List
Andrew Ash created SPARK-22725: -- Summary: df.select on a Stream is broken, vs a List Key: SPARK-22725 URL: https://issues.apache.org/jira/browse/SPARK-22725 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Andrew Ash See failing test at https://github.com/apache/spark/pull/19917 Failing: {noformat} test("SPARK-ABC123: support select with a splatted stream") { val df = spark.createDataFrame(sparkContext.emptyRDD[Row], StructType(List("bar", "foo").map { StructField(_, StringType, false) })) val allColumns = Stream(df.col("bar"), col("foo")) val result = df.select(allColumns : _*) } {noformat} Succeeds: {noformat} test("SPARK-ABC123: support select with a splatted stream") { val df = spark.createDataFrame(sparkContext.emptyRDD[Row], StructType(List("bar", "foo").map { StructField(_, StringType, false) })) val allColumns = Seq(df.col("bar"), col("foo")) val result = df.select(allColumns : _*) } {noformat} After stepping through in a debugger, the difference manifests at https://github.com/apache/spark/blob/8ae004b4602266d1f210e4c1564246d590412c06/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala#L120 Changing {{seq.map}} to {{seq.toList.map}} causes the test to pass. I think there's a very subtle bug here where the {{Seq}} of column names passed into {{select}} is expected to eagerly evaluate when {{.map}} is called on it, even though that's not part of the {{Seq}} contract. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22479) SaveIntoDataSourceCommand logs jdbc credentials
[ https://issues.apache.org/jira/browse/SPARK-22479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246328#comment-16246328 ] Andrew Ash commented on SPARK-22479: Completely agree that credentials shouldn't be in the toString since query plans are logged in many places. This looks like it brings SaveIntoDataSourceCommand more in-line with JdbcRelation, which also currently redacts credentials from its toString to avoid them being written to logs. > SaveIntoDataSourceCommand logs jdbc credentials > --- > > Key: SPARK-22479 > URL: https://issues.apache.org/jira/browse/SPARK-22479 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Onur Satici > > JDBC credentials are not redacted in plans including a > 'SaveIntoDataSourceCommand'. > Steps to reproduce: > {code} > spark-shell --packages org.postgresql:postgresql:42.1.1 > {code} > {code} > import org.apache.spark.sql.execution.QueryExecution > import org.apache.spark.sql.util.QueryExecutionListener > val listener = new QueryExecutionListener { > override def onFailure(funcName: String, qe: QueryExecution, exception: > Exception): Unit = {} > override def onSuccess(funcName: String, qe: QueryExecution, duration: > Long): Unit = { > System.out.println(qe.toString()) > } > } > spark.listenerManager.register(listener) > spark.range(100).write.format("jdbc").option("url", > "jdbc:postgresql:sparkdb").option("password", "pass").option("driver", > "org.postgresql.Driver").option("dbtable", "test").save() > {code} > The above will yield the following plan: > {code} > == Parsed Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Analyzed Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Optimized Logical Plan == > SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists >+- Range (0, 100, step=1, splits=Some(8)) > == Physical Plan == > ExecutedCommand >+- SaveIntoDataSourceCommand jdbc, Map(dbtable -> test10, driver -> > org.postgresql.Driver, url -> jdbc:postgresql:sparkdb, password -> pass), > ErrorIfExists > +- Range (0, 100, step=1, splits=Some(8)) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
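For reference, a minimal sketch of the redaction idea (illustrative only, not the actual SaveIntoDataSourceCommand or JdbcRelation code): mask values for sensitive-looking keys before the options map is ever rendered into a plan string or log line.

{noformat}
val options = Map(
  "url" -> "jdbc:postgresql:sparkdb",
  "driver" -> "org.postgresql.Driver",
  "dbtable" -> "test",
  "password" -> "pass")

// The set of keys treated as sensitive here is an assumption for the example.
val sensitiveKeys = Set("password", "secret", "token")
val redacted = options.map { case (k, v) =>
  if (sensitiveKeys.contains(k.toLowerCase)) (k, "*********(redacted)") else (k, v)
}

println(redacted.mkString("Map(", ", ", ")"))
// password now prints as *********(redacted) instead of the real value
{noformat}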
[jira] [Created] (SPARK-22470) Doc that functions.hash is also used internally for shuffle and bucketing
Andrew Ash created SPARK-22470: -- Summary: Doc that functions.hash is also used internally for shuffle and bucketing Key: SPARK-22470 URL: https://issues.apache.org/jira/browse/SPARK-22470 Project: Spark Issue Type: Documentation Components: Documentation, SQL Affects Versions: 2.2.0 Reporter: Andrew Ash https://issues.apache.org/jira/browse/SPARK-12480 added a hash function that appears to be the same hash function as what Spark uses internally for shuffle and bucketing. One of my users would like to bake this assumption into code, but is unsure if it's a guarantee or a coincidence that they're the same function. Would it be considered an API break if at some point the two functions were different, or if the implementation of both changed together? We should add a line to the scaladoc to clarify. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
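For context, the public function itself is straightforward to call; whether its output is guaranteed to match the hash used internally for shuffle and bucketing is exactly the question this ticket asks to document. A quick spark-shell example:

{noformat}
import org.apache.spark.sql.functions.{col, hash}

// hash() takes one or more columns and returns a 32-bit hash per row.
val df = spark.range(5).withColumn("h", hash(col("id")))
df.show()
{noformat}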
[jira] [Commented] (SPARK-22042) ReorderJoinPredicates can break when child's partitioning is not decided
[ https://issues.apache.org/jira/browse/SPARK-22042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220734#comment-16220734 ] Andrew Ash commented on SPARK-22042: Hi I'm seeing this problem as well, thanks for investigating and putting up a PR [~tejasp]! Have you been running any of your clusters with a patched version of Spark including that change, and has it been behaving as expected? The repro one of my users independently provided was this: {noformat} val rows = List(1, 2, 3, 4, 5, 6); val df1 = sc.parallelize(rows).toDF("col").repartition(1); val df2 = sc.parallelize(rows).toDF("col").repartition(2); val df3 = sc.parallelize(rows).toDF("col").repartition(2); val dd1 = df1.join(df2, df1.col("col").equalTo(df2.col("col"))).join(df3, df2.col("col").equalTo(df3.col("col"))); dd1.show; {noformat} > ReorderJoinPredicates can break when child's partitioning is not decided > > > Key: SPARK-22042 > URL: https://issues.apache.org/jira/browse/SPARK-22042 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Tejas Patil >Priority: Minor > > When `ReorderJoinPredicates` tries to get the `outputPartitioning` of its > children, the children may not be properly constructed as the child-subtree > has to still go through other planner rules. > In this particular case, the child is `SortMergeJoinExec`. Since the required > `Exchange` operators are not in place (because `EnsureRequirements` runs > _after_ `ReorderJoinPredicates`), the join's children would not have > partitioning defined. This breaks while creation the `PartitioningCollection` > here : > https://github.com/apache/spark/blob/94439997d57875838a8283c543f9b44705d3a503/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala#L69 > Small repro: > {noformat} > context.sql("SET spark.sql.autoBroadcastJoinThreshold=0") > val df = (0 until 50).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", > "k") > df.write.format("parquet").saveAsTable("table1") > df.write.format("parquet").saveAsTable("table2") > df.write.format("parquet").bucketBy(8, "j", "k").saveAsTable("bucketed_table") > sql(""" > SELECT * > FROM ( > SELECT a.i, a.j, a.k > FROM bucketed_table a > JOIN table1 b > ON a.i = b.i > ) c > JOIN table2 > ON c.i = table2.i > """).explain > {noformat} > This fails with : > {noformat} > java.lang.IllegalArgumentException: requirement failed: > PartitioningCollection requires all of its partitionings have the same > numPartitions. 
> at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.(partitioning.scala:324) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:69) > at > org.apache.spark.sql.execution.ProjectExec.outputPartitioning(basicPhysicalOperators.scala:82) > at > org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:91) > at > org.apache.spark.sql.execution.joins.ReorderJoinPredicates$$anonfun$apply$1.applyOrElse(ReorderJoinPredicates.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:76) > at > org.apache.spark.sql.execution.joins.ReorderJoinPredicates.apply(ReorderJoinPredicates.scala:34) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:100) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:100) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$simpleString$1.apply(QueryExecution.scala:201) > at > org.apache.spark.sql.execution.QueryExecution.stringOrErro
[jira] [Commented] (SPARK-21991) [LAUNCHER] LauncherServer acceptConnections thread sometimes dies if machine has very high load
[ https://issues.apache.org/jira/browse/SPARK-21991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219139#comment-16219139 ] Andrew Ash commented on SPARK-21991: Thanks for the contribution to Spark [~nivox]! I'll be testing this on some clusters of mine and will echo back here if I see any problems. Cheers! > [LAUNCHER] LauncherServer acceptConnections thread sometime dies if machine > has very high load > -- > > Key: SPARK-21991 > URL: https://issues.apache.org/jira/browse/SPARK-21991 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.2, 2.1.0, 2.1.1, 2.2.0 > Environment: Single node machine running Ubuntu 16.04.2 LTS > (4.4.0-79-generic) > YARN 2.7.2 > Spark 2.0.2 >Reporter: Andrea Zito >Assignee: Andrea Zito >Priority: Minor > Fix For: 2.0.3, 2.1.3, 2.2.1, 2.3.0 > > > The way the _LauncherServer_ _acceptConnections_ thread schedules client > timeouts causes (non-deterministically) the thread to die with the following > exception if the machine is under very high load: > {noformat} > Exception in thread "LauncherServer-1" java.lang.IllegalStateException: Task > already scheduled or cancelled > at java.util.Timer.sched(Timer.java:401) > at java.util.Timer.schedule(Timer.java:193) > at > org.apache.spark.launcher.LauncherServer.acceptConnections(LauncherServer.java:249) > at > org.apache.spark.launcher.LauncherServer.access$000(LauncherServer.java:80) > at > org.apache.spark.launcher.LauncherServer$1.run(LauncherServer.java:143) > {noformat} > The issue is related to the ordering of actions that the _acceptConnections_ > thread uses to handle a client connection: > # create timeout action > # create client thread > # start client thread > # schedule timeout action > Under normal conditions the scheduling of the timeout action happen before > the client thread has a chance to start, however if the machine is under very > high load the client thread can receive CPU time before the timeout action > gets scheduled. > If this condition happen, the client thread cancel the timeout action (which > is not yet been scheduled) and goes on, but as soon as the > _acceptConnections_ thread gets the CPU back, it will try to schedule the > timeout action (which has already been canceled) thus raising the exception. > Changing the order in which the client thread gets started and the timeout > gets scheduled seems to be sufficient to fix this issue. > As stated above the issue is non-deterministic, I faced the issue multiple > times on a single-node machine submitting a high number of short jobs > sequentially, but I couldn't easily create a test reproducing the issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21991) [LAUNCHER] LauncherServer acceptConnections thread sometimes dies if machine has very high load
[ https://issues.apache.org/jira/browse/SPARK-21991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216529#comment-16216529 ] Andrew Ash commented on SPARK-21991: Thanks for debugging and diagnosing this [~nivox]! I'm seeing the same issue right now on one of my Spark clusters so am interested in getting your fix in to mainline Spark for my users. Have you deployed the change from your linked PR in a live setting, and has it fixed the issue for you? > [LAUNCHER] LauncherServer acceptConnections thread sometime dies if machine > has very high load > -- > > Key: SPARK-21991 > URL: https://issues.apache.org/jira/browse/SPARK-21991 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.2, 2.1.0, 2.1.1, 2.2.0 > Environment: Single node machine running Ubuntu 16.04.2 LTS > (4.4.0-79-generic) > YARN 2.7.2 > Spark 2.0.2 >Reporter: Andrea Zito >Priority: Minor > > The way the _LauncherServer_ _acceptConnections_ thread schedules client > timeouts causes (non-deterministically) the thread to die with the following > exception if the machine is under very high load: > {noformat} > Exception in thread "LauncherServer-1" java.lang.IllegalStateException: Task > already scheduled or cancelled > at java.util.Timer.sched(Timer.java:401) > at java.util.Timer.schedule(Timer.java:193) > at > org.apache.spark.launcher.LauncherServer.acceptConnections(LauncherServer.java:249) > at > org.apache.spark.launcher.LauncherServer.access$000(LauncherServer.java:80) > at > org.apache.spark.launcher.LauncherServer$1.run(LauncherServer.java:143) > {noformat} > The issue is related to the ordering of actions that the _acceptConnections_ > thread uses to handle a client connection: > # create timeout action > # create client thread > # start client thread > # schedule timeout action > Under normal conditions the scheduling of the timeout action happen before > the client thread has a chance to start, however if the machine is under very > high load the client thread can receive CPU time before the timeout action > gets scheduled. > If this condition happen, the client thread cancel the timeout action (which > is not yet been scheduled) and goes on, but as soon as the > _acceptConnections_ thread gets the CPU back, it will try to schedule the > timeout action (which has already been canceled) thus raising the exception. > Changing the order in which the client thread gets started and the timeout > gets scheduled seems to be sufficient to fix this issue. > As stated above the issue is non-deterministic, I faced the issue multiple > times on a single-node machine submitting a high number of short jobs > sequentially, but I couldn't easily create a test reproducing the issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
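The failure mode can be shown with plain JDK classes, independent of Spark (a minimal sketch, not the LauncherServer code itself): java.util.Timer refuses to schedule a TimerTask that has already been cancelled, which is exactly the IllegalStateException in the stack trace above.

{noformat}
import java.util.{Timer, TimerTask}

val timer = new Timer("timeout-timer")
val timeout = new TimerTask { override def run(): Unit = println("client timed out") }

// If the client handler wins the race and cancels the timeout before the
// accept loop has scheduled it...
timeout.cancel()

// ...this later schedule() call throws
// java.lang.IllegalStateException: Task already scheduled or cancelled
timer.schedule(timeout, 10000L)
{noformat}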
[jira] [Commented] (SPARK-22204) Explain output for SQL with commands shows no optimization
[ https://issues.apache.org/jira/browse/SPARK-22204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208676#comment-16208676 ] Andrew Ash commented on SPARK-22204: One way to work around this issue could be by getting the child of the command node and running explain on that. This does do the query planning twice though. See also discussion at https://github.com/apache/spark/pull/19269#discussion_r139841435 > Explain output for SQL with commands shows no optimization > -- > > Key: SPARK-22204 > URL: https://issues.apache.org/jira/browse/SPARK-22204 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Andrew Ash > > When displaying the explain output for a basic SELECT query, the query plan > changes as expected from analyzed -> optimized stages. But when putting that > same query into a command, for example {{CREATE TABLE}} it appears that the > optimization doesn't take place. > In Spark shell: > Explain output for a {{SELECT}} statement shows optimization: > {noformat} > scala> spark.sql("SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) > AS b) AS c) AS d").explain(true) > == Parsed Logical Plan == > 'Project ['a] > +- 'SubqueryAlias d >+- 'Project ['a] > +- 'SubqueryAlias c > +- 'Project ['a] > +- SubqueryAlias b >+- Project [1 AS a#29] > +- OneRowRelation > == Analyzed Logical Plan == > a: int > Project [a#29] > +- SubqueryAlias d >+- Project [a#29] > +- SubqueryAlias c > +- Project [a#29] > +- SubqueryAlias b >+- Project [1 AS a#29] > +- OneRowRelation > == Optimized Logical Plan == > Project [1 AS a#29] > +- OneRowRelation > == Physical Plan == > *Project [1 AS a#29] > +- Scan OneRowRelation[] > scala> > {noformat} > But the same command run inside {{CREATE TABLE}} does not: > {noformat} > scala> spark.sql("CREATE TABLE IF NOT EXISTS tmptable AS SELECT a FROM > (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d").explain(true) > == Parsed Logical Plan == > 'CreateTable `tmptable`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > Ignore > +- 'Project ['a] >+- 'SubqueryAlias d > +- 'Project ['a] > +- 'SubqueryAlias c > +- 'Project ['a] >+- SubqueryAlias b > +- Project [1 AS a#33] > +- OneRowRelation > == Analyzed Logical Plan == > CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, > InsertIntoHiveTable] >+- Project [a#33] > +- SubqueryAlias d > +- Project [a#33] > +- SubqueryAlias c >+- Project [a#33] > +- SubqueryAlias b > +- Project [1 AS a#33] > +- OneRowRelation > == Optimized Logical Plan == > CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, > InsertIntoHiveTable] >+- Project [a#33] > +- SubqueryAlias d > +- Project [a#33] > +- SubqueryAlias c >+- Project [a#33] > +- SubqueryAlias b > +- Project [1 AS a#33] > +- OneRowRelation > == Physical Plan == > CreateHiveTableAsSelectCommand CreateHiveTableAsSelectCommand > [Database:default}, TableName: tmptable, InsertIntoHiveTable] >+- Project [a#33] > +- SubqueryAlias d > +- Project [a#33] > +- SubqueryAlias c >+- Project [a#33] > +- SubqueryAlias b > +- Project [1 AS a#33] > +- OneRowRelation > scala> > {noformat} > Note that there is no change between the analyzed and optimized plans when > run in a command. > This is misleading my users, as they think that there is no optimization > happening in the query! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
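A rough spark-shell approximation of that workaround (a sketch, not an exact equivalent of pulling the child out of the command node): explain the SELECT on its own to see its optimized plan, then run the command, accepting that the query gets planned twice.

{noformat}
val query = "SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c"

// The standalone query shows the expected analyzed -> optimized collapse.
spark.sql(query).explain(true)

// The command itself then plans the same query a second time.
spark.sql(s"CREATE TABLE IF NOT EXISTS tmptable AS $query")
{noformat}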
[jira] [Commented] (SPARK-22269) Java style checks should be run in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202795#comment-16202795 ] Andrew Ash commented on SPARK-22269: [~sowen] you closed this as a duplicate. What issue is it a duplicate of? > Java style checks should be run in Jenkins > -- > > Key: SPARK-22269 > URL: https://issues.apache.org/jira/browse/SPARK-22269 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.3.0 >Reporter: Andrew Ash >Priority: Minor > > A few times now I've gone to build the master branch and it's failed due to > Java style errors, which I've sent in PRs to fix: > - https://issues.apache.org/jira/browse/SPARK-22268 > - https://issues.apache.org/jira/browse/SPARK-21875 > Digging through the history a bit, it looks like this check used to run on > Jenkins and was previously enabled at > https://github.com/apache/spark/pull/10763 but then reverted at > https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5 > We should work out what it takes to enable the Java check in Jenkins so these > kinds of errors are caught in CI rather than afterwards post-merge. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22268) Fix java style errors
[ https://issues.apache.org/jira/browse/SPARK-22268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202793#comment-16202793 ] Andrew Ash commented on SPARK-22268: Any time {{./dev/run-tests}} is failing I consider that a bug. > Fix java style errors > - > > Key: SPARK-22268 > URL: https://issues.apache.org/jira/browse/SPARK-22268 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Andrew Ash >Priority: Trivial > > {{./dev/lint-java}} fails on master right now with these exceptions: > {noformat} > [ERROR] > src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[176] > (sizes) LineLength: Line is longer than 100 characters (found 112). > [ERROR] > src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[177] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] > src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[178] > (sizes) LineLength: Line is longer than 100 characters (found 136). > [ERROR] > src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[520] > (sizes) LineLength: Line is longer than 100 characters (found 104). > [ERROR] > src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[524] > (sizes) LineLength: Line is longer than 100 characters (found 123). > [ERROR] > src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[533] > (sizes) LineLength: Line is longer than 100 characters (found 120). > [ERROR] > src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[535] > (sizes) LineLength: Line is longer than 100 characters (found 114). > [ERROR] > src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java:[182] > (sizes) LineLength: Line is longer than 100 characters (found 116). > [ERROR] > src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java:[21,8] > (imports) UnusedImports: Unused import - > org.apache.spark.sql.catalyst.expressions.Expression. > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22269) Java style checks should be run in Jenkins
Andrew Ash created SPARK-22269: -- Summary: Java style checks should be run in Jenkins Key: SPARK-22269 URL: https://issues.apache.org/jira/browse/SPARK-22269 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.3.0 Reporter: Andrew Ash A few times now I've gone to build the master branch and it's failed due to Java style errors, which I've sent in PRs to fix: - https://issues.apache.org/jira/browse/SPARK-22268 - https://issues.apache.org/jira/browse/SPARK-21875 Digging through the history a bit, it looks like this check used to run on Jenkins and was previously enabled at https://github.com/apache/spark/pull/10763 but then reverted at https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5 We should work out what it takes to enable the Java check in Jenkins so these kinds of errors are caught in CI rather than afterwards post-merge. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22268) Fix java style errors
Andrew Ash created SPARK-22268: -- Summary: Fix java style errors Key: SPARK-22268 URL: https://issues.apache.org/jira/browse/SPARK-22268 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Andrew Ash {{./dev/lint-java}} fails on master right now with these exceptions: {noformat} [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[176] (sizes) LineLength: Line is longer than 100 characters (found 112). [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[177] (sizes) LineLength: Line is longer than 100 characters (found 106). [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[178] (sizes) LineLength: Line is longer than 100 characters (found 136). [ERROR] src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[520] (sizes) LineLength: Line is longer than 100 characters (found 104). [ERROR] src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[524] (sizes) LineLength: Line is longer than 100 characters (found 123). [ERROR] src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[533] (sizes) LineLength: Line is longer than 100 characters (found 120). [ERROR] src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorterSuite.java:[535] (sizes) LineLength: Line is longer than 100 characters (found 114). [ERROR] src/test/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorterSuite.java:[182] (sizes) LineLength: Line is longer than 100 characters (found 116). [ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsPushDownFilters.java:[21,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.catalyst.expressions.Expression. {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18359) Let user specify locale in CSV parsing
[ https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200831#comment-16200831 ] Andrew Ash commented on SPARK-18359: I agree with Sean -- using the submitting JVM's locale is insufficient. Some users will want to join CSVs with different locales, and that doesn't offer a means of using different locales on different files. This really should be specified on a per-csv basis, which is why I think the direction [~abicz] suggested supports this nicely: {noformat} spark.read.option("locale", "value").csv("filefromeurope.csv") {noformat} where the value is something we can use to get a {{java.util.Locale}} Has anyone worked around this functionality gap in a useful way, or do we still need to work through these Locale issues in spark-csv? > Let user specify locale in CSV parsing > -- > > Key: SPARK-18359 > URL: https://issues.apache.org/jira/browse/SPARK-18359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: yannick Radji > > On the DataFrameReader object there no CSV-specific option to set decimal > delimiter on comma whereas dot like it use to be in France and Europe. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
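One possible user-side stopgap while the reader has no per-file locale option (a sketch; the column name and file name are placeholders): read the affected field as a string and parse it with an explicit {{java.util.Locale}} in a UDF rather than relying on the JVM default.

{noformat}
import java.text.NumberFormat
import java.util.Locale
import org.apache.spark.sql.functions.{col, udf}

// Parse "3,14"-style decimals using an explicit locale.
val parseFrDecimal = udf { s: String =>
  NumberFormat.getInstance(Locale.FRANCE).parse(s).doubleValue()
}

val df = spark.read.option("header", "true").csv("filefromeurope.csv")
df.withColumn("price", parseFrDecimal(col("price"))).show()
{noformat}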
[jira] [Commented] (SPARK-20055) Documentation for CSV datasets in SQL programming guide
[ https://issues.apache.org/jira/browse/SPARK-20055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193355#comment-16193355 ] Andrew Ash commented on SPARK-20055: What I would find most useful is a list of available options and their behaviors, like https://github.com/databricks/spark-csv#features Even though that project has been in-lined into Apache Spark, that github page is still the best reference for csv options > Documentation for CSV datasets in SQL programming guide > --- > > Key: SPARK-20055 > URL: https://issues.apache.org/jira/browse/SPARK-20055 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > I guess things commonly used and important are documented there rather than > documenting everything and every option in the programming guide - > http://spark.apache.org/docs/latest/sql-programming-guide.html. > It seems JSON datasets > http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets > are documented whereas CSV datasets are not. > Nowadays, they are pretty similar in APIs and options. Some options are > notable for both, In particular, ones such as {{wholeFile}}. Moreover, > several options such as {{inferSchema}} and {{header}} are important in CSV > that affect the type/column name of data. > In that sense, I think we might better document CSV datasets with some > examples too because I believe reading CSV is pretty much common use cases. > Also, I think we could also leave some pointers for options of API > documentations for both (rather than duplicating the documentation). > So, my suggestion is, > - Add CSV Datasets section. > - Add links for options for both JSON and CSV that point each API > documentation > - Fix trivial minor fixes together in both sections. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
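As a sense of what such a section could show, here are a few of the commonly used CSV reader options that already exist in 2.x (the file name is a placeholder):

{noformat}
val df = spark.read
  .option("header", "true")        // first line contains column names
  .option("inferSchema", "true")   // extra pass over the data to infer column types
  .option("sep", ",")              // field delimiter
  .option("nullValue", "")         // string to read back as null
  .option("mode", "PERMISSIVE")    // or FAILFAST / DROPMALFORMED for corrupt records
  .csv("data.csv")

df.printSchema()
{noformat}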
[jira] [Created] (SPARK-22204) Explain output for SQL with commands shows no optimization
Andrew Ash created SPARK-22204: -- Summary: Explain output for SQL with commands shows no optimization Key: SPARK-22204 URL: https://issues.apache.org/jira/browse/SPARK-22204 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Andrew Ash When displaying the explain output for a basic SELECT query, the query plan changes as expected from analyzed -> optimized stages. But when putting that same query into a command, for example {{CREATE TABLE}} it appears that the optimization doesn't take place. In Spark shell: Explain output for a {{SELECT}} statement shows optimization: {noformat} scala> spark.sql("SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d").explain(true) == Parsed Logical Plan == 'Project ['a] +- 'SubqueryAlias d +- 'Project ['a] +- 'SubqueryAlias c +- 'Project ['a] +- SubqueryAlias b +- Project [1 AS a#29] +- OneRowRelation == Analyzed Logical Plan == a: int Project [a#29] +- SubqueryAlias d +- Project [a#29] +- SubqueryAlias c +- Project [a#29] +- SubqueryAlias b +- Project [1 AS a#29] +- OneRowRelation == Optimized Logical Plan == Project [1 AS a#29] +- OneRowRelation == Physical Plan == *Project [1 AS a#29] +- Scan OneRowRelation[] scala> {noformat} But the same command run inside {{CREATE TABLE}} does not: {noformat} scala> spark.sql("CREATE TABLE IF NOT EXISTS tmptable AS SELECT a FROM (SELECT a FROM (SELECT a FROM (SELECT 1 AS a) AS b) AS c) AS d").explain(true) == Parsed Logical Plan == 'CreateTable `tmptable`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Ignore +- 'Project ['a] +- 'SubqueryAlias d +- 'Project ['a] +- 'SubqueryAlias c +- 'Project ['a] +- SubqueryAlias b +- Project [1 AS a#33] +- OneRowRelation == Analyzed Logical Plan == CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, InsertIntoHiveTable] +- Project [a#33] +- SubqueryAlias d +- Project [a#33] +- SubqueryAlias c +- Project [a#33] +- SubqueryAlias b +- Project [1 AS a#33] +- OneRowRelation == Optimized Logical Plan == CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, InsertIntoHiveTable] +- Project [a#33] +- SubqueryAlias d +- Project [a#33] +- SubqueryAlias c +- Project [a#33] +- SubqueryAlias b +- Project [1 AS a#33] +- OneRowRelation == Physical Plan == CreateHiveTableAsSelectCommand CreateHiveTableAsSelectCommand [Database:default}, TableName: tmptable, InsertIntoHiveTable] +- Project [a#33] +- SubqueryAlias d +- Project [a#33] +- SubqueryAlias c +- Project [a#33] +- SubqueryAlias b +- Project [1 AS a#33] +- OneRowRelation scala> {noformat} Note that there is no change between the analyzed and optimized plans when run in a command. This is misleading my users, as they think that there is no optimization happening in the query! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash reopened SPARK-18016: // reopening issue One PR addressing this bug has been merged -- https://github.com/apache/spark/pull/18075 -- but the second PR hasn't gone in yet: https://github.com/apache/spark/pull/16648 I'm still observing the error message on a dataset with ~3000 columns: {noformat} java.lang.RuntimeException: Error while encoding: org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass2 has grown past JVM limit of 0x {noformat} even when running with the first PR, and [~jamcon] reported similarly at https://issues.apache.org/jira/browse/SPARK-18016?focusedCommentId=16103853&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16103853 > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Aleksander Eskilson > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.code
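For a sense of the workload shape involved (illustrative only; the exact column count and expressions needed to overflow the constant pool depend on the schema and the generated code, so this is not a guaranteed reproduction):

{noformat}
import org.apache.spark.sql.functions.{col, concat, lit}

// A single projection with a few thousand derived columns -- the kind of very
// wide plan whose generated code can exceed a class's 64K constant pool.
val wide = spark.range(10).select(
  (1 to 3000).map(i => concat(col("id").cast("string"), lit(i.toString)).as(s"c$i")): _*)
wide.write.mode("overwrite").parquet("/tmp/wide-test")
{noformat}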
[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178459#comment-16178459 ] Andrew Ash commented on SPARK-19700: There was a thread on the dev list recently about Apache Aurora: http://aurora.apache.org/ > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah > > One point that was brought up in discussing SPARK-18278 was that schedulers > cannot easily be added to Spark without forking the whole project. The main > reason is that much of the scheduler's behavior fundamentally depends on the > CoarseGrainedSchedulerBackend class, which is not part of the public API of > Spark and is in fact quite a complex module. As resource management and > allocation continues evolves, Spark will need to be integrated with more > cluster managers, but maintaining support for all possible allocators in the > Spark project would be untenable. Furthermore, it would be impossible for > Spark to support proprietary frameworks that are developed by specific users > for their other particular use cases. > Therefore, this ticket proposes making scheduler implementations fully > pluggable. The idea is that Spark will provide a Java/Scala interface that is > to be implemented by a scheduler that is backed by the cluster manager of > interest. The user can compile their scheduler's code into a JAR that is > placed on the driver's classpath. Finally, as is the case in the current > world, the scheduler implementation is selected and dynamically loaded > depending on the user's provided master URL. > Determining the correct API is the most challenging problem. The current > CoarseGrainedSchedulerBackend handles many responsibilities, some of which > will be common across all cluster managers, and some which will be specific > to a particular cluster manager. For example, the particular mechanism for > creating the executor processes will differ between YARN and Mesos, but, once > these executors have started running, the means to submit tasks to them over > the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the > application as well, because different cluster managers support different > configuration options, and thus the driver must be bootstrapped accordingly. > For example, in YARN mode the application and Hadoop configuration must be > packaged and shipped to the distributed cache prior to launching the job. A > prototype of a Kubernetes implementation starts a Kubernetes pod that runs > the driver in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22112) Add missing method to pyspark api: spark.read.csv(Dataset)
Andrew Ash created SPARK-22112: -- Summary: Add missing method to pyspark api: spark.read.csv(Dataset) Key: SPARK-22112 URL: https://issues.apache.org/jira/browse/SPARK-22112 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.2.0 Reporter: Andrew Ash https://issues.apache.org/jira/browse/SPARK-15463 added a method to the scala API without adding an equivalent in pyspark: {{spark.read.csv(Dataset)}} I was writing some things with pyspark but had to switch it to scala/java to use that method -- since equivalency between python/java/scala is a Spark goal, we should make sure this functionality exists in all the supported languages. https://github.com/apache/spark/pull/16854/files#diff-f70bda59304588cc3abfa3a9840653f4R408 cc [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
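For reference, the Scala-only usage the ticket refers to looks like the sketch below; it assumes Spark 2.2+, where the {{csv(Dataset[String])}} overload from SPARK-15463 exists, and uses toy data.

{code}
// Scala usage of DataFrameReader.csv(Dataset[String]) (added by SPARK-15463);
// the ticket asks for a pyspark equivalent. Toy data for illustration only.
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().appName("csv-from-dataset").getOrCreate()
import spark.implicits._

// CSV lines already in memory (e.g. produced by an earlier transformation),
// parsed directly without writing them out to files first.
val csvLines: Dataset[String] = Seq("a,b", "1,2", "1,3").toDS()
val df = spark.read.option("header", "true").csv(csvLines)
df.show()
{code}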
[jira] [Created] (SPARK-21962) Distributed Tracing in Spark
Andrew Ash created SPARK-21962: -- Summary: Distributed Tracing in Spark Key: SPARK-21962 URL: https://issues.apache.org/jira/browse/SPARK-21962 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.2.0 Reporter: Andrew Ash Spark should support distributed tracing, which is the mechanism, widely popularized by Google in the [Dapper Paper|https://research.google.com/pubs/pub36356.html], where network requests have additional metadata used for tracing requests between services. This would be useful for me since I have OpenZipkin style tracing in my distributed application up to the Spark driver, and from the executors out to my other services, but the link is broken in Spark between driver and executor since the Span IDs aren't propagated across that link. An initial implementation could instrument the most important network calls with trace ids (like launching and finishing tasks), and incrementally add more tracing to other calls (torrent block distribution, external shuffle service, etc) as the feature matures. Search keywords: Dapper, Brave, OpenZipkin, HTrace -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
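Spark has no tracing support today, but an application can already approximate the driver-to-executor hop with public APIs; a sketch follows, where the property key {{x.trace.spanId}} is made up for the example.

{code}
// Not built-in tracing support -- just one way an application can hand a span
// id from driver to executors today. The key "x.trace.spanId" is made up.
import org.apache.spark.{SparkContext, TaskContext}

def runTraced(sc: SparkContext, currentSpanId: String): Long = {
  // Local properties set on the submitting driver thread ship with its tasks.
  sc.setLocalProperty("x.trace.spanId", currentSpanId)
  sc.parallelize(1 to 100, 4).map { i =>
    // Visible inside the task on the executor; a tracer could open a child
    // span here around the real work.
    val spanId = TaskContext.get().getLocalProperty("x.trace.spanId")
    i
  }.count()
}
{code}

A first-class implementation would instead tag the scheduler's own RPC messages, as the ticket describes.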
[jira] [Created] (SPARK-21953) Show both memory and disk bytes spilled if either is present
Andrew Ash created SPARK-21953: -- Summary: Show both memory and disk bytes spilled if either is present Key: SPARK-21953 URL: https://issues.apache.org/jira/browse/SPARK-21953 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.2.0 Reporter: Andrew Ash Priority: Minor https://github.com/apache/spark/commit/a1f0992faefbe042a9cb7a11842a817c958e4797#diff-fa4cfb2cce1b925f55f41f2dfa8c8501R61 should be {{||}} not {{&&}} As written now, there must be both memory and disk bytes spilled to show either of them. If there is only one of those types of spill recorded, it will be hidden. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
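The change the ticket asks for amounts to a one-operator swap; an illustrative rendering is below, where the names are not the exact ones at the linked commit.

{code}
// Illustrative only -- names are not the exact ones at the linked commit; the
// point is the operator.
def shouldShowSpillColumns(memoryBytesSpilled: Long, diskBytesSpilled: Long): Boolean = {
  // Behaviour described in the ticket: with &&, a task that spilled only to
  // disk (or only in memory) shows neither column.
  //   memoryBytesSpilled > 0 && diskBytesSpilled > 0
  // Intended behaviour: show the columns if either kind of spill is present.
  memoryBytesSpilled > 0 || diskBytesSpilled > 0
}
{code}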
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156622#comment-16156622 ] Andrew Ash commented on SPARK-12449: [~velvia] I'm not involved with the CatalystSource or SAP HANAVora, so can't comment on the direction that project is going right now. However there is an effort to add a new Datasources V2 API happening at https://issues.apache.org/jira/browse/SPARK-15689 and on the email list right now that could grow to encompass the goals of this issue. [~stephank85] if you are able to comment on SPARK-15689 your input would be very valuable to that API design. > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
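For readers following the cross-reference, a whole-plan pushdown hook has roughly the shape sketched below; this is not the actual CatalystSource interface from the attached design document.

{code}
// Rough sketch of a whole-plan pushdown hook; not the actual CatalystSource
// interface from the attached design document.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

trait WholePlanPushdownSource {
  // Whether the source (e.g. a SQL engine) can evaluate this plan itself;
  // unsupported plans fall back to Spark's own execution.
  def supportsLogicalPlan(plan: LogicalPlan): Boolean

  // Run the plan inside the source and hand back rows, so only the reduced
  // result (e.g. an aggregate) is transferred to Spark.
  def executeLogicalPlan(plan: LogicalPlan): RDD[Row]
}
{code}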
[jira] [Created] (SPARK-21941) Stop storing unused attemptId in SQLTaskMetrics
Andrew Ash created SPARK-21941: -- Summary: Stop storing unused attemptId in SQLTaskMetrics Key: SPARK-21941 URL: https://issues.apache.org/jira/browse/SPARK-21941 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.2.0 Reporter: Andrew Ash Currently SQLTaskMetrics has a long attemptId field on it that is unused, with a TODO saying to populate the value in the future. We should save this memory by leaving the TODO but taking the unused field out of the class. I have a driver that heap dumped on OOM and has 390,105 instances of SQLTaskMetric -- removing this 8 bytes field will save roughly 390k*8 = 3.1MB of heap space. It's not going to fix my OOM, but there's no reason to put this pressure on the GC if we don't get anything by storing it. https://github.com/apache/spark/blob/v2.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/ui/SQLListener.scala#L485 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
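The memory arithmetic from the description, written out; the classes below are a simplified stand-in, not the actual SQLTaskMetrics definition in SQLListener.scala.

{code}
// Simplified stand-in for the memory argument only; not the real class.
case class MetricsWithAttempt(attemptId: Long, finished: Boolean) // carries the unused 8-byte long
case class MetricsSlim(finished: Boolean)                         // keeps only what is used

// The ticket's arithmetic: one unused long per live instance.
val instances = 390105L
val savedBytes = instances * 8L               // 3,120,840 bytes
println(f"${savedBytes / 1e6}%.1f MB saved")  // ~3.1 MB
{code}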
[jira] [Commented] (SPARK-21807) The getAliasedConstraints function in LogicalPlan will take a long time when number of expressions is greater than 100
[ https://issues.apache.org/jira/browse/SPARK-21807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149600#comment-16149600 ] Andrew Ash commented on SPARK-21807: For reference, here's a stacktrace I'm seeing on a cluster before this change that I think this PR will improve: {noformat} "spark-task-4" #714 prio=5 os_prio=0 tid=0x7fa368031000 nid=0x4d91 runnable [0x7fa24e592000] java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.catalyst.expressions.AttributeReference.equals(namedExpressions.scala:220) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.Add.equals(arithmetic.scala:149) at org.apache.spark.sql.catalyst.expressions.EqualNullSafe.equals(predicates.scala:505) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:151) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:40) at scala.collection.mutable.FlatHashTable$class.growTable(FlatHashTable.scala:225) at scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:159) at scala.collection.mutable.HashSet.addEntry(HashSet.scala:40) at scala.collection.mutable.FlatHashTable$class.addElem(FlatHashTable.scala:139) at scala.collection.mutable.HashSet.addElem(HashSet.scala:40) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:59) at scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:40) at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) at scala.collection.mutable.AbstractSet.$plus$plus$eq(Set.scala:46) at scala.collection.mutable.HashSet.clone(HashSet.scala:83) at 
scala.collection.mutable.HashSet.clone(HashSet.scala:40) at org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:65) at org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:50) at scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141) at scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316) at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972) at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972) at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972) at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) at scala.collection.AbstractTraversable.foldLeft(Traversable.scala:104) at scala.collection.TraversableOnce$class.$div$colon(TraversableOn
[jira] [Commented] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
[ https://issues.apache.org/jira/browse/SPARK-21875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147976#comment-16147976 ] Andrew Ash commented on SPARK-21875: I'd be interested in more details on why it can't be run in the PR builder -- I have the full `./dev/run-tests` running in CI and it catches things like this occasionally PR at https://github.com/apache/spark/pull/19088 > Jenkins passes Java code that violates ./dev/lint-java > -- > > Key: SPARK-21875 > URL: https://issues.apache.org/jira/browse/SPARK-21875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Trivial > > Two recent PRs merged which caused lint-java errors: > {noformat} > > Running Java style checks > > Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn > Checkstyle checks failed at following occurrences: > [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] > (sizes) LineLength: Line is longer than 100 characters (found 106). > [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 > {noformat} > The first error is from https://github.com/apache/spark/pull/19025 and the > second is from https://github.com/apache/spark/pull/18488 > Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21875) Jenkins passes Java code that violates ./dev/lint-java
Andrew Ash created SPARK-21875: -- Summary: Jenkins passes Java code that violates ./dev/lint-java Key: SPARK-21875 URL: https://issues.apache.org/jira/browse/SPARK-21875 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.0 Reporter: Andrew Ash Two recent PRs merged which caused lint-java errors: {noformat} Running Java style checks Using `mvn` from path: /home/ubuntu/spark/build/apache-maven-3.5.0/bin/mvn Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/memory/TaskMemoryManager.java:[77] (sizes) LineLength: Line is longer than 100 characters (found 106). [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[1340] (sizes) LineLength: Line is longer than 100 characters (found 106). [error] running /home/ubuntu/spark/dev/lint-java ; received return code 1 {noformat} The first error is from https://github.com/apache/spark/pull/19025 and the second is from https://github.com/apache/spark/pull/18488 Should we be expecting Jenkins to enforce Java code style pre-commit? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15689) Data source API v2
[ https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138509#comment-16138509 ] Andrew Ash commented on SPARK-15689: Can the authors of this document add a section contrasting the approach with the one from https://issues.apache.org/jira/browse/SPARK-12449 ? In that approach the data source receives an entire arbitrary plan rather than just parts of the plan. > Data source API v2 > -- > > Key: SPARK-15689 > URL: https://issues.apache.org/jira/browse/SPARK-15689 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > Attachments: SPIP Data Source API V2.pdf > > > This ticket tracks progress in creating the v2 of data source API. This new > API should focus on: > 1. Have a small surface so it is easy to freeze and maintain compatibility > for a long time. Ideally, this API should survive architectural rewrites and > user-facing API revamps of Spark. > 2. Have a well-defined column batch interface for high performance. > Convenience methods should exist to convert row-oriented formats into column > batches for data source developers. > 3. Still support filter push down, similar to the existing API. > 4. Nice-to-have: support additional common operators, including limit and > sampling. > Note that both 1 and 2 are problems that the current data source API (v1) > suffers. The current data source API has a wide surface with dependency on > DataFrame/SQLContext, making the data source API compatibility depending on > the upper level API. The current data source API is also only row oriented > and has to go through an expensive external data type conversion to internal > data type. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
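As a point of contrast with the whole-plan approach of SPARK-12449, a small-surface read API has roughly the shape sketched below; these names are hypothetical and are not the API in the attached SPIP.

{code}
// Hypothetical shape only, for contrast with whole-plan pushdown; these names
// are not the actual Data Source V2 API from the attached SPIP.
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

trait SimpleBatchReadSource {
  // Small, stable surface: a schema plus partitioned readers, with no
  // dependence on DataFrame/SQLContext internals.
  def schema(): StructType

  // The source only sees column pruning and filter pushdown, not whole plans.
  def planSplits(requiredColumns: Array[String], filters: Array[Filter]): Seq[SplitReaderFactory]
}

trait SplitReaderFactory extends Serializable {
  // A row iterator keeps the sketch simple; the SPIP additionally asks for a
  // column-batch interface for performance.
  def createReader(): Iterator[Row]
}
{code}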
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138501#comment-16138501 ] Andrew Ash commented on SPARK-12449: Relevant slides: https://www.slideshare.net/SparkSummit/the-pushdown-of-everything-by-stephan-kessler-and-santiago-mola > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > to push down filters and projects pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL Engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow such kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows to > defer the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining details. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final
[ https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136371#comment-16136371 ] Andrew Ash commented on SPARK-19552: I didn't see anything other than the issue you just commented on at https://issues.apache.org/jira/browse/ARROW-292 > Upgrade Netty version to 4.1.8 final > > > Key: SPARK-19552 > URL: https://issues.apache.org/jira/browse/SPARK-19552 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.1.0 >Reporter: Adam Roberts >Priority: Minor > > Netty 4.1.8 was recently released but isn't API compatible with previous > major versions (like Netty 4.0.x), see > http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details. > This version does include a fix for a security concern but not one we'd be > exposed to with Spark "out of the box". Let's upgrade the version we use to > be on the safe side as the security fix I'm especially interested in is not > available in the 4.0.x release line. > We should move up anyway to take on a bunch of other big fixes cited in the > release notes (and if anyone were to use Spark with netty and tcnative, they > shouldn't be exposed to the security problem) - we should be good citizens > and make this change. > As this 4.1 version involves API changes we'll need to implement a few > methods and possibly adjust the Sasl tests. This JIRA and associated pull > request starts the process which I'll work on - and any help would be much > appreciated! Currently I know: > {code} > @Override > public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise > promise) > throws Exception { > if (!foundEncryptionHandler) { > foundEncryptionHandler = > ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this > returns false and causes test failures > } > ctx.write(msg, promise); > } > {code} > Here's what changes will be required (at least): > {code} > common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code} > requires touch, retain and transferred methods > {code} > common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code} > requires the above methods too > {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code} > With "dummy" implementations so we can at least compile and test, we'll see > five new test failures to address. > These are > {code} > org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption > org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption > org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption > org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption > org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19552) Upgrade Netty version to 4.1.8 final
[ https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16134657#comment-16134657 ] Andrew Ash commented on SPARK-19552: Heads up the next time someone attempts this: Upgrading to 4.1.x causes a few Apache Arrow-related integration test failures in Spark, because Arrow depends on a part of Netty 4.0.x that has changed in the 4.1.x series. I've been running with Netty 4.1.x on my fork for a few months but due to recent Arrow changes will now have to downgrade back to Netty 4.0.x due to this Arrow dependency. More details at https://github.com/palantir/spark/pull/247#issuecomment-323469174 So when Spark does go to 4.1.x we might need Arrow to bump its dependencies as well (or shade netty in Spark at the same time as Sean suggests, which I generally support). > Upgrade Netty version to 4.1.8 final > > > Key: SPARK-19552 > URL: https://issues.apache.org/jira/browse/SPARK-19552 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.1.0 >Reporter: Adam Roberts >Priority: Minor > > Netty 4.1.8 was recently released but isn't API compatible with previous > major versions (like Netty 4.0.x), see > http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details. > This version does include a fix for a security concern but not one we'd be > exposed to with Spark "out of the box". Let's upgrade the version we use to > be on the safe side as the security fix I'm especially interested in is not > available in the 4.0.x release line. > We should move up anyway to take on a bunch of other big fixes cited in the > release notes (and if anyone were to use Spark with netty and tcnative, they > shouldn't be exposed to the security problem) - we should be good citizens > and make this change. > As this 4.1 version involves API changes we'll need to implement a few > methods and possibly adjust the Sasl tests. This JIRA and associated pull > request starts the process which I'll work on - and any help would be much > appreciated! Currently I know: > {code} > @Override > public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise > promise) > throws Exception { > if (!foundEncryptionHandler) { > foundEncryptionHandler = > ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this > returns false and causes test failures > } > ctx.write(msg, promise); > } > {code} > Here's what changes will be required (at least): > {code} > common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code} > requires touch, retain and transferred methods > {code} > common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code} > requires the above methods too > {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code} > With "dummy" implementations so we can at least compile and test, we'll see > five new test failures to address. 
> These are > {code} > org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption > org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption > org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption > org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption > org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21757) Jobs page fails to load when executor removed event's reason contains single quote
[ https://issues.apache.org/jira/browse/SPARK-21757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-21757: --- Description: At the following two places if the {{e.reason}} value contains a single quote character, then the rendered JSON is invalid. https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala?utf8=%E2%9C%93#L158 https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala#L127 A quick fix could be to do replacement of {{'}} with html entity {noformat}'{noformat} and a longer term fix could be to not assemble JSON via string concatenation, and instead use a real JSON serialization library in those two places. was: At the following two places if the {{e.reason}} value contains a single quote character, then the rendered JSON is invalid. https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala?utf8=%E2%9C%93#L158 https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala#L127 A quick fix could be to do replacement of {{'}} with html entity {{'}} and a longer term fix could be to not assemble JSON via string concatenation, and instead use a real JSON serialization library in those two places. > Jobs page fails to load when executor removed event's reason contains single > quote > -- > > Key: SPARK-21757 > URL: https://issues.apache.org/jira/browse/SPARK-21757 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: Andrew Ash > > At the following two places if the {{e.reason}} value contains a single quote > character, then the rendered JSON is invalid. > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala?utf8=%E2%9C%93#L158 > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala#L127 > A quick fix could be to do replacement of {{'}} with html entity > {noformat}'{noformat} and a longer term fix could be to not assemble JSON > via string concatenation, and instead use a real JSON serialization library > in those two places. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21757) Jobs page fails to load when executor removed event's reason contains single quote
Andrew Ash created SPARK-21757: -- Summary: Jobs page fails to load when executor removed event's reason contains single quote Key: SPARK-21757 URL: https://issues.apache.org/jira/browse/SPARK-21757 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.2.0 Reporter: Andrew Ash At the following two places if the {{e.reason}} value contains a single quote character, then the rendered JSON is invalid. https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/AllJobsPage.scala?utf8=%E2%9C%93#L158 https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/ui/jobs/JobPage.scala#L127 A quick fix could be to do replacement of {{'}} with html entity {{'}} and a longer term fix could be to not assemble JSON via string concatenation, and instead use a real JSON serialization library in those two places. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
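The two options in the description, sketched below; {{reason}} stands in for {{e.reason}}, the example value is made up, and the surrounding UI code that builds the page is omitted. The json4s DSL used here is a library Spark already depends on.

{code}
// Sketch of the two fixes described above; `reason` stands in for e.reason.
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods.{compact, render}

val reason = "executor lost: reason contains a ' character"  // made-up example

// Quick fix: escape the single quote before embedding the value in a
// single-quoted JavaScript string literal.
val escaped = reason.replace("'", "&#39;")

// Longer-term fix: build the JSON with a real serializer rather than by
// string concatenation, so quoting is handled for us.
val json = compact(render(("event" -> "ExecutorRemoved") ~ ("reason" -> reason)))
{code}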
[jira] [Commented] (SPARK-21564) TaskDescription decoding failure should fail the task
[ https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122651#comment-16122651 ] Andrew Ash commented on SPARK-21564: [~irashid] a possible fix could look roughly like this: {noformat} diff --git a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala index a2f1aa22b0..06d72fe106 100644 --- a/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala +++ b/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala @@ -17,6 +17,7 @@ package org.apache.spark.executor +import java.io.{DataInputStream, NotSerializableException} import java.net.URL import java.nio.ByteBuffer import java.util.Locale @@ -35,7 +36,7 @@ import org.apache.spark.rpc._ import org.apache.spark.scheduler.{ExecutorLossReason, TaskDescription} import org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages._ import org.apache.spark.serializer.SerializerInstance -import org.apache.spark.util.{ThreadUtils, Utils} +import org.apache.spark.util.{ByteBufferInputStream, ThreadUtils, Utils} private[spark] class CoarseGrainedExecutorBackend( override val rpcEnv: RpcEnv, @@ -93,9 +94,26 @@ private[spark] class CoarseGrainedExecutorBackend( if (executor == null) { exitExecutor(1, "Received LaunchTask command but executor was null") } else { -val taskDesc = TaskDescription.decode(data.value) -logInfo("Got assigned task " + taskDesc.taskId) -executor.launchTask(this, taskDesc) +try { + val taskDesc = TaskDescription.decode(data.value) + logInfo("Got assigned task " + taskDesc.taskId) + executor.launchTask(this, taskDesc) +} catch { + case e: Exception => +val taskId = new DataInputStream(new ByteBufferInputStream( + ByteBuffer.wrap(data.value.array(.readLong() +val ser = env.closureSerializer.newInstance() +val serializedTaskEndReason = { + try { +ser.serialize(new ExceptionFailure(e, Nil)) + } catch { +case _: NotSerializableException => + // e is not serializable so just send the stacktrace + ser.serialize(new ExceptionFailure(e, Nil, false)) + } +} +statusUpdate(taskId, TaskState.FAILED, serializedTaskEndReason) +} } case KillTask(taskId, _, interruptThread, reason) => {noformat} The downside here though is that we're still making the assumption that the TaskDescription is well-formatted enough to be able to get the taskId out of it (the first long in the serialized bytes). Any other thoughts on how to do this? 
> TaskDescription decoding failure should fail the task > - > > Key: SPARK-21564 > URL: https://issues.apache.org/jira/browse/SPARK-21564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andrew Ash > > cc [~robert3005] > I was seeing an issue where Spark was throwing this exception: > {noformat} > 16:16:28.294 [dispatcher-event-loop-14] ERROR > org.apache.spark.rpc.netty.Inbox - Ignoring error > java.io.EOFException: null > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at java.io.DataInputStream.readUTF(DataInputStream.java:609) > at java.io.DataInputStream.readUTF(DataInputStream.java:564) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at > org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > For details on the cause of that exception, see SPARK-21563 > We've since changed the application and have a proposed fix in Spark at the > ticket above, but it was troubling that decodi
[jira] [Commented] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars
[ https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122549#comment-16122549 ] Andrew Ash commented on SPARK-21563: Thanks for the thoughts [~irashid] -- I submitted a PR implementing this approach at https://github.com/apache/spark/pull/18913 > Race condition when serializing TaskDescriptions and adding jars > > > Key: SPARK-21563 > URL: https://issues.apache.org/jira/browse/SPARK-21563 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.2.0 >Reporter: Andrew Ash > > cc [~robert3005] > I was seeing this exception during some running Spark jobs: > {noformat} > 16:16:28.294 [dispatcher-event-loop-14] ERROR > org.apache.spark.rpc.netty.Inbox - Ignoring error > java.io.EOFException: null > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at java.io.DataInputStream.readUTF(DataInputStream.java:609) > at java.io.DataInputStream.readUTF(DataInputStream.java:564) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at > org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > After some debugging, we determined that this is due to a race condition in > task serde. cc [~irashid] [~kayousterhout] who last touched that code in > SPARK-19796 > The race is between adding additional jars to the SparkContext and > serializing the TaskDescription. > Consider this sequence of events: > - TaskSetManager creates a TaskDescription using a reference to the > SparkContext's jars: > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506 > - TaskDescription starts serializing, and begins writing jars: > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84 > - the size of the jar map is written out: > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63 > - _on another thread_: the application adds a jar to the SparkContext's jars > list > - then the entries in the jars list are serialized out: > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64 > The problem now is that the jars list is serialized as having N entries, but > actually N+1 entries follow that count! > This causes task deserialization to fail in the executor, with the stacktrace > above. > The same issue also likely exists for files, though I haven't observed that > and our application does not stress that codepath the same way it did for jar > additions. 
> One fix here is that TaskSetManager could make an immutable copy of the jars > list that it passes into the TaskDescription constructor, so that list > doesn't change mid-serialization. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
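The proposed fix, reduced to its core idea; the class below is a simplified stand-in for the driver-side state, not the TaskSetManager code itself.

{code}
// Simplified stand-in for the driver-side state, not TaskSetManager itself;
// the point is to snapshot the mutable maps once per task so the serialized
// entry count always matches the entries written after it.
import scala.collection.mutable

class DriverSideState {
  // Updated concurrently by sc.addJar / sc.addFile while tasks are launching.
  val addedJars = new mutable.HashMap[String, Long]()
  val addedFiles = new mutable.HashMap[String, Long]()

  def snapshotForTask(): (Map[String, Long], Map[String, Long]) = {
    // toMap copies the entries into immutable maps; a later addJar can no
    // longer change what this particular TaskDescription serializes.
    (addedJars.toMap, addedFiles.toMap)
  }
}
{code}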
[jira] [Commented] (SPARK-19116) LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
[ https://issues.apache.org/jira/browse/SPARK-19116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120643#comment-16120643 ] Andrew Ash commented on SPARK-19116: Ah yes, for files it seems like Spark currently uses size of the parquet files on disk, rather than estimating in-memory size by multiplying the sum of the column type sizes by the row count. > LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file > - > > Key: SPARK-19116 > URL: https://issues.apache.org/jira/browse/SPARK-19116 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.1, 2.0.2 > Environment: Python 3.5.x > Windows 10 >Reporter: Shea Parkes > > We're having some modestly severe issues with broadcast join inference, and > I've been chasing them through the join heuristics in the catalyst engine. > I've made it as far as I can, and I've hit upon something that does not make > any sense to me. > I thought that loading from parquet would be a RelationPlan, which would just > use the sum of default sizeInBytes for each column times the number of rows. > But this trivial example shows that I am not correct: > {code} > import pyspark.sql.functions as F > df_range = session.range(100).select(F.col('id').cast('integer')) > df_range.write.parquet('c:/scratch/hundred_integers.parquet') > df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet') > df_parquet.explain(True) > # Expected sizeInBytes > integer_default_sizeinbytes = 4 > print(df_parquet.count() * integer_default_sizeinbytes) # = 400 > # Inferred sizeInBytes > print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes()) # = 2318 > # For posterity (Didn't really expect this to match anything above) > print(df_range._jdf.logicalPlan().statistics().sizeInBytes()) # = 600 > {code} > And here's the results of explain(True) on df_parquet: > {code} > In [456]: == Parsed Logical Plan == > Relation[id#794] parquet > == Analyzed Logical Plan == > id: int > Relation[id#794] parquet > == Optimized Logical Plan == > Relation[id#794] parquet > == Physical Plan == > *BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: > file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} > So basically, I'm not understanding well how the size of the parquet file is > being estimated. I don't expect it to be extremely accurate, but empirically > it's so inaccurate that we're having to mess with autoBroadcastJoinThreshold > way too much. (It's not always too high like the example above, it's often > way too low.) > Without deeper understanding, I'm considering a result of 2318 instead of 400 > to be a bug. My apologies if I'm missing something obvious. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
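A Scala rendering of the pyspark check quoted above, assuming Spark 2.0/2.1 (where the logical plan exposes {{statistics}}, the same call the pyspark snippet reaches via {{_jdf}}) and a SparkSession named {{spark}}; the path is a placeholder.

{code}
// Scala rendering of the pyspark check quoted above. Assumes Spark 2.0/2.1
// and a SparkSession named `spark`; the path is a placeholder.
val dfParquet = spark.read.parquet("/tmp/hundred_integers.parquet")

// Naive in-memory estimate: 100 rows * 4 bytes for an IntegerType column.
val naiveEstimate = dfParquet.count() * 4

// What Spark reports: the estimate is based on the Parquet files' on-disk
// size (footers, encoding, compression included), so it will not match the
// naive figure above.
val reported = dfParquet.queryExecution.optimizedPlan.statistics.sizeInBytes
println(s"naive=$naiveEstimate reported=$reported")
{code}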
[jira] [Closed] (SPARK-19116) LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
[ https://issues.apache.org/jira/browse/SPARK-19116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-19116. -- Resolution: Not A Problem > LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file > - > > Key: SPARK-19116 > URL: https://issues.apache.org/jira/browse/SPARK-19116 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.1, 2.0.2 > Environment: Python 3.5.x > Windows 10 >Reporter: Shea Parkes > > We're having some modestly severe issues with broadcast join inference, and > I've been chasing them through the join heuristics in the catalyst engine. > I've made it as far as I can, and I've hit upon something that does not make > any sense to me. > I thought that loading from parquet would be a RelationPlan, which would just > use the sum of default sizeInBytes for each column times the number of rows. > But this trivial example shows that I am not correct: > {code} > import pyspark.sql.functions as F > df_range = session.range(100).select(F.col('id').cast('integer')) > df_range.write.parquet('c:/scratch/hundred_integers.parquet') > df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet') > df_parquet.explain(True) > # Expected sizeInBytes > integer_default_sizeinbytes = 4 > print(df_parquet.count() * integer_default_sizeinbytes) # = 400 > # Inferred sizeInBytes > print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes()) # = 2318 > # For posterity (Didn't really expect this to match anything above) > print(df_range._jdf.logicalPlan().statistics().sizeInBytes()) # = 600 > {code} > And here's the results of explain(True) on df_parquet: > {code} > In [456]: == Parsed Logical Plan == > Relation[id#794] parquet > == Analyzed Logical Plan == > id: int > Relation[id#794] parquet > == Optimized Logical Plan == > Relation[id#794] parquet > == Physical Plan == > *BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: > file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} > So basically, I'm not understanding well how the size of the parquet file is > being estimated. I don't expect it to be extremely accurate, but empirically > it's so inaccurate that we're having to mess with autoBroadcastJoinThreshold > way too much. (It's not always too high like the example above, it's often > way too low.) > Without deeper understanding, I'm considering a result of 2318 instead of 400 > to be a bug. My apologies if I'm missing something obvious. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19116) LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file
[ https://issues.apache.org/jira/browse/SPARK-19116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114751#comment-16114751 ] Andrew Ash commented on SPARK-19116: [~shea.parkes] does this answer your question? > LogicalPlan.statistics.sizeInBytes wrong for trivial parquet file > - > > Key: SPARK-19116 > URL: https://issues.apache.org/jira/browse/SPARK-19116 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.1, 2.0.2 > Environment: Python 3.5.x > Windows 10 >Reporter: Shea Parkes > > We're having some modestly severe issues with broadcast join inference, and > I've been chasing them through the join heuristics in the catalyst engine. > I've made it as far as I can, and I've hit upon something that does not make > any sense to me. > I thought that loading from parquet would be a RelationPlan, which would just > use the sum of default sizeInBytes for each column times the number of rows. > But this trivial example shows that I am not correct: > {code} > import pyspark.sql.functions as F > df_range = session.range(100).select(F.col('id').cast('integer')) > df_range.write.parquet('c:/scratch/hundred_integers.parquet') > df_parquet = session.read.parquet('c:/scratch/hundred_integers.parquet') > df_parquet.explain(True) > # Expected sizeInBytes > integer_default_sizeinbytes = 4 > print(df_parquet.count() * integer_default_sizeinbytes) # = 400 > # Inferred sizeInBytes > print(df_parquet._jdf.logicalPlan().statistics().sizeInBytes()) # = 2318 > # For posterity (Didn't really expect this to match anything above) > print(df_range._jdf.logicalPlan().statistics().sizeInBytes()) # = 600 > {code} > And here's the results of explain(True) on df_parquet: > {code} > In [456]: == Parsed Logical Plan == > Relation[id#794] parquet > == Analyzed Logical Plan == > id: int > Relation[id#794] parquet > == Optimized Logical Plan == > Relation[id#794] parquet > == Physical Plan == > *BatchedScan parquet [id#794] Format: ParquetFormat, InputPaths: > file:/c:/scratch/hundred_integers.parquet, PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} > So basically, I'm not understanding well how the size of the parquet file is > being estimated. I don't expect it to be extremely accurate, but empirically > it's so inaccurate that we're having to mess with autoBroadcastJoinThreshold > way too much. (It's not always too high like the example above, it's often > way too low.) > Without deeper understanding, I'm considering a result of 2318 instead of 400 > to be a bug. My apologies if I'm missing something obvious. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20433) Update jackson-databind to 2.6.7.1
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108117#comment-16108117 ] Andrew Ash commented on SPARK-20433: Sorry about not updating the ticket description -- the 2.6.7.1 release came after the initial batch of patches for projects still on older versions of Jackson. I've now opened PR https://github.com/apache/spark/pull/18789 where we can further discuss the risk of the version bump. > Update jackson-databind to 2.6.7.1 > -- > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash >Priority: Minor > > There was a security vulnerability recently reported to the upstream > jackson-databind project at > https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix > released. > From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the > first fixed versions in their respective 2.X branches, and versions in the > 2.6.X line and earlier remain vulnerable. UPDATE: now the 2.6.X line has a > patch as well: 2.6.7.1 as mentioned at > https://github.com/FasterXML/jackson-databind/issues/1599#issuecomment-315486340 > Right now Spark master branch is on 2.6.5: > https://github.com/apache/spark/blob/master/pom.xml#L164 > and Hadoop branch-2.7 is on 2.2.3: > https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 > and Hadoop branch-3.0.0-alpha2 is on 2.7.8: > https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 > We should bump Spark from 2.6.5 to 2.6.7.1 to get a patched version of this > library for the next Spark release. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20433) Update jackson-databind to 2.6.7.1
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-20433: --- Description: There was a security vulnerability recently reported to the upstream jackson-databind project at https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix released. >From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the >first fixed versions in their respectful 2.X branches, and versions in the >2.6.X line and earlier remain vulnerable. UPDATE: now the 2.6.X line has a >patch as well: 2.6.7.1 as mentioned at >https://github.com/FasterXML/jackson-databind/issues/1599#issuecomment-315486340 Right now Spark master branch is on 2.6.5: https://github.com/apache/spark/blob/master/pom.xml#L164 and Hadoop branch-2.7 is on 2.2.3: https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 and Hadoop branch-3.0.0-alpha2 is on 2.7.8: https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 We should bump Spark from 2.6.5 to 2.6.7.1 to get a patched version of this library for the next Spark release. was: There was a security vulnerability recently reported to the upstream jackson-databind project at https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix released. >From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the >first fixed versions in their respectful 2.X branches, and versions in the >2.6.X line and earlier remain vulnerable. Right now Spark master branch is on 2.6.5: https://github.com/apache/spark/blob/master/pom.xml#L164 and Hadoop branch-2.7 is on 2.2.3: https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 and Hadoop branch-3.0.0-alpha2 is on 2.7.8: https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 We should try to find to find a way to get on a patched version of jackson-bind for the Spark 2.2.0 release. > Update jackson-databind to 2.6.7.1 > -- > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash >Priority: Minor > > There was a security vulnerability recently reported to the upstream > jackson-databind project at > https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix > released. > From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the > first fixed versions in their respectful 2.X branches, and versions in the > 2.6.X line and earlier remain vulnerable. UPDATE: now the 2.6.X line has a > patch as well: 2.6.7.1 as mentioned at > https://github.com/FasterXML/jackson-databind/issues/1599#issuecomment-315486340 > Right now Spark master branch is on 2.6.5: > https://github.com/apache/spark/blob/master/pom.xml#L164 > and Hadoop branch-2.7 is on 2.2.3: > https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 > and Hadoop branch-3.0.0-alpha2 is on 2.7.8: > https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 > We should bump Spark from 2.6.5 to 2.6.7.1 to get a patched version of this > library for the next Spark release. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20433) Update jackson-databind to 2.6.7.1
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-20433: --- Summary: Update jackson-databind to 2.6.7.1 (was: Update jackson-databind to 2.6.7) > Update jackson-databind to 2.6.7.1 > -- > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash >Priority: Minor > > There was a security vulnerability recently reported to the upstream > jackson-databind project at > https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix > released. > From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the > first fixed versions in their respectful 2.X branches, and versions in the > 2.6.X line and earlier remain vulnerable. > Right now Spark master branch is on 2.6.5: > https://github.com/apache/spark/blob/master/pom.xml#L164 > and Hadoop branch-2.7 is on 2.2.3: > https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 > and Hadoop branch-3.0.0-alpha2 is on 2.7.8: > https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 > We should try to find to find a way to get on a patched version of > jackson-bind for the Spark 2.2.0 release. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-20433) Update jackson-databind to 2.6.7
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash reopened SPARK-20433: > Update jackson-databind to 2.6.7 > > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash >Priority: Minor > > There was a security vulnerability recently reported to the upstream > jackson-databind project at > https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix > released. > From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the > first fixed versions in their respectful 2.X branches, and versions in the > 2.6.X line and earlier remain vulnerable. > Right now Spark master branch is on 2.6.5: > https://github.com/apache/spark/blob/master/pom.xml#L164 > and Hadoop branch-2.7 is on 2.2.3: > https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 > and Hadoop branch-3.0.0-alpha2 is on 2.7.8: > https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 > We should try to find to find a way to get on a patched version of > jackson-bind for the Spark 2.2.0 release. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20433) Security issue with jackson-databind
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105951#comment-16105951 ] Andrew Ash commented on SPARK-20433: As I wrote in that PR, it's 2.6.7.1 of jackson-databind that has the fix, and the Jackson project did not publish a corresponding 2.6.7.1 of the other components of Jackson. This affects Spark because a known vulnerable library is on the classpath at runtime. So you can only guarantee that Spark isn't vulnerable by removing the vulnerable code from the runtime classpath. Anyways a Jackson bump to a fixed version will likely be picked up by Apache Spark the next time Jackson is upgraded so I trust this will get fixed eventually regardless of whether Apache takes the hotfix version now or a regular release in the future. > Security issue with jackson-databind > > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash > Labels: security > > There was a security vulnerability recently reported to the upstream > jackson-databind project at > https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix > released. > From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the > first fixed versions in their respectful 2.X branches, and versions in the > 2.6.X line and earlier remain vulnerable. > Right now Spark master branch is on 2.6.5: > https://github.com/apache/spark/blob/master/pom.xml#L164 > and Hadoop branch-2.7 is on 2.2.3: > https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 > and Hadoop branch-3.0.0-alpha2 is on 2.7.8: > https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 > We should try to find to find a way to get on a patched version of > jackson-bind for the Spark 2.2.0 release. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars
[ https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105772#comment-16105772 ] Andrew Ash commented on SPARK-21563: And for reference, I added this additional logging to assist in debugging: https://github.com/palantir/spark/pull/238 > Race condition when serializing TaskDescriptions and adding jars > > > Key: SPARK-21563 > URL: https://issues.apache.org/jira/browse/SPARK-21563 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andrew Ash > > cc [~robert3005] > I was seeing this exception during some running Spark jobs: > {noformat} > 16:16:28.294 [dispatcher-event-loop-14] ERROR > org.apache.spark.rpc.netty.Inbox - Ignoring error > java.io.EOFException: null > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at java.io.DataInputStream.readUTF(DataInputStream.java:609) > at java.io.DataInputStream.readUTF(DataInputStream.java:564) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) > at > org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at > org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:748) > {noformat} > After some debugging, we determined that this is due to a race condition in > task serde. cc [~irashid] [~kayousterhout] who last touched that code in > SPARK-19796 > The race is between adding additional jars to the SparkContext and > serializing the TaskDescription. > Consider this sequence of events: > - TaskSetManager creates a TaskDescription using a reference to the > SparkContext's jars: > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506 > - TaskDescription starts serializing, and begins writing jars: > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84 > - the size of the jar map is written out: > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63 > - _on another thread_: the application adds a jar to the SparkContext's jars > list > - then the entries in the jars list are serialized out: > https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64 > The problem now is that the jars list is serialized as having N entries, but > actually N+1 entries follow that count! > This causes task deserialization to fail in the executor, with the stacktrace > above. > The same issue also likely exists for files, though I haven't observed that > and our application does not stress that codepath the same way it did for jar > additions. 
> One fix here is that TaskSetManager could make an immutable copy of the jars > list that it passes into the TaskDescription constructor, so that list > doesn't change mid-serialization. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
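For illustration, here is a self-contained sketch of the count-then-entries encoding described above and of the proposed snapshot fix. The object and method names are invented for the example; this is not Spark's actual TaskDescription code.
{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}
import scala.collection.mutable

// Mirrors the layout the race corrupts: the size is written first, then the
// entries. If the underlying mutable map grows in between, the stream claims
// N entries but actually contains N+1, which is what the executor's decode
// step later chokes on.
object JarMapEncodeSketch {
  def encode(jars: collection.Map[String, Long]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    out.writeInt(jars.size)              // N is fixed here...
    for ((path, timestamp) <- jars) {    // ...while the entries are written afterwards
      out.writeUTF(path)
      out.writeLong(timestamp)
    }
    out.flush()
    buffer.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val sharedJars = mutable.HashMap("a.jar" -> 1L, "b.jar" -> 2L)
    // Proposed fix in miniature: snapshot into an immutable Map before encoding,
    // so a concurrent sharedJars += ... cannot change what gets serialized.
    val snapshot: Map[String, Long] = sharedJars.toMap
    println(s"encoded ${snapshot.size} jars into ${encode(snapshot).length} bytes")
  }
}
{code}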
[jira] [Commented] (SPARK-20433) Security issue with jackson-databind
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16105751#comment-16105751 ] Andrew Ash commented on SPARK-20433: Here's the patch I put in my fork of Spark: https://github.com/palantir/spark/pull/241 It addresses CVE-2017-7525 -- http://www.securityfocus.com/bid/99623 > Security issue with jackson-databind > > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash > Labels: security > > There was a security vulnerability recently reported to the upstream > jackson-databind project at > https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix > released. > From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the > first fixed versions in their respectful 2.X branches, and versions in the > 2.6.X line and earlier remain vulnerable. > Right now Spark master branch is on 2.6.5: > https://github.com/apache/spark/blob/master/pom.xml#L164 > and Hadoop branch-2.7 is on 2.2.3: > https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 > and Hadoop branch-3.0.0-alpha2 is on 2.7.8: > https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 > We should try to find to find a way to get on a patched version of > jackson-bind for the Spark 2.2.0 release. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
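For downstream builds that cannot wait for a Spark release, one way to force the patched artifact is a dependency override. The snippet below is a sketch for an sbt-based project (Spark itself is built with Maven, where the equivalent is bumping the shared Jackson version property in the root pom); 2.8.8.1 is one of the fixed versions named in the description.
{code}
// build.sbt: pin every transitive jackson-databind to a patched release
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.8.8.1"
{code}
Note that Jackson modules generally need to stay on matching minor versions (e.g. jackson-module-scala), so a bump like this may require overriding the related artifacts as well.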
[jira] [Updated] (SPARK-21564) TaskDescription decoding failure should fail the task
[ https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-21564: --- Description: cc [~robert3005] I was seeing an issue where Spark was throwing this exception: {noformat} 16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox - Ignoring error java.io.EOFException: null at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readUTF(DataInputStream.java:609) at java.io.DataInputStream.readUTF(DataInputStream.java:564) at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {noformat} For details on the cause of that exception, see SPARK-21563 We've since changed the application and have a proposed fix in Spark at the ticket above, but it was troubling that decoding the TaskDescription wasn't failing the tasks. So the Spark job ended up hanging and making no progress, permanently stuck, because the driver thinks the task is running but the thread has died in the executor. We should make a change around https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96 so that when that decode throws an exception, the task is marked as failed. 
cc [~kayousterhout] [~irashid] was: I was seeing an issue where Spark was throwing this exception: {noformat} 16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox - Ignoring error java.io.EOFException: null at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readUTF(DataInputStream.java:609) at java.io.DataInputStream.readUTF(DataInputStream.java:564) at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {noformat} For details on the cause of that exception, see SPARK-21563 We've since changed the application and have a proposed fix in Spark at the ticket above, but it was troubling that decoding the TaskDescription wasn't failing the tasks. So the Spark job ended up hanging and making no progress, permanently stuck, because the driver thinks the task is running but the thread has died in the executor. We should make a change around https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96 so that when that decode throws an exception, the task is marked as failed. cc [~kayousterhout] [~irashid] > TaskDescription decoding failure should fail the task > - > > Key: SPARK-21564 > URL: https://issues.apache.org/jira/browse/SPARK-21564 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Andrew Ash > > cc [~robert3005] > I was seeing an issue where Spark was throwing this exception: > {noformat} > 16:16:28.294 [dispatcher-event-loop-14] ERROR > org.apache.spark.rpc.netty.Inbox - Ignoring
[jira] [Updated] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars
[ https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-21563: --- Description: cc [~robert3005] I was seeing this exception during some running Spark jobs: {noformat} 16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox - Ignoring error java.io.EOFException: null at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readUTF(DataInputStream.java:609) at java.io.DataInputStream.readUTF(DataInputStream.java:564) at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {noformat} After some debugging, we determined that this is due to a race condition in task serde. cc [~irashid] [~kayousterhout] who last touched that code in SPARK-19796 The race is between adding additional jars to the SparkContext and serializing the TaskDescription. Consider this sequence of events: - TaskSetManager creates a TaskDescription using a reference to the SparkContext's jars: https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506 - TaskDescription starts serializing, and begins writing jars: https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84 - the size of the jar map is written out: https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63 - _on another thread_: the application adds a jar to the SparkContext's jars list - then the entries in the jars list are serialized out: https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64 The problem now is that the jars list is serialized as having N entries, but actually N+1 entries follow that count! This causes task deserialization to fail in the executor, with the stacktrace above. The same issue also likely exists for files, though I haven't observed that and our application does not stress that codepath the same way it did for jar additions. One fix here is that TaskSetManager could make an immutable copy of the jars list that it passes into the TaskDescription constructor, so that list doesn't change mid-serialization. 
was: cc [~robert3005] I was seeing this exception during some running Spark jobs: {noformat} 16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox - Ignoring error java.io.EOFException: null at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readUTF(DataInputStream.java:609) at java.io.DataInputStream.readUTF(DataInputStream.java:564) at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {noformat} After some debugging, we determined that this is due to a race condition in task serde introduced in SPARK-19796. cc [~irashid] [~kayousterhout] The race is between adding additional jars to the SparkContext and serializing the Tas
[jira] [Created] (SPARK-21564) TaskDescription decoding failure should fail the task
Andrew Ash created SPARK-21564: -- Summary: TaskDescription decoding failure should fail the task Key: SPARK-21564 URL: https://issues.apache.org/jira/browse/SPARK-21564 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Andrew Ash I was seeing an issue where Spark was throwing this exception: {noformat} 16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox - Ignoring error java.io.EOFException: null at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readUTF(DataInputStream.java:609) at java.io.DataInputStream.readUTF(DataInputStream.java:564) at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {noformat} For details on the cause of that exception, see SPARK-21563 We've since changed the application and have a proposed fix in Spark at the ticket above, but it was troubling that decoding the TaskDescription wasn't failing the tasks. So the Spark job ended up hanging and making no progress, permanently stuck, because the driver thinks the task is running but the thread has died in the executor. We should make a change around https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96 so that when that decode throws an exception, the task is marked as failed. cc [~kayousterhout] [~irashid] -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
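To make the requested behaviour concrete, here is a self-contained sketch of the handling the ticket asks for; the decode and reporting functions are hypothetical stand-ins rather than Spark's real APIs, and the actual change would live in CoarseGrainedExecutorBackend's LaunchTask handler.
{code}
import java.nio.ByteBuffer
import scala.util.{Failure, Success, Try}

object LaunchTaskSketch {
  final case class TaskDesc(taskId: Long, name: String)

  def handleLaunchTask(
      serialized: ByteBuffer,
      decode: ByteBuffer => TaskDesc,               // stand-in for TaskDescription.decode
      reportFailure: (String, Throwable) => Unit,   // stand-in for notifying the driver
      launch: TaskDesc => Unit): Unit = {
    Try(decode(serialized)) match {
      case Success(task) =>
        launch(task)
      case Failure(e) =>
        // Today the EOFException is only logged by the RPC inbox, so the driver
        // keeps believing the task is running. Reporting the failure instead
        // lets the scheduler mark the task as failed and reschedule it.
        reportFailure(s"Could not decode TaskDescription: ${e.getMessage}", e)
    }
  }
}
{code}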
[jira] [Created] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars
Andrew Ash created SPARK-21563: -- Summary: Race condition when serializing TaskDescriptions and adding jars Key: SPARK-21563 URL: https://issues.apache.org/jira/browse/SPARK-21563 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Andrew Ash cc [~robert3005] I was seeing this exception during some running Spark jobs: {noformat} 16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox - Ignoring error java.io.EOFException: null at java.io.DataInputStream.readFully(DataInputStream.java:197) at java.io.DataInputStream.readUTF(DataInputStream.java:609) at java.io.DataInputStream.readUTF(DataInputStream.java:564) at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127) at org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {noformat} After some debugging, we determined that this is due to a race condition in task serde introduced in SPARK-19796. cc [~irashid] [~kayousterhout] The race is between adding additional jars to the SparkContext and serializing the TaskDescription. Consider this sequence of events: - TaskSetManager creates a TaskDescription using a reference to the SparkContext's jars: https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506 - TaskDescription starts serializing, and begins writing jars: https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84 - the size of the jar map is written out: https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63 - _on another thread_: the application adds a jar to the SparkContext's jars list - then the entries in the jars list are serialized out: https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64 The problem now is that the jars list is serialized as having N entries, but actually N+1 entries follow that count! This causes task deserialization to fail in the executor, with the stacktrace above. The same issue also likely exists for files, though I haven't observed that and our application does not stress that codepath the same way it did for jar additions. One fix here is that TaskSetManager could make an immutable copy of the jars list that it passes into the TaskDescription constructor, so that list doesn't change mid-serialization. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14887) Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits
[ https://issues.apache.org/jira/browse/SPARK-14887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099584#comment-16099584 ] Andrew Ash commented on SPARK-14887: [~fang fang chen] have you seen this in the latest version of Spark? There have been many fixes since the "affects version" Spark 1.5.2 you set on the bug. Can you please retry with the latest Spark and let us know if the problem persists? > Generated SpecificUnsafeProjection Exceeds JVM Code Size Limits > --- > > Key: SPARK-14887 > URL: https://issues.apache.org/jira/browse/SPARK-14887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: fang fang chen >Assignee: Davies Liu > > Similar issue to SPARK-14138 and SPARK-8443: > With a large SQL statement (673K), the following error happened: > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V" > of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21289) Text and CSV formats do not support custom end-of-line delimiters
[ https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081136#comment-16081136 ] Andrew Ash commented on SPARK-21289: Looks like this will fix SPARK-17227 also > Text and CSV formats do not support custom end-of-line delimiters > - > > Key: SPARK-21289 > URL: https://issues.apache.org/jira/browse/SPARK-21289 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Yevgen Galchenko >Priority: Minor > > Spark csv and text readers always use default CR, LF or CRLF line terminators > without an option to configure a custom delimiter. > Option "textinputformat.record.delimiter" is not being used to set delimiter > in HadoopFileLinesReader and can only be set for Hadoop RDD when textFile() > is used to read file. > Possible solution would be to change HadoopFileLinesReader and create > LineRecordReader with delimiters specified in configuration. LineRecordReader > already supports passing recordDelimiter in its constructor. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
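Until the built-in readers expose such an option, a commonly used workaround is to drop down to the Hadoop input format, where LineRecordReader does honour textinputformat.record.delimiter. A sketch, assuming a spark-shell session with an existing SparkContext sc and a placeholder path:
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Read records separated by a custom delimiter (here \u0001) instead of \n.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "\u0001")

val records = sc.newAPIHadoopFile(
    "path/to/input",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    hadoopConf)
  .map { case (_, line) => line.toString }  // Text objects are reused by Hadoop, so copy them out
{code}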
[jira] [Closed] (SPARK-15226) CSV file data-line with newline at first line load error
[ https://issues.apache.org/jira/browse/SPARK-15226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-15226. -- Resolution: Fixed Fix Version/s: 2.2.0 Fixed by https://issues.apache.org/jira/browse/SPARK-19610 > CSV file data-line with newline at first line load error > > > Key: SPARK-15226 > URL: https://issues.apache.org/jira/browse/SPARK-15226 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Weichen Xu > Fix For: 2.2.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > CSV file such as: > --- > v1,v2,"v > 3",v4,v5 > a,b,c,d,e > --- > it contains two row,first row : > v1, v2, v\n3, v4, v5 (in value v\n3 it contains a newline character,it is > legal) > second row: > a,b,c,d,e > then in spark-shell run commands like: > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > var reader = sqlContext.read > var df = reader.csv("path/to/csvfile") > df.collect > then we find the load data is wrong, > the load data has only 3 columns, but in fact it has 5 columns. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
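With the fix from SPARK-19610 (the multiLine CSV option in Spark 2.2), the file from the description can be read with all five columns intact. A sketch against a spark-shell session:
{code}
val df = spark.read
  .option("multiLine", "true")   // allow quoted values to span line breaks
  .csv("path/to/csvfile")
df.collect   // both rows now come back with all 5 columns
{code}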
[jira] [Comment Edited] (SPARK-15226) CSV file data-line with newline at first line load error
[ https://issues.apache.org/jira/browse/SPARK-15226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081154#comment-16081154 ] Andrew Ash edited comment on SPARK-15226 at 7/10/17 9:07 PM: - Fixed by https://issues.apache.org/jira/browse/SPARK-19610 was (Author: aash): Fixed by Fixed by https://issues.apache.org/jira/browse/SPARK-19610 > CSV file data-line with newline at first line load error > > > Key: SPARK-15226 > URL: https://issues.apache.org/jira/browse/SPARK-15226 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Weichen Xu > Fix For: 2.2.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > CSV file such as: > --- > v1,v2,"v > 3",v4,v5 > a,b,c,d,e > --- > it contains two row,first row : > v1, v2, v\n3, v4, v5 (in value v\n3 it contains a newline character,it is > legal) > second row: > a,b,c,d,e > then in spark-shell run commands like: > val sqlContext = new org.apache.spark.sql.SQLContext(sc); > var reader = sqlContext.read > var df = reader.csv("path/to/csvfile") > df.collect > then we find the load data is wrong, > the load data has only 3 columns, but in fact it has 5 columns. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21220) Use outputPartitioning's bucketing if possible on write
Andrew Ash created SPARK-21220: -- Summary: Use outputPartitioning's bucketing if possible on write Key: SPARK-21220 URL: https://issues.apache.org/jira/browse/SPARK-21220 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Reporter: Andrew Ash When reading a bucketed dataset and writing it back with no transformations (a copy) the bucketing information is lost and the user is required to re-specify the bucketing information on write. This negatively affects read performance on the copied dataset since the bucketing information enables significant optimizations that aren't possible on the un-bucketed copied table. Spark should propagate this bucketing information for copied datasets, and more generally could support inferring bucket information based on the known partitioning of the final RDD at save time when that partitioning is a {{HashPartitioning}}. https://github.com/apache/spark/blob/v2.2.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L118 In the above linked {{bucketIdExpression}}, we could {{.orElse}} a bucket expression based on outputPartitionings that are HashPartitioning. This preserves bucket information for bucketed datasets, and also supports saving this metadata at write time for datasets with a known partitioning. Both of these cases should improve performance at read time of the newly-written dataset. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
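For context, this is the restating of the bucket spec that the ticket wants to make unnecessary when copying an already-bucketed dataset; the table and column names below are hypothetical.
{code}
// Copying a bucketed table today: the bucketing must be declared again by hand,
// even though the data being written is already hash-partitioned by user_id.
spark.table("events_bucketed")
  .write
  .bucketBy(8, "user_id")
  .sortBy("event_time")
  .saveAsTable("events_bucketed_copy")
{code}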
[jira] [Comment Edited] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058930#comment-16058930 ] Andrew Ash edited comment on SPARK-19700 at 6/22/17 7:47 AM: - Public implementation that's been around a while (Hamel is familiar with this): Two Sigma's integration with their Cook scheduler, recent diff at https://github.com/twosigma/spark/compare/cd0a083...2.1.0-cook was (Author: aash): Public implementation that's been around a while: Two Sigma's integration with their Cook scheduler, recent diff at https://github.com/twosigma/spark/compare/cd0a083...2.1.0-cook > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah > > One point that was brought up in discussing SPARK-18278 was that schedulers > cannot easily be added to Spark without forking the whole project. The main > reason is that much of the scheduler's behavior fundamentally depends on the > CoarseGrainedSchedulerBackend class, which is not part of the public API of > Spark and is in fact quite a complex module. As resource management and > allocation continues evolves, Spark will need to be integrated with more > cluster managers, but maintaining support for all possible allocators in the > Spark project would be untenable. Furthermore, it would be impossible for > Spark to support proprietary frameworks that are developed by specific users > for their other particular use cases. > Therefore, this ticket proposes making scheduler implementations fully > pluggable. The idea is that Spark will provide a Java/Scala interface that is > to be implemented by a scheduler that is backed by the cluster manager of > interest. The user can compile their scheduler's code into a JAR that is > placed on the driver's classpath. Finally, as is the case in the current > world, the scheduler implementation is selected and dynamically loaded > depending on the user's provided master URL. > Determining the correct API is the most challenging problem. The current > CoarseGrainedSchedulerBackend handles many responsibilities, some of which > will be common across all cluster managers, and some which will be specific > to a particular cluster manager. For example, the particular mechanism for > creating the executor processes will differ between YARN and Mesos, but, once > these executors have started running, the means to submit tasks to them over > the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the > application as well, because different cluster managers support different > configuration options, and thus the driver must be bootstrapped accordingly. > For example, in YARN mode the application and Hadoop configuration must be > packaged and shipped to the distributed cache prior to launching the job. A > prototype of a Kubernetes implementation starts a Kubernetes pod that runs > the driver in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
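To give the proposal a concrete shape, a pluggable backend could be described by an interface roughly like the sketch below. This is purely hypothetical; it is not an API that exists in Spark today, and the real design would also need to cover task submission, executor lifecycle, and application submission as discussed in the description.
{code}
import org.apache.spark.SparkConf

// Hypothetical SPI: implementations ship in user-provided jars on the driver
// classpath and are selected by the master URL they claim to understand.
trait ExternalSchedulerProvider {
  /** Whether this provider handles the given master URL, e.g. "cook://...". */
  def canCreate(masterUrl: String): Boolean

  /** Build the backend used to request executors from the cluster manager. */
  def createBackend(conf: SparkConf, masterUrl: String): ExternalSchedulerBackend
}

trait ExternalSchedulerBackend {
  def start(): Unit
  def requestTotalExecutors(total: Int): Boolean
  def killExecutors(executorIds: Seq[String]): Boolean
  def stop(): Unit
}
{code}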
[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058930#comment-16058930 ] Andrew Ash commented on SPARK-19700: Public implementation that's been around a while: Two Sigma's integration with their Cook scheduler, recent diff at https://github.com/twosigma/spark/compare/cd0a083...2.1.0-cook > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah > > One point that was brought up in discussing SPARK-18278 was that schedulers > cannot easily be added to Spark without forking the whole project. The main > reason is that much of the scheduler's behavior fundamentally depends on the > CoarseGrainedSchedulerBackend class, which is not part of the public API of > Spark and is in fact quite a complex module. As resource management and > allocation continues evolves, Spark will need to be integrated with more > cluster managers, but maintaining support for all possible allocators in the > Spark project would be untenable. Furthermore, it would be impossible for > Spark to support proprietary frameworks that are developed by specific users > for their other particular use cases. > Therefore, this ticket proposes making scheduler implementations fully > pluggable. The idea is that Spark will provide a Java/Scala interface that is > to be implemented by a scheduler that is backed by the cluster manager of > interest. The user can compile their scheduler's code into a JAR that is > placed on the driver's classpath. Finally, as is the case in the current > world, the scheduler implementation is selected and dynamically loaded > depending on the user's provided master URL. > Determining the correct API is the most challenging problem. The current > CoarseGrainedSchedulerBackend handles many responsibilities, some of which > will be common across all cluster managers, and some which will be specific > to a particular cluster manager. For example, the particular mechanism for > creating the executor processes will differ between YARN and Mesos, but, once > these executors have started running, the means to submit tasks to them over > the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the > application as well, because different cluster managers support different > configuration options, and thus the driver must be bootstrapped accordingly. > For example, in YARN mode the application and Hadoop configuration must be > packaged and shipped to the distributed cache prior to launching the job. A > prototype of a Kubernetes implementation starts a Kubernetes pod that runs > the driver in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058917#comment-16058917 ] Andrew Ash commented on SPARK-19700: Found another potential implementation: Facebook's in-house scheduler mentioned by [~tejasp] at https://github.com/apache/spark/pull/18209#issuecomment-307538061 > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah > > One point that was brought up in discussing SPARK-18278 was that schedulers > cannot easily be added to Spark without forking the whole project. The main > reason is that much of the scheduler's behavior fundamentally depends on the > CoarseGrainedSchedulerBackend class, which is not part of the public API of > Spark and is in fact quite a complex module. As resource management and > allocation continues evolves, Spark will need to be integrated with more > cluster managers, but maintaining support for all possible allocators in the > Spark project would be untenable. Furthermore, it would be impossible for > Spark to support proprietary frameworks that are developed by specific users > for their other particular use cases. > Therefore, this ticket proposes making scheduler implementations fully > pluggable. The idea is that Spark will provide a Java/Scala interface that is > to be implemented by a scheduler that is backed by the cluster manager of > interest. The user can compile their scheduler's code into a JAR that is > placed on the driver's classpath. Finally, as is the case in the current > world, the scheduler implementation is selected and dynamically loaded > depending on the user's provided master URL. > Determining the correct API is the most challenging problem. The current > CoarseGrainedSchedulerBackend handles many responsibilities, some of which > will be common across all cluster managers, and some which will be specific > to a particular cluster manager. For example, the particular mechanism for > creating the executor processes will differ between YARN and Mesos, but, once > these executors have started running, the means to submit tasks to them over > the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the > application as well, because different cluster managers support different > configuration options, and thus the driver must be bootstrapped accordingly. > For example, in YARN mode the application and Hadoop configuration must be > packaged and shipped to the distributed cache prior to launching the job. A > prototype of a Kubernetes implementation starts a Kubernetes pod that runs > the driver in cluster mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041803#comment-16041803 ] Andrew Ash commented on SPARK-19700: Found another potential implementation: Nomad by [~barnardb] at SPARK-20992 > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah > > One point that was brought up in discussing SPARK-18278 was that schedulers > cannot easily be added to Spark without forking the whole project. The main > reason is that much of the scheduler's behavior fundamentally depends on the > CoarseGrainedSchedulerBackend class, which is not part of the public API of > Spark and is in fact quite a complex module. As resource management and > allocation continues evolves, Spark will need to be integrated with more > cluster managers, but maintaining support for all possible allocators in the > Spark project would be untenable. Furthermore, it would be impossible for > Spark to support proprietary frameworks that are developed by specific users > for their other particular use cases. > Therefore, this ticket proposes making scheduler implementations fully > pluggable. The idea is that Spark will provide a Java/Scala interface that is > to be implemented by a scheduler that is backed by the cluster manager of > interest. The user can compile their scheduler's code into a JAR that is > placed on the driver's classpath. Finally, as is the case in the current > world, the scheduler implementation is selected and dynamically loaded > depending on the user's provided master URL. > Determining the correct API is the most challenging problem. The current > CoarseGrainedSchedulerBackend handles many responsibilities, some of which > will be common across all cluster managers, and some which will be specific > to a particular cluster manager. For example, the particular mechanism for > creating the executor processes will differ between YARN and Mesos, but, once > these executors have started running, the means to submit tasks to them over > the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the > application as well, because different cluster managers support different > configuration options, and thus the driver must be bootstrapped accordingly. > For example, in YARN mode the application and Hadoop configuration must be > packaged and shipped to the distributed cache prior to launching the job. A > prototype of a Kubernetes implementation starts a Kubernetes pod that runs > the driver in cluster mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035171#comment-16035171 ] Andrew Ash commented on SPARK-20952: For the localProperties on SparkContext, Spark does two things I can see to improve safety: - first, it clones the properties for new threads so changes in the parent thread don't unintentionally affect a child thread: https://github.com/apache/spark/blob/v2.2.0-rc2/core/src/main/scala/org/apache/spark/SparkContext.scala#L330 - second, it clears the properties when they're no longer being used: https://github.com/apache/spark/blob/v2.2.0-rc2/core/src/main/scala/org/apache/spark/SparkContext.scala#L1942 Do we need to do either the defensive cloning or the proactive clearing of taskInfos in executors, as is done in the driver? > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal; as a result, when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for an example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
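For reference, the clone-on-inherit behaviour used for localProperties comes from overriding childValue on an InheritableThreadLocal. A minimal standalone version of that pattern (assuming commons-lang3 on the classpath; this is not the executor-side change itself):
{code}
import java.util.Properties
import org.apache.commons.lang3.SerializationUtils

object InheritedPropsSketch {
  val props = new InheritableThreadLocal[Properties] {
    // Child threads inherit a copy at fork time, so later mutations in the
    // parent thread cannot leak into already-forked children.
    override protected def childValue(parent: Properties): Properties =
      SerializationUtils.clone(parent)
    override protected def initialValue(): Properties = new Properties()
  }

  def main(args: Array[String]): Unit = {
    props.get().setProperty("jobGroup", "etl")
    val child = new Thread(new Runnable {
      override def run(): Unit = println(props.get().getProperty("jobGroup"))
    })
    child.start()
    child.join()
    props.remove()   // proactive clearing, mirroring the cleanup mentioned above
  }
}
{code}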
[jira] [Created] (SPARK-20815) NullPointerException in RPackageUtils#checkManifestForR
Andrew Ash created SPARK-20815: -- Summary: NullPointerException in RPackageUtils#checkManifestForR Key: SPARK-20815 URL: https://issues.apache.org/jira/browse/SPARK-20815 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.1.1 Reporter: Andrew Ash Some jars don't have manifest files in them, such as in my case javax.inject-1.jar and value-2.2.1-annotations.jar This causes the below NPE: {noformat} Exception in thread "main" java.lang.NullPointerException at org.apache.spark.deploy.RPackageUtils$.checkManifestForR(RPackageUtils.scala:95) at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply$mcV$sp(RPackageUtils.scala:180) at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180) at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1$$anonfun$apply$1.apply(RPackageUtils.scala:180) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1322) at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:202) at org.apache.spark.deploy.RPackageUtils$$anonfun$checkAndBuildRPackage$1.apply(RPackageUtils.scala:175) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.apache.spark.deploy.RPackageUtils$.checkAndBuildRPackage(RPackageUtils.scala:175) at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:311) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:152) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {noformat} due to RPackageUtils#checkManifestForR assuming {{jar.getManifest}} is non-null. However per the JDK spec it can be null: {noformat} /** * Returns the jar file manifest, or null if none. * * @return the jar file manifest, or null if none * * @throws IllegalStateException * may be thrown if the jar file has been closed * @throws IOException if an I/O error has occurred */ public Manifest getManifest() throws IOException { return getManifestFromReference(); } {noformat} This method should do a null check and return false if the manifest is null (meaning no R code in that jar) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
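A guard along the following lines would avoid the NPE, treating a missing manifest as "no R package in this jar". This is a REPL-style sketch, not the actual RPackageUtils code, and the manifest attribute to check is passed in as a parameter rather than hard-coded.
{code}
import java.util.jar.JarFile

// Returns false for jars without a manifest instead of throwing an NPE.
def jarDeclaresRPackage(jar: JarFile, rPackageAttribute: String): Boolean = {
  val manifest = jar.getManifest   // may be null per the JarFile javadoc quoted above
  if (manifest == null) {
    false
  } else {
    "true".equalsIgnoreCase(manifest.getMainAttributes.getValue(rPackageAttribute))
  }
}
{code}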
[jira] [Commented] (SPARK-20683) Make table uncache chaining optional
[ https://issues.apache.org/jira/browse/SPARK-20683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018008#comment-16018008 ] Andrew Ash commented on SPARK-20683: Thanks for that diff [~shea.parkes] -- we're planning on trying it in our fork too: https://github.com/palantir/spark/pull/188 > Make table uncache chaining optional > > > Key: SPARK-20683 > URL: https://issues.apache.org/jira/browse/SPARK-20683 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: Not particularly environment sensitive. > Encountered/tested on Linux and Windows. >Reporter: Shea Parkes > > A recent change was made in SPARK-19765 that causes table uncaching to chain. > That is, if table B is a child of table A, and they are both cached, now > uncaching table A will automatically uncache table B. > At first I did not understand the need for this, but when reading the unit > tests, I see that it is likely that many people do not keep named references > to the child table (e.g. B). Perhaps B is just made and cached as some part > of data exploration. In that situation, it makes sense for B to > automatically be uncached when you are finished with A. > However, we commonly utilize a different design pattern that is now harmed by > this automatic uncaching. It is common for us to cache table A to then make > two, independent children tables (e.g. B and C). Once those two child tables > are realized and cached, we'd then uncache table A (as it was no longer > needed and could be quite large). After this change now, when we uncache > table A, we suddenly lose our cached status on both table B and C (which is > quite frustrating). All of these tables are often quite large, and we view > what we're doing as mindful memory management. We are maintaining named > references to B and C at all times, so we can always uncache them ourselves > when it makes sense. > Would it be acceptable/feasible to make this table uncache chaining optional? > I would be fine if the default is for the chaining to happen, as long as we > can turn it off via parameters. > If acceptable, I can try to work towards making the required changes. I am > most comfortable in Python (and would want the optional parameter surfaced in > Python), but have found the places required to make this change in Scala > (since I reverted the functionality in a private fork already). Any help > would be greatly appreciated however. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
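For reference, this is roughly the memory-management pattern described in the report, where the cascading uncache now also evicts the still-needed children; paths and filters are hypothetical.
{code}
import spark.implicits._

// Parent cached only long enough to build two cached children.
val a = spark.read.parquet("/data/source").cache()
val b = a.filter($"kind" === "b").cache()
val c = a.filter($"kind" === "c").cache()
b.count(); c.count()   // materialize both children while a is still cached

// Before the chaining change this only dropped a; afterwards it also evicts
// b and c, which is the behaviour the reporter would like to make optional.
a.unpersist()
{code}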
[jira] [Commented] (SPARK-19700) Design an API for pluggable scheduler implementations
[ https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012011#comment-16012011 ] Andrew Ash commented on SPARK-19700: Found another potential implementation: Eagle cluster manager from https://github.com/epfl-labos/eagle with PR at https://github.com/apache/spark/pull/17974 > Design an API for pluggable scheduler implementations > - > > Key: SPARK-19700 > URL: https://issues.apache.org/jira/browse/SPARK-19700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Matt Cheah > > One point that was brought up in discussing SPARK-18278 was that schedulers > cannot easily be added to Spark without forking the whole project. The main > reason is that much of the scheduler's behavior fundamentally depends on the > CoarseGrainedSchedulerBackend class, which is not part of the public API of > Spark and is in fact quite a complex module. As resource management and > allocation continues evolves, Spark will need to be integrated with more > cluster managers, but maintaining support for all possible allocators in the > Spark project would be untenable. Furthermore, it would be impossible for > Spark to support proprietary frameworks that are developed by specific users > for their other particular use cases. > Therefore, this ticket proposes making scheduler implementations fully > pluggable. The idea is that Spark will provide a Java/Scala interface that is > to be implemented by a scheduler that is backed by the cluster manager of > interest. The user can compile their scheduler's code into a JAR that is > placed on the driver's classpath. Finally, as is the case in the current > world, the scheduler implementation is selected and dynamically loaded > depending on the user's provided master URL. > Determining the correct API is the most challenging problem. The current > CoarseGrainedSchedulerBackend handles many responsibilities, some of which > will be common across all cluster managers, and some which will be specific > to a particular cluster manager. For example, the particular mechanism for > creating the executor processes will differ between YARN and Mesos, but, once > these executors have started running, the means to submit tasks to them over > the Netty RPC is identical across the board. > We must also consider a plugin model and interface for submitting the > application as well, because different cluster managers support different > configuration options, and thus the driver must be bootstrapped accordingly. > For example, in YARN mode the application and Hadoop configuration must be > packaged and shipped to the distributed cache prior to launching the job. A > prototype of a Kubernetes implementation starts a Kubernetes pod that runs > the driver in cluster mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20433) Security issue with jackson-databind
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979630#comment-15979630 ] Andrew Ash commented on SPARK-20433: It's unclear if Spark is affected, I wanted to open this ticket to start the discussion. > Security issue with jackson-databind > > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash >Priority: Critical > Labels: security > > There was a security vulnerability recently reported to the upstream > jackson-databind project at > https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix > released. > From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the > first fixed versions in their respectful 2.X branches, and versions in the > 2.6.X line and earlier remain vulnerable. > Right now Spark master branch is on 2.6.5: > https://github.com/apache/spark/blob/master/pom.xml#L164 > and Hadoop branch-2.7 is on 2.2.3: > https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 > and Hadoop branch-3.0.0-alpha2 is on 2.7.8: > https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 > We should try to find to find a way to get on a patched version of > jackson-bind for the Spark 2.2.0 release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20433) Security issue with jackson-databind
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-20433: --- Priority: Major (was: Blocker) > Security issue with jackson-databind > > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash > Labels: security > > There was a security vulnerability recently reported to the upstream > jackson-databind project at > https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix > released. > From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the > first fixed versions in their respectful 2.X branches, and versions in the > 2.6.X line and earlier remain vulnerable. > Right now Spark master branch is on 2.6.5: > https://github.com/apache/spark/blob/master/pom.xml#L164 > and Hadoop branch-2.7 is on 2.2.3: > https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 > and Hadoop branch-3.0.0-alpha2 is on 2.7.8: > https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 > We should try to find to find a way to get on a patched version of > jackson-bind for the Spark 2.2.0 release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20433) Security issue with jackson-databind
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-20433: --- Priority: Critical (was: Major) > Security issue with jackson-databind > > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash >Priority: Critical > Labels: security > > There was a security vulnerability recently reported to the upstream > jackson-databind project at > https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix > released. > From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the > first fixed versions in their respectful 2.X branches, and versions in the > 2.6.X line and earlier remain vulnerable. > Right now Spark master branch is on 2.6.5: > https://github.com/apache/spark/blob/master/pom.xml#L164 > and Hadoop branch-2.7 is on 2.2.3: > https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 > and Hadoop branch-3.0.0-alpha2 is on 2.7.8: > https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 > We should try to find to find a way to get on a patched version of > jackson-bind for the Spark 2.2.0 release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20433) Security issue with jackson-databind
Andrew Ash created SPARK-20433: -- Summary: Security issue with jackson-databind Key: SPARK-20433 URL: https://issues.apache.org/jira/browse/SPARK-20433 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Andrew Ash Priority: Blocker There was a security vulnerability recently reported to the upstream jackson-databind project at https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix released. From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the first fixed versions in their respective 2.X branches, and versions in the 2.6.X line and earlier remain vulnerable. Right now the Spark master branch is on 2.6.5: https://github.com/apache/spark/blob/master/pom.xml#L164 and Hadoop branch-2.7 is on 2.2.3: https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 and Hadoop branch-3.0.0-alpha2 is on 2.7.8: https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 We should try to find a way to get on a patched version of jackson-databind for the Spark 2.2.0 release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973763#comment-15973763 ] Andrew Ash commented on SPARK-20364: Thanks for the investigation [~hyukjin.kwon]! The proposal you put up (sorry Rob, wasn't me) looks like a good direction. I'm not sure it will work as written though, since I'd expect the IntColumn/LongColumn/etc classes from Parquet to be expected in other places, an instanceof for example. As an alternate suggestion, I'm wondering if we could call the package-protected constructors of the {{IntColumn}} class from Spark-owned class in that package with the result from the {{ColumnPath.get()}} method, rather than its {{.fromDotString()}} method. Then Spark can use its better knowledge of whether the field is a column-with-dot or a field-in-struct and assemble the ColumnPath correctly. And I do think that this should eventually live in Parquet. Maybe there's a ticket already open there for this? Does that make sense? > Parquet predicate pushdown on columns with dots return empty results > > > Key: SPARK-20364 > URL: https://issues.apache.org/jira/browse/SPARK-20364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > Currently, if there are dots in the column name, predicate pushdown seems > being failed in Parquet. > **With dots** > {code} > val path = "/tmp/abcde" > Seq(Some(1), None).toDF("col.dots").write.parquet(path) > spark.read.parquet(path).where("`col.dots` IS NOT NULL").show() > {code} > {code} > ++ > |col.dots| > ++ > ++ > {code} > **Without dots** > {code} > val path = "/tmp/abcde2" > Seq(Some(1), None).toDF("coldots").write.parquet(path) > spark.read.parquet(path).where("`coldots` IS NOT NULL").show() > {code} > {code} > +---+ > |coldots| > +---+ > | 1| > +---+ > {code} > It seems dot in the column names via {{FilterApi}} tries to separate the > field name with dot ({{ColumnPath}} with multiple column paths) whereas the > actual column name is {{col.dots}}. (See [FilterApi.java#L71 > |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71] > and it calls > [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44]. > I just tried to come up with ways to resolve it and I came up with two as > below: > One is simply to don't push down filters when there are dots in column names > so that it reads all and filters in Spark-side. > The other way creates Spark's {{FilterApi}} for those columns (it seems > final) to get always use single column path it in Spark-side (this seems > hacky) as we are not pushing down nested columns currently. So, it looks we > can get a field name via {{ColumnPath.get}} not {{ColumnPath.fromDotString}} > in this way. > I just made a rough version of the latter. 
> {code} > private[parquet] object ParquetColumns { > def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = { > new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with > SupportsLtGt > } > def longColumn(columnPath: String): Column[java.lang.Long] with > SupportsLtGt = { > new Column[java.lang.Long] ( > ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt > } > def floatColumn(columnPath: String): Column[java.lang.Float] with > SupportsLtGt = { > new Column[java.lang.Float] ( > ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt > } > def doubleColumn(columnPath: String): Column[java.lang.Double] with > SupportsLtGt = { > new Column[java.lang.Double] ( > ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt > } > def booleanColumn(columnPath: String): Column[java.lang.Boolean] with > SupportsEqNotEq = { > new Column[java.lang.Boolean] ( > ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with > SupportsEqNotEq > } > def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = { > new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with > SupportsLtGt > } > } > {code} > Both sound not the best. Please let me know if anyone has a better idea. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
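To make the distinction above concrete, here is a small sketch of the two {{ColumnPath}} factory methods being discussed; it assumes parquet-common 1.8.x on the classpath and only illustrates the dot-splitting behavior, it is not Spark code.
{code}
import org.apache.parquet.hadoop.metadata.ColumnPath

// fromDotString splits on dots, so a column literally named "col.dots"
// becomes a two-segment path ["col", "dots"] -- i.e. a nested field lookup.
val nested = ColumnPath.fromDotString("col.dots")

// get takes the segments verbatim, producing the single-segment path
// ["col.dots"] that the proposed ParquetColumns helpers above rely on.
val flat = ColumnPath.get("col.dots")
{code}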
[jira] [Comment Edited] (SPARK-1809) Mesos backend doesn't respect HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966782#comment-15966782 ] Andrew Ash edited comment on SPARK-1809 at 4/12/17 11:00 PM: - I'm not using Mesos anymore, so closing was (Author: aash): Not using Mesos anymore, so closing > Mesos backend doesn't respect HADOOP_CONF_DIR > - > > Key: SPARK-1809 > URL: https://issues.apache.org/jira/browse/SPARK-1809 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Andrew Ash > > In order to use HDFS paths without the server component, standalone mode > reads spark-env.sh and scans the HADOOP_CONF_DIR to open core-site.xml and > get the fs.default.name parameter. > This lets you use HDFS paths like: > - hdfs:///tmp/myfile.txt > instead of > - hdfs://myserver.mydomain.com:8020/tmp/myfile.txt > However as of recent 1.0.0 pre-release (hash 756c96) I had to specify HDFS > paths with the full server even though I have HADOOP_CONF_DIR still set in > spark-env.sh. The HDFS, Spark, and Mesos nodes are all co-located and > non-domain HDFS paths work fine when using the standalone mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-1809) Mesos backend doesn't respect HADOOP_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash closed SPARK-1809. - Resolution: Unresolved Not using Mesos anymore, so closing > Mesos backend doesn't respect HADOOP_CONF_DIR > - > > Key: SPARK-1809 > URL: https://issues.apache.org/jira/browse/SPARK-1809 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.0.0 >Reporter: Andrew Ash > > In order to use HDFS paths without the server component, standalone mode > reads spark-env.sh and scans the HADOOP_CONF_DIR to open core-site.xml and > get the fs.default.name parameter. > This lets you use HDFS paths like: > - hdfs:///tmp/myfile.txt > instead of > - hdfs://myserver.mydomain.com:8020/tmp/myfile.txt > However as of recent 1.0.0 pre-release (hash 756c96) I had to specify HDFS > paths with the full server even though I have HADOOP_CONF_DIR still set in > spark-env.sh. The HDFS, Spark, and Mesos nodes are all co-located and > non-domain HDFS paths work fine when using the standalone mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961170#comment-15961170 ] Andrew Ash commented on SPARK-20144: This is a regression from 1.6 to the 2.x line. [~marmbrus] recommended modifying {{spark.sql.files.openCostInBytes}} as a workaround in this post: http://apache-spark-developers-list.1001551.n3.nabble.com/Sorting-within-partitions-is-not-maintained-in-parquet-td18618.html#a18627 > spark.read.parquet no longer maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin > > Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe that the > parquet file was produced with. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reorders them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so we are not sure whether this is also an issue in > 2.1. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
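A sketch of the workaround referenced in the comment above: raising {{spark.sql.files.openCostInBytes}} discourages FileSourceStrategy from packing many small Parquet files into a single partition, which in practice tends to preserve per-file ordering. The value below is an arbitrary placeholder and the path is hypothetical; this is a mitigation, not a guarantee of ordering.
{code}
// Make each file look expensive to open so files are less likely to be
// combined into shared partitions (workaround only, not an ordering contract).
spark.conf.set("spark.sql.files.openCostInBytes", (50L * 1024 * 1024 * 1024).toString)

val df = spark.read.parquet("/path/to/data")  // hypothetical path
{code}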
[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit
[ https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939072#comment-15939072 ] Andrew Ash commented on SPARK-19372: I've seen this as well on parquet files. > Code generation for Filter predicate including many OR conditions exceeds JVM > method size limit > > > Key: SPARK-19372 > URL: https://issues.apache.org/jira/browse/SPARK-19372 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi > Attachments: wide400cols.csv > > > For the attached csv file, the code below causes the exception > "org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" > grows beyond 64 KB > Code: > {code:borderStyle=solid} > val conf = new SparkConf().setMaster("local[1]") > val sqlContext = > SparkSession.builder().config(conf).getOrCreate().sqlContext > val dataframe = > sqlContext > .read > .format("com.databricks.spark.csv") > .load("wide400cols.csv") > val filter = (0 to 399) > .foldLeft(lit(false))((e, index) => > e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}")) > val filtered = dataframe.filter(filter) > filtered.show(100) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
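One workaround sometimes used for very wide OR predicates like the one above is to evaluate the condition inside a single UDF, so the generated predicate is one short call instead of hundreds of inlined comparisons. The sketch below assumes the reporter's 400 string columns and only approximates the =!= logic (null handling differs slightly); it illustrates the workaround and is not the fix for this ticket.
{code}
import org.apache.spark.sql.functions.{array, col, udf}

// True if any column value differs from the expected "column<i+1>" literal.
val anyMismatch = udf { values: Seq[String] =>
  values.zipWithIndex.exists { case (v, i) => v != s"column${i + 1}" }
}

// Pack all columns into one array column and filter through the UDF so
// codegen emits a single UDF call rather than ~400 inlined conditions.
val filtered = dataframe.filter(anyMismatch(array(dataframe.columns.map(col): _*)))
{code}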
[jira] [Updated] (SPARK-19528) external shuffle service would close while still have request from executor when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-19528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-19528: --- Description: when dynamic allocation is enabled, the external shuffle service is used for maintain the unfinished status between executors. So the external shuffle service should not close before the executor while still have request from executor. container's log: {noformat} 17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.1.1:41867 17/02/09 08:30:46 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver 17/02/09 08:30:46 INFO executor.Executor: Starting executor ID 75 on host hsx-node8 17/02/09 08:30:46 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40374. 17/02/09 08:30:46 INFO netty.NettyBlockTransferService: Server created on 40374 17/02/09 08:30:46 INFO storage.BlockManager: external shuffle service port = 7337 17/02/09 08:30:46 INFO storage.BlockManagerMaster: Trying to register BlockManager 17/02/09 08:30:46 INFO storage.BlockManagerMaster: Registered BlockManager 17/02/09 08:30:46 INFO storage.BlockManager: Registering executor with local external shuffle service. 17/02/09 08:30:51 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from hsx-node8/192.168.1.8:7337 is closed 17/02/09 08:30:51 ERROR storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds... java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task. at org.spark-project.guava.base.Throwables.propagate(Throwables.java:160) at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278) at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:144) at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:215) at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:201) at org.apache.spark.executor.Executor.(Executor.scala:86) at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:83) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task. at org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276) at org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96) at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:274) ... 14 more 17/02/09 08:31:01 ERROR storage.BlockManager: Failed to connect to external shuffle server, will retry 1 more times after waiting 5 seconds... 
{noformat} nodemanager's log: {noformat} 2017-02-09 08:30:48,836 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1486564603520_0097_01_05] 2017-02-09 08:31:12,122 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1486564603520_0096_01_71 is : 1 2017-02-09 08:31:12,122 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1486564603520_0096_01_71 and exit code: 1 ExitCodeException exitCode=1: at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) at org.apache.hadoop.util.Shell.run(Shell.java:456) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.ut
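For context on the configuration being described (executors registering with a node-local shuffle service on port 7337), the usual pairing of settings looks roughly like the sketch below; the values are illustrative placeholders, not the reporter's actual configuration.
{code}
// Dynamic allocation relies on the external shuffle service to keep serving
// shuffle files after executors are released.
val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.port", "7337")  // matches the port in the logs above
{code}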
[jira] [Updated] (SPARK-20001) Support PythonRunner executing inside a Conda env
[ https://issues.apache.org/jira/browse/SPARK-20001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-20001: --- Description: Similar to SPARK-13587, I'm trying to allow the user to configure a Conda environment that PythonRunner will run from. This change remembers theconda environment found on the driver and installs the same packages on the executor side, only once per PythonWorkerFactory. The list of requested conda packages are added to the PythonWorkerFactory cache, so two collects using the same environment (incl packages) can re-use the same running executors. You have to specify outright what packages and channels to "bootstrap" the environment with. However, SparkContext (as well as JavaSparkContext & the pyspark version) are expanded to support addCondaPackage and addCondaChannel. Rationale is: * you might want to add more packages once you're already running in the driver * you might want to add a channel which requires some token for authentication, which you don't yet have access to until the module is already running This issue requires that the conda binary is already available on the driver as well as executors, you just have to specify where it can be found. Please see the attached pull request on palantir/spark for additional details: https://github.com/palantir/spark/pull/115 As for tests, there is a local python test, as well as yarn client & cluster-mode tests, which ensure that a newly installed package is visible from both the driver and the executor. was: Similar to SPARK-13587, I'm trying to allow the user to configure a Conda environment that PythonRunner will run from. This change remembers theconda environment found on the driver and installs the same packages on the executor side, only once per PythonWorkerFactory. The list of requested conda packages are added to the PythonWorkerFactory cache, so two collects using the same environment (incl packages) can re-use the same running executors. You have to specify outright what packages and channels to "bootstrap" the environment with. However, SparkContext (as well as JavaSparkContext & the pyspark version) are expanded to support addCondaPackage and addCondaChannel. Rationale is: * you might want to add more packages once you're already running in the driver * you might want to add a channel which requires some token for authentication, which you don't yet have access to until the module is already running This issue requires that the conda binary is already available on the driver as well as executors, you just have to specify where it can be found. Please see the attached issue on palantir/spark for additional details: https://github.com/palantir/spark/pull/115 As for tests, there is a local python test, as well as yarn client & cluster-mode tests, which ensure that a newly installed package is visible from both the driver and the executor. > Support PythonRunner executing inside a Conda env > - > > Key: SPARK-20001 > URL: https://issues.apache.org/jira/browse/SPARK-20001 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 2.2.0 >Reporter: Dan Sanduleac > Original Estimate: 168h > Remaining Estimate: 168h > > Similar to SPARK-13587, I'm trying to allow the user to configure a Conda > environment that PythonRunner will run from. > This change remembers theconda environment found on the driver and installs > the same packages on the executor side, only once per PythonWorkerFactory. 
> The list of requested conda packages are added to the PythonWorkerFactory > cache, so two collects using the same environment (incl packages) can re-use > the same running executors. > You have to specify outright what packages and channels to "bootstrap" the > environment with. > However, SparkContext (as well as JavaSparkContext & the pyspark version) are > expanded to support addCondaPackage and addCondaChannel. > Rationale is: > * you might want to add more packages once you're already running in the > driver > * you might want to add a channel which requires some token for > authentication, which you don't yet have access to until the module is > already running > This issue requires that the conda binary is already available on the driver > as well as executors, you just have to specify where it can be found. > Please see the attached pull request on palantir/spark for additional > details: https://github.com/palantir/spark/pull/115 > As for tests, there is a local python test, as well as yarn client & > cluster-mode tests, which ensure that a newly installed package is visible > from both the driver and the executor. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --
[jira] [Updated] (SPARK-20001) Support PythonRunner executing inside a Conda env
[ https://issues.apache.org/jira/browse/SPARK-20001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-20001: --- Description: Similar to SPARK-13587, I'm trying to allow the user to configure a Conda environment that PythonRunner will run from. This change remembers theconda environment found on the driver and installs the same packages on the executor side, only once per PythonWorkerFactory. The list of requested conda packages are added to the PythonWorkerFactory cache, so two collects using the same environment (incl packages) can re-use the same running executors. You have to specify outright what packages and channels to "bootstrap" the environment with. However, SparkContext (as well as JavaSparkContext & the pyspark version) are expanded to support addCondaPackage and addCondaChannel. Rationale is: * you might want to add more packages once you're already running in the driver * you might want to add a channel which requires some token for authentication, which you don't yet have access to until the module is already running This issue requires that the conda binary is already available on the driver as well as executors, you just have to specify where it can be found. Please see the attached issue on palantir/spark for additional details: https://github.com/palantir/spark/pull/115 As for tests, there is a local python test, as well as yarn client & cluster-mode tests, which ensure that a newly installed package is visible from both the driver and the executor. was: Similar to SPARK-13587, I'm trying to allow the user to configure a Conda environment that PythonRunner will run from. This change remembers theconda environment found on the driver and installs the same packages on the executor side, only once per PythonWorkerFactory. The list of requested conda packages are added to the PythonWorkerFactory cache, so two collects using the same environment (incl packages) can re-use the same running executors. You have to specify outright what packages and channels to "bootstrap" the environment with. However, SparkContext (as well as JavaSparkContext & the pyspark version) are expanded to support addCondaPackage and addCondaChannel. Rationale is: * you might want to add more packages once you're already running in the driver * you might want to add a channel which requires some token for authentication, which you don't yet have access to until the module is already running This issue requires that the conda binary is already available on the driver as well as executors, you just have to specify where it can be found. Please see the attached issue on palantir/spark for additional details. As for tests, there is a local python test, as well as yarn client & cluster-mode tests, which ensure that a newly installed package is visible from both the driver and the executor. > Support PythonRunner executing inside a Conda env > - > > Key: SPARK-20001 > URL: https://issues.apache.org/jira/browse/SPARK-20001 > Project: Spark > Issue Type: New Feature > Components: PySpark, Spark Core >Affects Versions: 2.2.0 >Reporter: Dan Sanduleac > Original Estimate: 168h > Remaining Estimate: 168h > > Similar to SPARK-13587, I'm trying to allow the user to configure a Conda > environment that PythonRunner will run from. > This change remembers theconda environment found on the driver and installs > the same packages on the executor side, only once per PythonWorkerFactory. 
> The list of requested conda packages are added to the PythonWorkerFactory > cache, so two collects using the same environment (incl packages) can re-use > the same running executors. > You have to specify outright what packages and channels to "bootstrap" the > environment with. > However, SparkContext (as well as JavaSparkContext & the pyspark version) are > expanded to support addCondaPackage and addCondaChannel. > Rationale is: > * you might want to add more packages once you're already running in the > driver > * you might want to add a channel which requires some token for > authentication, which you don't yet have access to until the module is > already running > This issue requires that the conda binary is already available on the driver > as well as executors, you just have to specify where it can be found. > Please see the attached issue on palantir/spark for additional details: > https://github.com/palantir/spark/pull/115 > As for tests, there is a local python test, as well as yarn client & > cluster-mode tests, which ensure that a newly installed package is visible > from both the driver and the executor. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...
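A hypothetical usage sketch of the API proposed in this ticket. {{addCondaPackage}} and {{addCondaChannel}} exist only in the linked palantir/spark branch, not in upstream Spark, so the calls below illustrate the proposal rather than a working upstream example; the channel URL and package name are placeholders.
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf())

// Per the ticket's rationale, channels and packages can also be added once
// the driver is already running, e.g. after obtaining an auth token for a
// private channel. Both methods below are proposed, not upstream APIs.
sc.addCondaChannel("https://conda.example.com/t/<token>/my-channel")
sc.addCondaPackage("scikit-learn")
{code}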
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929591#comment-15929591 ] Andrew Ash commented on SPARK-18278: As an update on this ticket: For those not already aware, work on native Spark integration with Kubernetes has been proceeding for the past several months in this repo https://github.com/apache-spark-on-k8s/spark in the {{branch-2.1-kubernetes}} branch, based off the 2.1.0 Apache release. We have an active core of about a half dozen contributors to the project with a wider group observing of about another dozen. Communication happens through the issues on the GitHub repo, a dedicated room in the Kubernetes Slack, and weekly video conferences hosted by the Kubernetes Big Data SIG. The full patch set is currently about 5500 lines, with about 500 of that as user/dev documentation. Infrastructure-wise, we have a cloud-hosted CI Jenkins instance set up donated by project members, which is running both unit tests and Kubernetes integration tests over the code. We recently entered a code freeze for our release branch and are preparing a first release to the wider community, which we plan to announce on the general Spark users list. It includes the completed "phase one" portion of the design doc shared a few months ago (https://docs.google.com/document/d/1_bBzOZ8rKiOSjQg78DXOA3ZBIo_KkDJjqxVuq0yXdew/edit#heading=h.fua3ml5mcolt), featuring cluster mode with static allocation of executors, submission of local resources, SSL throughout, and support for JVM languages (Java/Scala). After that release we'll be continuing to stabilize and improve the phase one feature set and move into a second phase of kubernetes work. It will likely be focused on support for dynamic allocation, though we haven't finalized planning for phase two yet. Working on the pluggable scheduler in SPARK-19700 may be included as well. Interested parties are of course welcome to watch the repo, join the weekly video conferences, give the code a shot, and contribute to the project! > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893222#comment-15893222 ] Andrew Ash commented on SPARK-18113: We discovered another bug related to committing that causes task deadloop and have work being done in SPARK-19631 to fix it. > Sending AskPermissionToCommitOutput failed, driver enter into task deadloop > --- > > Key: SPARK-18113 > URL: https://issues.apache.org/jira/browse/SPARK-18113 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 > Environment: # cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) >Reporter: xuqing >Assignee: jin xing > Fix For: 2.2.0 > > > Executor sends *AskPermissionToCommitOutput* to driver failed, and retry > another sending. Driver receives 2 AskPermissionToCommitOutput messages and > handles them. But executor ignores the first response(true) and receives the > second response(false). The TaskAttemptNumber for this partition in > authorizedCommittersByStage is locked forever. Driver enters into infinite > loop. > h4. Driver Log: > {noformat} > 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID > 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 0 > ... > 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 0 > ... > 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID > 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 1 > 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 1 > ... > 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 > (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > {noformat} > h4. Executor Log: > {noformat} > ... > 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110) > ... > 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = > AskPermissionToCommitOutput(2,24,0)] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95) > at > org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73) > at > org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$Def
[jira] [Commented] (SPARK-7354) Flaky test: o.a.s.deploy.SparkSubmitSuite --jars
[ https://issues.apache.org/jira/browse/SPARK-7354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881593#comment-15881593 ] Andrew Ash commented on SPARK-7354: --- We saw a flake for this test in the k8s repo's Travis builds too: https://github.com/apache-spark-on-k8s/spark/issues/110#issuecomment-281837162 > Flaky test: o.a.s.deploy.SparkSubmitSuite --jars > > > Key: SPARK-7354 > URL: https://issues.apache.org/jira/browse/SPARK-7354 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 1.6.0, 2.0.0 >Reporter: Tathagata Das >Assignee: Andrew Or >Priority: Critical > Labels: flaky-test > > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2271/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865778#comment-15865778 ] Andrew Ash commented on SPARK-18113: [~xukun] the scenario you describe should be accommodated by the newly-added {{spark.scheduler.outputcommitcoordinator.maxwaittime}} in that PR, which sets a timeout for how long an executor can hold the commit lock for a task. If the executor fails while holding the lock, then after that period of time the lock will be released and a subsequent executor will be able to commit the task. By default right now it is 2min. > Sending AskPermissionToCommitOutput failed, driver enter into task deadloop > --- > > Key: SPARK-18113 > URL: https://issues.apache.org/jira/browse/SPARK-18113 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 > Environment: # cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) >Reporter: xuqing >Assignee: jin xing > Fix For: 2.2.0 > > > Executor sends *AskPermissionToCommitOutput* to driver failed, and retry > another sending. Driver receives 2 AskPermissionToCommitOutput messages and > handles them. But executor ignores the first response(true) and receives the > second response(false). The TaskAttemptNumber for this partition in > authorizedCommittersByStage is locked forever. Driver enters into infinite > loop. > h4. Driver Log: > {noformat} > 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID > 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 0 > ... > 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 0 > ... > 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID > 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 1 > 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 1 > ... > 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 > (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > {noformat} > h4. Executor Log: > {noformat} > ... > 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110) > ... > 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = > AskPermissionToCommitOutput(2,24,0)] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95) > at > org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73) > at > org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > second
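For readers hitting the timeout in the quoted stack trace, the relevant knobs look roughly like the sketch below. {{spark.rpc.askTimeout}} is a standard Spark setting; {{spark.scheduler.outputcommitcoordinator.maxwaittime}} comes only from the in-progress patch discussed in this thread and is not in upstream releases. Values are placeholders.
{code}
val conf = new org.apache.spark.SparkConf()
  // Standard setting: how long an executor waits for the driver's reply
  // (e.g. to AskPermissionToCommitOutput) before the RPC times out.
  .set("spark.rpc.askTimeout", "120s")
  // From the proposed patch only (not upstream): how long the coordinator
  // holds a task attempt's commit lock before releasing it to another attempt.
  .set("spark.scheduler.outputcommitcoordinator.maxwaittime", "2m")
{code}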
[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865536#comment-15865536 ] Andrew Ash commented on SPARK-18113: Thanks for the updates you both. I've been working with a coworker who's seeing this on a cluster of his and we think the deadloop is caused by an interaction between commit authorization and preemption (his cluster has lots of preemption occurring). You can see the in-progress patch we've been working on at https://github.com/palantir/spark/pull/94 which adds a two-phase commit sequence to the OutputCommitCoordinator. We're not done testing it yet but it might provide a useful example for discussion here. Maybe also we should file another ticket for the deadloop in preemption scenario, which is a little different from the deadloop in lost message scenario (what I think the already-merged PR fixes). > Sending AskPermissionToCommitOutput failed, driver enter into task deadloop > --- > > Key: SPARK-18113 > URL: https://issues.apache.org/jira/browse/SPARK-18113 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 > Environment: # cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) >Reporter: xuqing >Assignee: jin xing > Fix For: 2.2.0 > > > Executor sends *AskPermissionToCommitOutput* to driver failed, and retry > another sending. Driver receives 2 AskPermissionToCommitOutput messages and > handles them. But executor ignores the first response(true) and receives the > second response(false). The TaskAttemptNumber for this partition in > authorizedCommittersByStage is locked forever. Driver enters into infinite > loop. > h4. Driver Log: > {noformat} > 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID > 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 0 > ... > 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 0 > ... > 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID > 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 1 > 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 1 > ... > 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 > (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > {noformat} > h4. Executor Log: > {noformat} > ... > 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110) > ... > 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = > AskPermissionToCommitOutput(2,24,0)] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95) > at > org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73) > at > org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279) > at > java.util.concurrent.Thre
[jira] [Commented] (SPARK-19493) Remove Java 7 support
[ https://issues.apache.org/jira/browse/SPARK-19493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861576#comment-15861576 ] Andrew Ash commented on SPARK-19493: +1 -- we're removing Java 7 compatibility from core internal libraries run in Spark and rarely encounter clusters running with Java 7 anymore. > Remove Java 7 support > - > > Key: SPARK-19493 > URL: https://issues.apache.org/jira/browse/SPARK-19493 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > Spark deprecated Java 7 support in 2.0, and the goal of the ticket is to > officially remove Java 7 support in 2.2 or 2.3. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11471) Improve the way that we plan shuffled join
[ https://issues.apache.org/jira/browse/SPARK-11471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836547#comment-15836547 ] Andrew Ash commented on SPARK-11471: [~yhuai] I'm interested in helping make progress on this -- it's causing pretty significant slowness on a SQL query I have. Have you had any more thoughts on this since you first filed? For my slow query, running the same SQL on another system (Teradata) completes in a few minutes whereas in Spark SQL it eventually fails after 20h+ of computation. By breaking up the query into subqueries (forcing joins to happen in a different order) we can get the Spark SQL execution down to about 15min. The query joins 8 tables together with a number of where clauses, and the join tree ends up doing a large number of seemingly "extra" shuffles. At the bottom of the join tree, tables A and B are both shuffled to the same distribution and then joined. Then when table C comes in (the next table) it's shuffled to a new distribution and then the result of (A+B) is reshuffled to match the new table C distribution. Potentially instead table C would be shuffled to match the distribution of (A+B) so that (A+B) doesn't have to be reshuffled. The general pattern here seems to be that N-Way joins require N + N-1 = 2N-1 shuffles. Potentially some of those shuffles could be eliminated with more intelligent join ordering and/or distribution selection. What do you think? > Improve the way that we plan shuffled join > -- > > Key: SPARK-11471 > URL: https://issues.apache.org/jira/browse/SPARK-11471 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > Right now, when adaptive query execution is enabled, in most of cases, we > will shuffle input tables for every join. However, once we finish our work of > https://issues.apache.org/jira/browse/SPARK-10665, we will be able to have a > global on the input datasets of a stage. Then, we should be able to add > exchange coordinators after we get the entire physical plan (after the phase > that we add Exchanges). > I will try to fill in more information later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
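A hedged illustration of the workaround described in this comment (forcing a particular join order by composing the joins explicitly instead of leaving the ordering to the planner); the table and column names are hypothetical and the snippet only sketches the idea of joining the most selective pair first.
{code}
// Hypothetical tables a, b, c sharing a join key. Joining A and B first and
// then bringing in C pins the join order, which is what splitting the SQL
// query into subqueries achieved in the report above.
val ab  = a.join(b, Seq("key"))   // A and B are shuffled to a common distribution
val abc = ab.join(c, Seq("key"))  // C is shuffled next; ideally the A+B result
                                  // keeps its partitioning instead of reshuffling
{code}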
[jira] [Updated] (SPARK-19213) FileSourceScanExec uses SparkSession from HadoopFsRelation creation time instead of the active session at execution time
[ https://issues.apache.org/jira/browse/SPARK-19213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-19213: --- Summary: FileSourceScanExec uses SparkSession from HadoopFsRelation creation time instead of the active session at execution time (was: FileSourceScanExec usese sparksession from hadoopfsrelation creation time instead of the one active at time of execution) > FileSourceScanExec uses SparkSession from HadoopFsRelation creation time > instead of the active session at execution time > > > Key: SPARK-19213 > URL: https://issues.apache.org/jira/browse/SPARK-19213 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Robert Kruszewski > > If you look at > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L260 > you'll notice that the sparksession used for execution is the one that was > captured from logicalplan. Whereas in other places you have > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L154 > and SparkPlan captures active session upon execution in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L52 > From my understanding of the code it looks like we should be using the > sparksession that is currently active hence take the one from spark plan. > However, in case you want share Datasets across SparkSessions that is not > enough since as soon as dataset is executed the queryexecution will have > capture spark session at that point. If we want to share datasets across > users we need to make configurations not fixed upon first execution. I > consider 1st part (using sparksession from logical plan) a bug while the > second (using sparksession active at runtime) an enhancement so that sharing > across sessions is made easier. > For example: > {code} > val df = spark.read.parquet(...) > df.count() > val newSession = spark.newSession() > SparkSession.setActiveSession(newSession) > // (simplest one to try is disable > vectorized reads) > val df2 = Dataset.ofRows(newSession, df.logicalPlan) // logical plan still > holds reference to original sparksession and changes don't take effect > {code} > I suggest that it shouldn't be necessary to create a new dataset for changes > to take effect. For most of the plans doing Dataset.ofRows work but this is > not the case for hadoopfsrelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15812819#comment-15812819 ] Andrew Ash commented on SPARK-18113: I've done some more diagnosis on an example I saw, and think there's a failure mode that https://issues.apache.org/jira/browse/SPARK-8029 didn't consider. Here's the sequence of steps I think causes this issue: - executor requests a task commit - coordinator approves commit and sends response message - executor is preempted by YARN! - response message goes nowhere - OCC has attempt=0 fixed for that stage/partition now so no other attempt will succeed :( Are you running in YARN with preemption enabled? Is there preemption activity around the time of the task deadloop? > Sending AskPermissionToCommitOutput failed, driver enter into task deadloop > --- > > Key: SPARK-18113 > URL: https://issues.apache.org/jira/browse/SPARK-18113 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 > Environment: # cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) >Reporter: xuqing > > Executor sends *AskPermissionToCommitOutput* to driver failed, and retry > another sending. Driver receives 2 AskPermissionToCommitOutput messages and > handles them. But executor ignores the first response(true) and receives the > second response(false). The TaskAttemptNumber for this partition in > authorizedCommittersByStage is locked forever. Driver enters into infinite > loop. > h4. Driver Log: > {noformat} > 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID > 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 0 > ... > 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 0 > ... > 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID > 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 1 > 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 1 > ... > 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 > (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > {noformat} > h4. Executor Log: > {noformat} > ... > 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110) > ... > 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = > AskPermissionToCommitOutput(2,24,0)] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95) > at > org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73) > at > org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) >
[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15811041#comment-15811041 ] Andrew Ash commented on SPARK-18113: Thanks for sending in that PR [~jinxing6...@126.com]! It's very similar to the one I've also been testing -- see https://github.com/palantir/spark/pull/79 for that diff (I haven't sent a PR into Apache yet). Has that fixed the problem for you? > Sending AskPermissionToCommitOutput failed, driver enter into task deadloop > --- > > Key: SPARK-18113 > URL: https://issues.apache.org/jira/browse/SPARK-18113 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 > Environment: # cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) >Reporter: xuqing > > Executor sends *AskPermissionToCommitOutput* to driver failed, and retry > another sending. Driver receives 2 AskPermissionToCommitOutput messages and > handles them. But executor ignores the first response(true) and receives the > second response(false). The TaskAttemptNumber for this partition in > authorizedCommittersByStage is locked forever. Driver enters into infinite > loop. > h4. Driver Log: > {noformat} > 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID > 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 0 > ... > 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 0 > ... > 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID > 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 1 > 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 1 > ... > 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 > (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > {noformat} > h4. Executor Log: > {noformat} > ... > 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110) > ... > 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = > AskPermissionToCommitOutput(2,24,0)] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95) > at > org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73) > at > org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.s
[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802102#comment-15802102 ] Andrew Ash commented on SPARK-18113: [~xq2005] can you please send a PR to https://github.com/apache/spark with your proposed change? I think I'm seeing the same issue and would like to see the code diff you're suggesting so I can test the fix. Thanks! > Sending AskPermissionToCommitOutput failed, driver enter into task deadloop > --- > > Key: SPARK-18113 > URL: https://issues.apache.org/jira/browse/SPARK-18113 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 > Environment: # cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) >Reporter: xuqing > > Executor sends *AskPermissionToCommitOutput* to driver failed, and retry > another sending. Driver receives 2 AskPermissionToCommitOutput messages and > handles them. But executor ignores the first response(true) and receives the > second response(false). The TaskAttemptNumber for this partition in > authorizedCommittersByStage is locked forever. Driver enters into infinite > loop. > h4. Driver Log: > {noformat} > 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID > 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 0 > ... > 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 0 > ... > 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID > 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 1 > 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 1 > ... > 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 > (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > {noformat} > h4. Executor Log: > {noformat} > ... > 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110) > ... > 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = > AskPermissionToCommitOutput(2,24,0)] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95) > at > org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73) > at > org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > sc
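The lock-out described above comes from the driver-side commit arbitration: the first attempt that asks to commit a partition is recorded as the authorized committer, and every other attempt is denied, even when the reply to the winning attempt was lost to the RPC timeout. A simplified sketch of that decision (names are paraphrased for illustration, not the exact OutputCommitCoordinator source):
{code:scala}
// Simplified, paraphrased sketch of the driver-side commit arbitration that
// can wedge a partition; names are illustrative, not the real Spark fields.
import scala.collection.mutable

class CommitArbiter {
  // stage -> (partition -> attempt number that was granted the commit)
  private val authorizedCommitters = mutable.Map[Int, mutable.Map[Int, Int]]()

  def canCommit(stage: Int, partition: Int, attempt: Int): Boolean = synchronized {
    val byPartition = authorizedCommitters.getOrElseUpdate(stage, mutable.Map[Int, Int]())
    byPartition.get(partition) match {
      case None =>
        byPartition(partition) = attempt // first asker wins the commit
        true
      case Some(winner) =>
        winner == attempt // every other attempt is denied
    }
  }
}
{code}
If the executor never receives the "true" reply for the winning attempt (because the ask timed out and was retried), the winner stays recorded on the driver, so every subsequent retry of that partition is denied and the scheduler keeps relaunching it.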
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752907#comment-15752907 ] Andrew Ash commented on SPARK-18278: There are definitely challenges in building features that take longer than a release cycle (quarterly for Spark). We could maintain a long-running feature branch for spark-k8s that lasts several months and then gets merged into Spark in a big-bang merge, with that feature branch living either on apache/spark or in some other community-accessible repo. I don't think there are many practical differences between hosting the source in apache/spark and hosting it in a different repo if neither is part of Apache releases. Or we could merge many smaller commits for spark-k8s into the apache/spark master branch along the way and release as an experimental feature when release time comes. This enables more continuous code review but has the risk of destabilizing the master branch if code reviews miss things. Looking to past instances of large features spanning multiple release cycles (like SparkSQL and YARN integration), both of those had work happening primarily in-repo from what I can tell, and releases included large disclaimers in release notes for those experimental features. That precedent seems to suggest Kubernetes integration should follow a similar path. Personally I lean towards the approach of many smaller commits into master rather than a long-running feature branch. By code reviewing PRs into the main repo as we go, the feature will be easier to review and will also get wider feedback as an experimental feature than a side branch or side repo would get. This also serves to include Apache committers from the start in understanding the codebase, rather than foisting a foreign codebase onto the project and hoping committers grok it well enough to hold the line on high-quality code reviews. Looking to the future where Kubernetes integration is potentially included in the mainline apache release (like Mesos and YARN), it's best for contributors and committers to work together from the start for shared understanding. Making an API for third-party cluster managers sounds great and seems like the easy, clean choice from a software engineering point of view, but I wonder how much practical value a pluggable cluster manager actually brings the Apache project. It seems like both Two Sigma and IBM have been able to maintain their proprietary schedulers without the benefits of the API we're considering building. Who / what workflows are we aiming to support with an API? > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17119) Add configuration property to allow the history server to delete .inprogress files
[ https://issues.apache.org/jira/browse/SPARK-17119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752439#comment-15752439 ] Andrew Ash commented on SPARK-17119: +1, I would use this feature > Add configuration property to allow the history server to delete .inprogress > files > -- > > Key: SPARK-17119 > URL: https://issues.apache.org/jira/browse/SPARK-17119 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Bjorn Jonsson >Priority: Minor > Labels: historyserver > > The History Server (HS) currently only considers completed applications when > deleting event logs from spark.history.fs.logDirectory (since SPARK-6879). > This means that over time, .inprogress files (from failed jobs, jobs where > the SparkContext is not closed, spark-shell exits etc...) can accumulate and > impact the HS. > Instead of having to manually delete these files, maybe users could have the > option of telling the HS to delete all files where (now - > attempt.lastUpdated) > spark.history.fs.cleaner.maxAge, or just delete > .inprogress files with lastUpdated older than 7d? > https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L467 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
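A minimal sketch of the check being proposed, assuming a per-attempt last-updated timestamp is available; the {{AttemptInfo}} record below is a hypothetical stand-in, not the actual FsHistoryProvider type:
{code:scala}
// Illustrative sketch of the proposed cleanup predicate, not existing Spark
// code: delete .inprogress logs whose attempt has gone stale for longer than
// spark.history.fs.cleaner.maxAge. AttemptInfo is a hypothetical stand-in.
import org.apache.spark.SparkConf

case class AttemptInfo(inProgress: Boolean, lastUpdatedMs: Long)

def shouldDelete(conf: SparkConf, attempt: AttemptInfo, nowMs: Long): Boolean = {
  val maxAgeMs = conf.getTimeAsSeconds("spark.history.fs.cleaner.maxAge", "7d") * 1000L
  attempt.inProgress && (nowMs - attempt.lastUpdatedMs) > maxAgeMs
}
{code}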
[jira] [Updated] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-18113: --- Description: Executor sends *AskPermissionToCommitOutput* to driver failed, and retry another sending. Driver receives 2 AskPermissionToCommitOutput messages and handles them. But executor ignores the first response(true) and receives the second response(false). The TaskAttemptNumber for this partition in authorizedCommittersByStage is locked forever. Driver enters into infinite loop. h4. Driver Log: {noformat} 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) ... 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, partition: 24, attemptNumber: 0 ... 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, stage: 2, partition: 24, attempt: 0 ... 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) ... 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, partition: 24, attemptNumber: 1 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, stage: 2, partition: 24, attempt: 1 ... 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) ... {noformat} h4. Executor Log: {noformat} ... 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110) ... 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = AskPermissionToCommitOutput(2,24,0)] in 1 attempts org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. 
This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) at org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95) at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73) at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.lang.Thread.run(Thread.java:785) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) ... 13 more ... 16/10/25 05:39:16 INFO Executor: Running task 24.1 in stage 2.0 (TID 119) ... 16/10/25 05:39:24 INFO SparkHadoopMapRedUtil: attempt_201610250536_0002_m_24_119: Not committed because the driver did not authorize commit ... {noformat} was: Executor sends *AskPermissionToCommitOutput* to driver failed, and retry another sending. Driver receives 2 AskPermissionToCommitOutput messages and handles them. But executor ignores the first response(true) and receives the second response(false). The TaskAttemptNumber for this partition in authorizedCommittersByStage is locked forever. Driver enters into infinite loop. h4. Driver Log: 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24
[jira] [Updated] (SPARK-17664) Failed to saveAsHadoop when speculate is enabled
[ https://issues.apache.org/jira/browse/SPARK-17664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-17664: --- Description: >From follow logs, task 22 has failed 4 times because of "the driver did not >authorize commit". But the strange thing was that I could't find task 22.1. >Why? Maybe some synchronization error? {noformat} 16/09/26 02:14:18 INFO TaskSetManager: Lost task 22.0 in stage 1856.0 (TID 953902) on executor 10.196.131.13: java.security.PrivilegedActionException (null) [duplicate 4] 16/09/26 02:14:18 INFO TaskSetManager: Marking task 22 in stage 1856.0 (on 10.196.131.13) as speculatable because it ran more than 5601 ms 16/09/26 02:14:18 INFO TaskSetManager: Starting task 22.2 in stage 1856.0 (TID 954074, 10.215.143.14, partition 22,PROCESS_LOCAL, 2163 bytes) 16/09/26 02:14:18 INFO TaskSetManager: Lost task 22.2 in stage 1856.0 (TID 954074) on executor 10.215.143.14: java.security.PrivilegedActionException (null) [duplicate 5] 16/09/26 02:14:18 INFO TaskSetManager: Marking task 22 in stage 1856.0 (on 10.196.131.13) as speculatable because it ran more than 5601 ms 16/09/26 02:14:18 INFO TaskSetManager: Starting task 22.3 in stage 1856.0 (TID 954075, 10.196.131.28, partition 22,PROCESS_LOCAL, 2163 bytes) 16/09/26 02:14:19 INFO TaskSetManager: Lost task 22.3 in stage 1856.0 (TID 954075) on executor 10.196.131.28: java.security.PrivilegedActionException (null) [duplicate 6] 16/09/26 02:14:19 INFO TaskSetManager: Marking task 22 in stage 1856.0 (on 10.196.131.13) as speculatable because it ran more than 5601 ms 16/09/26 02:14:19 INFO TaskSetManager: Starting task 22.4 in stage 1856.0 (TID 954076, 10.215.153.225, partition 22,PROCESS_LOCAL, 2163 bytes) 16/09/26 02:14:19 INFO TaskSetManager: Lost task 22.4 in stage 1856.0 (TID 954076) on executor 10.215.153.225: java.security.PrivilegedActionException (null) [duplicate 7] 16/09/26 02:14:19 ERROR TaskSetManager: Task 22 in stage 1856.0 failed 4 times; aborting job 16/09/26 02:14:19 INFO YarnClusterScheduler: Cancelling stage 1856 16/09/26 02:14:19 INFO YarnClusterScheduler: Stage 1856 was cancelled 16/09/26 02:14:19 INFO DAGScheduler: ResultStage 1856 (saveAsHadoopFile at TDWProvider.scala:514) failed in 23.049 s 16/09/26 02:14:19 INFO DAGScheduler: Job 76 failed: saveAsHadoopFile at TDWProvider.scala:514, took 69.865181 s 16/09/26 02:14:19 ERROR ApplicationMaster: User class threw exception: java.security.PrivilegedActionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 1856.0 failed 4 times, most recent failure: Lost task 22.4 in stage 1856.0 (TID 954076, 10.215.153.225): java.security.PrivilegedActionException: org.apache.spark.executor.CommitDeniedException: attempt_201609260213_1856_m_22_954076: Not committed because the driver did not authorize commit at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:356) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1723) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1284) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1282) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: org.apache.spark.executor.CommitDeniedException: attempt_201609260213_1856_m_22_954076: Not committed because the driver did not authorize commit at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:135) at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:142) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anon$4.run(PairRDDFunctions.scala:1311) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anon$4.run(PairRDDFunctions.scala:1284) ... 11 more {noformat} was: >From follow logs, task 22 has failed 4 times because of "the driver did not >authorize commit". But the strange thing was that I could't find task 22.1. >Why? Maybe some synchronization error? 16/09/26 02:14:18 INFO TaskSetManager: Lost task 22.0 in stage 1856.0 (TID 953902) on executor 10.196.131.13: java.security.PrivilegedActionException (null) [duplicate 4] 16/09/26 02:14:18 INFO TaskSetMana
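One workaround worth noting (it is not part of the original report and does not fix the underlying commit-coordination bug) is to disable speculative execution for jobs that write through the Hadoop committer, so duplicate attempts never race for the commit:
{noformat}
# Possible workaround, not a fix: avoid speculative duplicate attempts racing
# for the output commit.
spark.speculation  false
{noformat}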
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15731562#comment-15731562 ] Andrew Ash commented on SPARK-18278: [~rxin] is it a problem for ASF projects to publish docker images? Have other Apache projects done this before, or would Spark be the first one? > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executors lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18499) Add back support for custom Spark SQL dialects
[ https://issues.apache.org/jira/browse/SPARK-18499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675714#comment-15675714 ] Andrew Ash commented on SPARK-18499: Specifically what I'm most interested in is a strict ANSI SQL dialect, not bending Spark SQL to support a proprietary dialect. > Add back support for custom Spark SQL dialects > -- > > Key: SPARK-18499 > URL: https://issues.apache.org/jira/browse/SPARK-18499 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Ash > > Point 5 from the parent task: > {quote} > 5. I want to be able to use my own customized SQL constructs. An example of > this would be supporting my own dialect, or being able to add constructs to the > current SQL language. I should not have to implement a complete parser, and > should be able to delegate to an underlying parser. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
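For context on the delegation idea: Spark later added a parser hook through {{spark.sql.extensions}} (the SparkSessionExtensions API, Spark 2.2+) that covers part of this request. A rough sketch of wiring a strict-ANSI wrapper through that hook; the class names are illustrative and the wrapper itself is elided:
{code:scala}
// Rough sketch only: register a custom parser that wraps and delegates to
// Spark's built-in ParserInterface. Returning `delegate` unchanged here is a
// placeholder for a hypothetical AnsiSqlParser(delegate) that would reject or
// rewrite non-ANSI constructs before delegating.
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}

class AnsiSqlExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit = {
    ext.injectParser((session, delegate) => delegate)
  }
}

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.extensions", classOf[AnsiSqlExtensions].getName)
  .getOrCreate()
{code}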
[jira] [Created] (SPARK-18499) Add back support for custom Spark SQL dialects
Andrew Ash created SPARK-18499: -- Summary: Add back support for custom Spark SQL dialects Key: SPARK-18499 URL: https://issues.apache.org/jira/browse/SPARK-18499 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Andrew Ash Point 5 from the parent task: {quote} 5. I want to be able to use my own customized SQL constructs. An example of this would be supporting my own dialect, or being able to add constructs to the current SQL language. I should not have to implement a complete parser, and should be able to delegate to an underlying parser. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18448) SparkSession should implement java.lang.AutoCloseable like JavaSparkContext
Andrew Ash created SPARK-18448: -- Summary: SparkSession should implement java.lang.AutoCloseable like JavaSparkContext Key: SPARK-18448 URL: https://issues.apache.org/jira/browse/SPARK-18448 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2 Reporter: Andrew Ash https://docs.oracle.com/javase/8/docs/api/java/lang/AutoCloseable.html This makes cleaning up SparkSessions in Java easier, but may introduce a Java 8 dependency if applied directly. JavaSparkContext uses java.io.Closeable, I think to avoid this Java 8 dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
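A minimal sketch of the idea (illustrative only, not the merged patch): expose a close() that delegates to stop(), so Java callers can manage the session with a try-with-resources block. Note that java.io.Closeable has extended java.lang.AutoCloseable since Java 7, so either interface enables that pattern.
{code:scala}
// Illustrative wrapper, not the actual SparkSession change: gives a session
// a close() so it can be used from Java try-with-resources or other
// resource-management helpers.
import org.apache.spark.sql.SparkSession

class CloseableSparkSession(val spark: SparkSession) extends AutoCloseable {
  override def close(): Unit = spark.stop()
}
{code}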
[jira] [Created] (SPARK-17874) Enabling SSL on HistoryServer should only open one port not two
Andrew Ash created SPARK-17874: -- Summary: Enabling SSL on HistoryServer should only open one port not two Key: SPARK-17874 URL: https://issues.apache.org/jira/browse/SPARK-17874 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.0.1 Reporter: Andrew Ash When turning on SSL on the HistoryServer with {{spark.ssl.historyServer.enabled=true}} this opens up a second port, at the [hardcoded|https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L262] result of calculating {{spark.history.ui.port + 400}}, and sets up a redirect from the original (http) port to the new (https) port. {noformat} $ netstat -nlp | grep 23714 (Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.) tcp0 0 :::18080:::* LISTEN 23714/java tcp0 0 :::18480:::* LISTEN 23714/java {noformat} By enabling {{spark.ssl.historyServer.enabled}} I would have expected the one open port to change protocol from http to https, not to have 1) additional ports open 2) the http port remain open 3) the additional port at a value I didn't specify. To fix this could take one of two approaches: Approach 1: - one port always, which is configured with {{spark.history.ui.port}} - the protocol on that port is http by default - or if {{spark.ssl.historyServer.enabled=true}} then it's https Approach 2: - add a new configuration item {{spark.history.ui.sslPort}} which configures the second port that starts up In approach 1 we probably need a way to specify to Spark jobs whether the history server has ssl or not, based on SPARK-16988 That makes me think we should go with approach 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
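For concreteness, approach 2 would look roughly like this in spark-defaults.conf; {{spark.history.ui.sslPort}} is the property being proposed here, not one that exists today:
{noformat}
spark.ssl.historyServer.enabled  true
spark.history.ui.port            18080
# Proposed (hypothetical) property: pin the HTTPS port explicitly instead of
# relying on the hardcoded http port + 400.
spark.history.ui.sslPort         18480
{noformat}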
[jira] [Updated] (SPARK-17874) Additional SSL port on HistoryServer should be configurable
[ https://issues.apache.org/jira/browse/SPARK-17874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-17874: --- Summary: Additional SSL port on HistoryServer should be configurable (was: Enabling SSL on HistoryServer should only open one port not two) > Additional SSL port on HistoryServer should be configurable > --- > > Key: SPARK-17874 > URL: https://issues.apache.org/jira/browse/SPARK-17874 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.1 >Reporter: Andrew Ash > > When turning on SSL on the HistoryServer with > {{spark.ssl.historyServer.enabled=true}} this opens up a second port, at the > [hardcoded|https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L262] > result of calculating {{spark.history.ui.port + 400}}, and sets up a > redirect from the original (http) port to the new (https) port. > {noformat} > $ netstat -nlp | grep 23714 > (Not all processes could be identified, non-owned process info > will not be shown, you would have to be root to see it all.) > tcp0 0 :::18080:::* > LISTEN 23714/java > tcp0 0 :::18480:::* > LISTEN 23714/java > {noformat} > By enabling {{spark.ssl.historyServer.enabled}} I would have expected the one > open port to change protocol from http to https, not to have 1) additional > ports open 2) the http port remain open 3) the additional port at a value I > didn't specify. > To fix this could take one of two approaches: > Approach 1: > - one port always, which is configured with {{spark.history.ui.port}} > - the protocol on that port is http by default > - or if {{spark.ssl.historyServer.enabled=true}} then it's https > Approach 2: > - add a new configuration item {{spark.history.ui.sslPort}} which configures > the second port that starts up > In approach 1 we probably need a way to specify to Spark jobs whether the > history server has ssl or not, based on SPARK-16988 > That makes me think we should go with approach 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17227) Allow configuring record delimiter in csv
[ https://issues.apache.org/jira/browse/SPARK-17227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15435539#comment-15435539 ] Andrew Ash commented on SPARK-17227: Rob and I work together, and we've seen datasets in mostly-CSV format that have non-standard record delimiters ('\0' character for instance). For some broader context, we've created our own CSV text parser and use that in all our various internal products that use Spark, but would like to contribute this additional flexibility back to the Spark community at large and in the process eliminate the need for our internal CSV datasource. Here are the tickets Rob just opened that we would require to eliminate our internal CSV datasource: SPARK-17222 SPARK-17224 SPARK-17225 SPARK-17226 SPARK-17227 The basic question then, is would the Spark community accept patches that extend Spark's CSV parser to cover these features? We're willing to write the code and get the patches through code review, but would rather know up front if these changes would never be accepted into mainline Spark due to philosophical disagreements around what Spark's CSV datasource should be. > Allow configuring record delimiter in csv > - > > Key: SPARK-17227 > URL: https://issues.apache.org/jira/browse/SPARK-17227 > Project: Spark > Issue Type: Improvement >Reporter: Robert Kruszewski >Priority: Minor > > Instead of hard coded "\n" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
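To make the ask concrete, the usage we have in mind looks roughly like the sketch below; {{recordDelimiter}} is an illustrative option name for the proposal, not an option that exists in this Spark version (later releases expose a similar idea as {{lineSep}}):
{code:scala}
// Illustrative sketch only: a configurable record delimiter at the
// DataFrameReader level. "recordDelimiter" is a hypothetical option name.
// Assumes the ambient spark-shell SparkSession `spark`.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("recordDelimiter", "\u0000") // records separated by '\0' instead of '\n'
  .load("data-with-null-delimiters.csv")
{code}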
[jira] [Commented] (SPARK-17029) Dataset toJSON goes through RDD form instead of transforming dataset itself
[ https://issues.apache.org/jira/browse/SPARK-17029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418149#comment-15418149 ] Andrew Ash commented on SPARK-17029: Note RDD form usage from https://issues.apache.org/jira/browse/SPARK-10705 > Dataset toJSON goes through RDD form instead of transforming dataset itself > --- > > Key: SPARK-17029 > URL: https://issues.apache.org/jira/browse/SPARK-17029 > Project: Spark > Issue Type: Bug >Reporter: Robert Kruszewski > > No longer necessary and can be optimized with datasets -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
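One way to get the same result without leaving the Dataset API is the {{to_json}} expression (available since Spark 2.1). This is a sketch of the idea, not necessarily the shape of the eventual fix:
{code:scala}
// Sketch: build the JSON strings with to_json/struct instead of
// round-tripping through an RDD[String].
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, struct, to_json}

def toJsonViaDataset(df: DataFrame): Dataset[String] = {
  import df.sparkSession.implicits._
  df.select(to_json(struct(df.columns.map(col): _*)).as("json")).as[String]
}
{code}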
[jira] [Created] (SPARK-15104) Bad spacing in log line
Andrew Ash created SPARK-15104: -- Summary: Bad spacing in log line Key: SPARK-15104 URL: https://issues.apache.org/jira/browse/SPARK-15104 Project: Spark Issue Type: Bug Affects Versions: 1.6.1 Reporter: Andrew Ash Priority: Minor {noformat}INFO [2016-05-03 21:18:51,477] org.apache.spark.scheduler.TaskSetManager: Starting task 0.0 in stage 101.0 (TID 7029, localhost, partition 0,NODE_LOCAL, 1894 bytes){noformat} Should have a space before "NODE_LOCAL" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
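The fix is a one-character change to the interpolated log message; the sketch below paraphrases the pattern rather than quoting the exact TaskSetManager source line:
{code:scala}
// Paraphrased illustration of the missing space in the interpolation.
val partitionId = 0
val taskLocality = "NODE_LOCAL"
val before = s"partition $partitionId,$taskLocality, 1894 bytes"  // "0,NODE_LOCAL"
val after  = s"partition $partitionId, $taskLocality, 1894 bytes" // "0, NODE_LOCAL"
{code}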