[jira] [Assigned] (SPARK-23447) Cleanup codegen template for Literal
[ https://issues.apache.org/jira/browse/SPARK-23447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23447:

    Assignee: Apache Spark

> Cleanup codegen template for Literal
> ------------------------------------
>
>                 Key: SPARK-23447
>                 URL: https://issues.apache.org/jira/browse/SPARK-23447
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Kris Mok
>            Assignee: Apache Spark
>            Priority: Major
>
> Ideally, the codegen templates for {{Literal}} should emit literals in the
> {{isNull}} and {{value}} fields of {{ExprCode}} so that they can be
> effectively inlined into their use sites.
> Currently, however, there are a couple of paths where {{Literal.doGenCode()}}
> returns an {{ExprCode}} with a non-trivial {{code}} field, all of which are
> actually unnecessary.
> We can make a simple refactoring to make sure all codegen templates for
> {{Literal}} return an empty {{code}} and simple literal/constant expressions
> in {{isNull}} and {{value}}.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
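The shape of the change can be sketched as follows. This is a hypothetical Python mock of Spark's Scala {{ExprCode}} class, written for illustration only: the field names follow the ticket, but the helper functions and the generated-code text are assumptions, not Spark's actual codegen.

```python
from dataclasses import dataclass

@dataclass
class ExprCode:
    """Illustrative stand-in for Spark's codegen ExprCode (real class is Scala)."""
    code: str    # statements that must run before isNull/value are usable
    isNull: str  # expression, as source text, for the null flag
    value: str   # expression, as source text, for the value

def literal_gen_code_current(value):
    # One of the unnecessary paths: a non-empty `code` field materializes the
    # literal into a fresh variable, so use sites cannot inline it.
    var = "value_0"
    return ExprCode(code=f"int {var} = {value};", isNull="false", value=var)

def literal_gen_code_proposed(value):
    # Proposed shape: empty `code`, plain literal expressions in isNull/value,
    # which consumers can splice directly into their own generated code.
    return ExprCode(code="", isNull="false", value=str(value))
```

With the proposed shape, a consumer that sees `isNull == "false"` and a constant `value` can fold the literal into its own template with no extra statements.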
[jira] [Commented] (SPARK-23446) Explicitly check supported types in toPandas
[ https://issues.apache.org/jira/browse/SPARK-23446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366686#comment-16366686 ] Apache Spark commented on SPARK-23446:

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20625

> Explicitly check supported types in toPandas
> --------------------------------------------
>
>                 Key: SPARK-23446
>                 URL: https://issues.apache.org/jira/browse/SPARK-23446
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> See:
> {code}
> spark.conf.set("spark.sql.execution.arrow.enabled", "false")
> df = spark.createDataFrame([[bytearray("a")]])
> df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> df.toPandas()
> {code}
> {code}
>      _1
> 0  [97]
>   _1
> 0   a
> {code}
> We didn't finish binary type support in the Arrow conversion on the Python
> side. We should disallow it. The same thing applies to nested timestamps.
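A minimal sketch of the kind of explicit check the ticket asks for, in plain Python with no PySpark dependency. The function name, the type-string spelling, and the whitelist are illustrative assumptions, not Spark's actual API.

```python
# Types that the Python-side Arrow conversion does not handle correctly yet.
UNSUPPORTED_ARROW_TYPES = {"binary"}

def check_arrow_convertible(schema):
    """schema: list of (column_name, type_string) pairs; nested types are
    spelled like 'array<timestamp>'. Raises instead of silently mis-converting."""
    for name, dtype in schema:
        base = dtype.split("<", 1)[0]
        if base in UNSUPPORTED_ARROW_TYPES:
            raise TypeError(f"Unsupported type in conversion: {name}: {dtype}")
        # nested timestamps (e.g. array<timestamp>) are also not handled
        if "<" in dtype and "timestamp" in dtype:
            raise TypeError(f"Nested timestamp not supported: {name}: {dtype}")
    return True
```

The point is to fail fast with a clear error rather than return a DataFrame whose binary columns differ depending on whether Arrow was enabled.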
[jira] [Assigned] (SPARK-23446) Explicitly check supported types in toPandas
[ https://issues.apache.org/jira/browse/SPARK-23446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23446:

    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-23446) Explicitly check supported types in toPandas
[ https://issues.apache.org/jira/browse/SPARK-23446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23446:

    Assignee: Apache Spark
[jira] [Updated] (SPARK-23446) Explicitly check supported types in toPandas
[ https://issues.apache.org/jira/browse/SPARK-23446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23446:

    Summary: Explicitly check supported types in toPandas (was: Explicitly specify supported types in toPandas)
[jira] [Created] (SPARK-23447) Cleanup codegen template for Literal
Kris Mok created SPARK-23447:

             Summary: Cleanup codegen template for Literal
                 Key: SPARK-23447
                 URL: https://issues.apache.org/jira/browse/SPARK-23447
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.1, 2.2.0
            Reporter: Kris Mok

Ideally, the codegen templates for {{Literal}} should emit literals in the {{isNull}} and {{value}} fields of {{ExprCode}} so that they can be effectively inlined into their use sites.
Currently, however, there are a couple of paths where {{Literal.doGenCode()}} returns an {{ExprCode}} with a non-trivial {{code}} field, all of which are actually unnecessary.
We can make a simple refactoring to make sure all codegen templates for {{Literal}} return an empty {{code}} and simple literal/constant expressions in {{isNull}} and {{value}}.
[jira] [Created] (SPARK-23446) Explicitly specify supported types in toPandas
Hyukjin Kwon created SPARK-23446:

             Summary: Explicitly specify supported types in toPandas
                 Key: SPARK-23446
                 URL: https://issues.apache.org/jira/browse/SPARK-23446
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 2.3.0
            Reporter: Hyukjin Kwon

See:
{code}
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
df = spark.createDataFrame([[bytearray("a")]])
df.toPandas()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df.toPandas()
{code}
{code}
     _1
0  [97]
  _1
0   a
{code}
We didn't finish binary type support in the Arrow conversion on the Python side. We should disallow it. The same thing applies to nested timestamps.
[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366547#comment-16366547 ] Maxim Gekk commented on SPARK-23410:

[~sameerag] It is not a blocker anymore. I have unset the blocker flag.

> Unable to read jsons in charset different from UTF-8
> ----------------------------------------------------
>
>                 Key: SPARK-23410
>                 URL: https://issues.apache.org/jira/browse/SPARK-23410
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maxim Gekk
>            Priority: Major
>         Attachments: utf16WithBOM.json
>
> Currently the JSON parser is forced to read JSON files in UTF-8. This
> behavior breaks backward compatibility with Spark 2.2.1 and earlier versions,
> which could read JSON files in UTF-16, UTF-32 and other encodings thanks to
> the auto-detection mechanism of the Jackson library. We need to give users
> back the ability to read JSON files in a specified charset and/or to detect
> the charset automatically, as before.
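What BOM-based auto-detection buys over a hard-coded UTF-8 decode can be shown in a few lines of plain Python. This is an illustration of the mechanism only, not Spark's or Jackson's actual reader; the function name is made up.

```python
import codecs
import json

def read_json_record(raw, charset=None):
    """Decode one JSON record from bytes. With charset=None, sniff the
    byte-order mark (the kind of auto-detection Jackson performs); otherwise
    honor the caller's explicit charset."""
    if charset is None:
        if raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
            charset = "utf-16"      # the utf-16 codec consumes the BOM itself
        elif raw[:3] == codecs.BOM_UTF8:
            charset = "utf-8-sig"
        else:
            charset = "utf-8"       # the forced default the ticket complains about
    return json.loads(raw.decode(charset))

# A UTF-16 record like the attached utf16WithBOM.json (BOM included):
utf16_bytes = '{"name": "Chris"}'.encode("utf-16")
```

Forcing UTF-8 on `utf16_bytes` fails outright (the BOM byte 0xFF is not valid UTF-8), while BOM sniffing recovers the record.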
[jira] [Updated] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-23410:

    Priority: Major (was: Blocker)
[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366537#comment-16366537 ] Sameer Agarwal commented on SPARK-23410:

[~maxgekk] [~smilegator] any ETA on this? As [~hyukjin.kwon] points out, given that https://github.com/apache/spark/pull/20302 is reverted, should we still block RC4 on this?
[jira] [Assigned] (SPARK-23445) ColumnStat refactoring
[ https://issues.apache.org/jira/browse/SPARK-23445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23445:

    Assignee: (was: Apache Spark)

> ColumnStat refactoring
> ----------------------
>
>                 Key: SPARK-23445
>                 URL: https://issues.apache.org/jira/browse/SPARK-23445
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Juliusz Sompolski
>            Priority: Major
>
> Refactor ColumnStat to be more flexible.
> * Split {{ColumnStat}} and {{CatalogColumnStat}}, just like {{CatalogStatistics}} is split from {{Statistics}}. This detaches how the statistics are stored from how they are processed in the query plan. {{CatalogColumnStat}} keeps {{min}} and {{max}} as {{String}}, making it not depend on dataType information.
> * For {{CatalogColumnStat}}, parse column names from property names in the metastore (the {{KEY_VERSION}} property), not from the metastore schema. This allows the catalog to read stats into {{CatalogColumnStat}}s even if the schema itself is not in the metastore.
> * Make all fields optional. {{min}}, {{max}} and {{histogram}} for columns were optional already. Having them all optional is more consistent, and gives flexibility to e.g. drop some of the fields through transformations if they are difficult or impossible to calculate.
> The added flexibility will make it possible to have alternative implementations for stats, and separates stats collection from stats and estimation processing in plans.
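The split described above can be sketched in Python. This is a hypothetical rendering of the proposal, not the Scala implementation: every field optional, min/max stored as strings so that loading stats needs no dataType, and typed parsing deferred to plan time. Field and method names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CatalogColumnStat:
    """Catalog-side stats: all fields optional, min/max kept as strings."""
    distinct_count: Optional[int] = None
    min: Optional[str] = None
    max: Optional[str] = None
    null_count: Optional[int] = None
    histogram: Optional[list] = None

    def to_plan_stat(self, data_type):
        # Only at query-planning time, when the column's dataType is known,
        # are min/max parsed into typed values (sketched here for "int" only).
        parse = int if data_type == "int" else (lambda s: s)
        typed = lambda s: parse(s) if s is not None else None
        return {"min": typed(self.min), "max": typed(self.max)}
```

Because nothing here touches the metastore schema, such a class could be populated purely from property keys, matching the second bullet of the proposal.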
[jira] [Commented] (SPARK-23445) ColumnStat refactoring
[ https://issues.apache.org/jira/browse/SPARK-23445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366530#comment-16366530 ] Apache Spark commented on SPARK-23445:

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/20624
[jira] [Assigned] (SPARK-23445) ColumnStat refactoring
[ https://issues.apache.org/jira/browse/SPARK-23445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23445:

    Assignee: Apache Spark
[jira] [Created] (SPARK-23445) ColumnStat refactoring
Juliusz Sompolski created SPARK-23445:

             Summary: ColumnStat refactoring
                 Key: SPARK-23445
                 URL: https://issues.apache.org/jira/browse/SPARK-23445
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Juliusz Sompolski

Refactor ColumnStat to be more flexible.
* Split {{ColumnStat}} and {{CatalogColumnStat}}, just like {{CatalogStatistics}} is split from {{Statistics}}. This detaches how the statistics are stored from how they are processed in the query plan. {{CatalogColumnStat}} keeps {{min}} and {{max}} as {{String}}, making it not depend on dataType information.
* For {{CatalogColumnStat}}, parse column names from property names in the metastore (the {{KEY_VERSION}} property), not from the metastore schema. This allows the catalog to read stats into {{CatalogColumnStat}}s even if the schema itself is not in the metastore.
* Make all fields optional. {{min}}, {{max}} and {{histogram}} for columns were optional already. Having them all optional is more consistent, and gives flexibility to e.g. drop some of the fields through transformations if they are difficult or impossible to calculate.

The added flexibility will make it possible to have alternative implementations for stats, and separates stats collection from stats and estimation processing in plans.
[jira] [Commented] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage
[ https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366421#comment-16366421 ] Shixiong Zhu commented on SPARK-23433:

cc [~irashid]

> java.lang.IllegalStateException: more than one active taskSet for stage
> -----------------------------------------------------------------------
>
>                 Key: SPARK-23433
>                 URL: https://issues.apache.org/jira/browse/SPARK-23433
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.1
>            Reporter: Shixiong Zhu
>            Priority: Major
>
> The following error thrown by the DAGScheduler stopped the cluster:
> {code}
> 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 7580621: 7580621.2,7580621.1
>     at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229)
>     at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899)
>     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>     at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
[jira] [Commented] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage
[ https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366417#comment-16366417 ] Shixiong Zhu commented on SPARK-23433:

{code}
18/02/11 13:22:20 INFO TaskSetManager: Finished task 17.0 in stage 7580621.1 (TID 65577139) in 303870 ms on 10.0.246.111 (executor 24) (18/19)
18/02/11 13:22:20 INFO DAGScheduler: ShuffleMapStage 7580621 (start at command-2841337:340) finished in 303.880 s
18/02/11 13:22:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 7580621 (start at command-2841337:340) because some of its tasks had failed: 2, 15, 27, 28, 41
18/02/11 13:22:27 INFO DAGScheduler: Submitting ShuffleMapStage 7580621 (MapPartitionsRDD[2660062] at start at command-2841337:340), which has no missing parents
18/02/11 13:22:27 INFO DAGScheduler: Submitting 5 missing tasks from ShuffleMapStage 7580621 (MapPartitionsRDD[2660062] at start at command-2841337:340) (first 15 tasks are for partitions Vector(2, 15, 27, 28, 41))
18/02/11 13:22:27 INFO TaskSchedulerImpl: Adding task set 7580621.2 with 5 tasks
18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext
java.lang.IllegalStateException: more than one active taskSet for stage 7580621: 7580621.2,7580621.1
    at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229)
    at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
    at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
18/02/11 13:22:27 INFO TaskSchedulerImpl: Cancelling stage 7580621
18/02/11 13:22:27 INFO TaskSchedulerImpl: Cancelling stage 7580621
18/02/11 13:22:27 INFO TaskSchedulerImpl: Stage 7580621 was cancelled
18/02/11 13:22:27 INFO DAGScheduler: ShuffleMapStage 7580621 (start at command-2841337:340) failed in 0.057 s due to Job aborted due to stage failure: Stage 7580621 cancelled
org.apache.spark.SparkException: Job aborted due to stage failure: Stage 7580621 cancelled
18/02/11 13:22:27 WARN TaskSetManager: Lost task 18.0 in stage 7580621.1 (TID 65577140, 10.0.144.170, executor 16): TaskKilled (Stage cancelled)
{code}

According to the above logs, I think the issue is in this line:
https://github.com/apache/spark/blob/1dc2c1d5e85c5f404f470aeb44c1f3c22786bdea/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1281

"Task 18.0 in stage 7580621.0" finished and updated "shuffleStage.pendingPartitions" when "Task 18.0 in stage 7580621.1" was still running. Hence, when 18 of 19 tasks finished in "stage 7580621.1", this condition (https://github.com/apache/spark/blob/1dc2c1d5e85c5f404f470aeb44c1f3c22786bdea/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1284) would be true and trigger "stage 7580621.2".
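The race analyzed in the comment above can be simulated in a few lines of plain Python. This is an illustrative model, not Spark code: a completion from a superseded ("zombie") attempt drains pendingPartitions, so a retry attempt gets submitted while the current attempt is still active, which is exactly the invariant the TaskScheduler enforces with the IllegalStateException.

```python
class StageSim:
    """Toy model of a shuffle map stage's bookkeeping (names illustrative)."""
    def __init__(self, partitions):
        self.pending = set(partitions)   # models shuffleStage.pendingPartitions
        self.active_attempts = set()     # task sets the scheduler believes are live
        self.next_attempt = 0

    def submit_tasks(self, attempt):
        # Models TaskSchedulerImpl.submitTasks's invariant check.
        if self.active_attempts:
            raise RuntimeError("more than one active taskSet for stage: "
                               + ",".join(map(str, self.active_attempts | {attempt})))
        self.active_attempts.add(attempt)

    def task_finished(self, partition, failed=()):
        # Completions from *any* attempt, even a zombie one, shrink pending.
        self.pending.discard(partition)
        if not self.pending and failed:
            # Resubmission fires without checking the active attempt is done.
            self.pending = set(failed)
            self.next_attempt += 1
            self.submit_tasks(self.next_attempt)
```

Running it: with attempt 1 active and one partition pending, a straggler from attempt 0 finishing that partition triggers submission of attempt 2 and raises while attempt 1 is still live, mirroring the 7580621.2/7580621.1 collision in the logs.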
[jira] [Commented] (SPARK-23368) Avoid unnecessary Exchange or Sort after projection
[ https://issues.apache.org/jira/browse/SPARK-23368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366374#comment-16366374 ] Maryann Xue commented on SPARK-23368:

[~cloud_fan], [~smilegator], Could you please help review this PR? Thanks in advance!

> Avoid unnecessary Exchange or Sort after projection
> ---------------------------------------------------
>
>                 Key: SPARK-23368
>                 URL: https://issues.apache.org/jira/browse/SPARK-23368
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Maryann Xue
>            Priority: Minor
>
> After a column-rename projection, the ProjectExec's outputOrdering and outputPartitioning should reflect the projected columns as well. For example,
> {code:java}
> SELECT b1
> FROM (
>   SELECT a a1, b b1
>   FROM testData2
>   ORDER BY a
> )
> ORDER BY a1{code}
> The inner query is ordered on a1 as well. If we had a rule to eliminate Sort on an already-sorted result, together with this fix, the order-by in the outer query could have been optimized out.
>
> Similarly, the query below
> {code:java}
> SELECT *
> FROM (
>   SELECT t1.a a1, t2.a a2, t1.b b1, t2.b b2
>   FROM testData2 t1
>   LEFT JOIN testData2 t2
>   ON t1.a = t2.a
> )
> JOIN testData2 t3
> ON a1 = t3.a{code}
> is equivalent to
> {code:java}
> SELECT *
> FROM testData2 t1
> LEFT JOIN testData2 t2
> ON t1.a = t2.a
> JOIN testData2 t3
> ON t1.a = t3.a{code}
> , so the unnecessary sorting and hash-partitioning that have been optimized out for the second query should have been eliminated in the first query as well.
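The core idea, rewriting a child's output ordering through the projection's alias map so that `ORDER BY a1` is recognized as already satisfied by `ORDER BY a`, can be sketched in Python. Function and argument names here are assumptions for illustration, not ProjectExec's actual API.

```python
def project_output_ordering(child_ordering, aliases):
    """child_ordering: columns the child is sorted on, in order.
    aliases: {old_name: new_name} pairs produced by a rename-only projection.
    Returns the ordering as seen through the projection's output."""
    projected = []
    for col in child_ordering:
        if col not in aliases:
            break  # the usable ordering prefix ends at the first dropped column
        projected.append(aliases[col])
    return projected
```

With this, the outer `ORDER BY a1` in the first example matches the projected ordering `["a1"]`, so a sort-elimination rule could remove it; the same mapping applied to outputPartitioning would let the join in the second example reuse the child's hash partitioning.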
[jira] [Created] (SPARK-23444) would like to be able to cancel jobs cleanly
Jose Torres created SPARK-23444:

             Summary: would like to be able to cancel jobs cleanly
                 Key: SPARK-23444
                 URL: https://issues.apache.org/jira/browse/SPARK-23444
             Project: Spark
          Issue Type: Wish
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: Jose Torres

In Structured Streaming, we often need to cancel a Spark job in order to close the stream. SparkContext does not (as far as I can tell) provide a runJob handle which cleanly signals when a job was cancelled; it simply throws a generic SparkException. So we're forced to awkwardly parse this SparkException in order to determine whether the job failed because of a cancellation (which we expect and want to swallow) or another error (which we want to propagate).
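The awkward workaround the ticket describes looks roughly like this, transliterated to Python for illustration (the exception class and message text are stand-ins, not Spark's actual strings): cancellation is detected by string-matching the exception message, for lack of a structured signal.

```python
class SparkException(Exception):
    """Stand-in for org.apache.spark.SparkException."""

def run_swallowing_cancellation(run_job):
    try:
        return run_job()
    except SparkException as e:
        if "cancelled" in str(e).lower():  # brittle message parsing
            return None                    # expected on stream shutdown: swallow
        raise                              # a real failure: propagate it
```

The fragility is the point: any change to the exception message, or a genuine failure whose message happens to contain "cancelled", breaks the classification, which is why a dedicated cancellation signal on the runJob handle is being requested.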
[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366277#comment-16366277 ] Bago Amirbekian commented on SPARK-23265:

What's the status of this? Will this be a change in behaviour?

> Update multi-column error handling logic in QuantileDiscretizer
> ---------------------------------------------------------------
>
>                 Key: SPARK-23265
>                 URL: https://issues.apache.org/jira/browse/SPARK-23265
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Nick Pentreath
>            Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If both single- and multi-column params are set (specifically {{inputCol}} / {{inputCols}}), an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that for this transformer, it is acceptable to set the single-column param {{numBuckets}} when transforming multiple columns, since that is then applied to all columns.
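A sketch of the Bucketizer-style validation being asked for, in plain Python with illustrative parameter names (not the actual MLlib API): inputCol/inputCols are mutually exclusive, output params must match the chosen mode, but a lone numBuckets is deliberately allowed alongside inputCols since it applies to every column.

```python
def validate_params(input_col=None, input_cols=None,
                    output_col=None, output_cols=None, num_buckets=2):
    """Raise ValueError on inconsistent single- vs multi-column param mixes."""
    if (input_col is None) == (input_cols is None):
        raise ValueError("Exactly one of inputCol or inputCols must be set")
    if input_col is not None and output_cols is not None:
        raise ValueError("outputCols may only be used with inputCols")
    if input_cols is not None and output_col is not None:
        raise ValueError("outputCol may only be used with inputCol")
    # note: num_buckets (a single-column param) is intentionally NOT checked
    # against inputCols; it is broadcast to all columns.
    return True
```

Centralizing the checks like this, rather than only rejecting the inputCol/inputCols clash, is what "more comprehensive error logic" means here.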
[jira] [Resolved] (SPARK-22913) Hive Partition Pruning, Fractional and Timestamp types
[ https://issues.apache.org/jira/browse/SPARK-22913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ameen Tayyebi resolved SPARK-22913.

    Resolution: Won't Fix

Resolving in favor of native Glue integration. These advanced predicates can't be supported because the version of Hive embedded in Spark does not support them.

> Hive Partition Pruning, Fractional and Timestamp types
> ------------------------------------------------------
>
>                 Key: SPARK-22913
>                 URL: https://issues.apache.org/jira/browse/SPARK-22913
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Ameen Tayyebi
>            Priority: Major
>
> Spark currently pushes the predicates it has in the SQL query down to the Hive metastore. This only applies to predicates placed on partitioning columns. As more and more Hive metastore implementations come around, this is an important optimization that allows data to be pre-filtered to only the relevant partitions. Consider the following example:
> Table:
> {code}
> create external table data (key string, quantity long)
> partitioned by (processing-date timestamp)
> {code}
> Query:
> {code}
> select * from data where processing-date = '2017-10-23 00:00:00'
> {code}
> Currently, no filters will be pushed to the Hive metastore for the above query. The reason is that the code that computes the predicates to be sent to the Hive metastore only deals with integral and string column types; it doesn't know how to handle fractional and timestamp columns.
> I have tables in my metastore (AWS Glue) with millions of partitions of type timestamp and double. In my specific case, it takes Spark's master node about 6.5 minutes to download all partitions for the table and then filter the partitions client-side, while the actual processing time of my query is only 6 seconds. In other words, without partition pruning I'm looking at 6.5 minutes of processing, and with partition pruning only 6 seconds.
> I have a fix for this developed locally that I'll provide shortly as a pull request.
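The pushdown decision described above can be sketched in Python. This is an illustrative model, not Spark's implementation: conjuncts over partition columns become a metastore filter string only when the column type is in the convertible set, and the ticket proposes widening that set beyond integral and string types. All names (including the underscored column) are illustrative.

```python
INTEGRAL_AND_STRING = {"int", "bigint", "string"}                   # status quo
WITH_FRACTIONAL_TS = INTEGRAL_AND_STRING | {"double", "timestamp"}  # proposed

def to_metastore_filter(predicates, partition_schema, convertible=INTEGRAL_AND_STRING):
    """predicates: [(column, op, value)]; partition_schema: {column: type}.
    Returns the pushable conjuncts as a filter string, or None if nothing
    can be pushed (which forces client-side pruning of all partitions)."""
    parts = []
    for col, op, value in predicates:
        ctype = partition_schema.get(col)
        if ctype in convertible:
            lit = f'"{value}"' if ctype in ("string", "timestamp") else str(value)
            parts.append(f"{col} {op} {lit}")
    return " and ".join(parts) or None
```

With the status-quo set, a timestamp partition predicate yields no filter at all, which is exactly the 6.5-minute download-everything path; with the widened set, the same predicate is pushed and only matching partitions are fetched.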
[jira] [Created] (SPARK-23443) Spark with Glue as external catalog
Ameen Tayyebi created SPARK-23443: - Summary: Spark with Glue as external catalog Key: SPARK-23443 URL: https://issues.apache.org/jira/browse/SPARK-23443 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.0 Reporter: Ameen Tayyebi AWS Glue Catalog is an external Hive metastore backed by a web service. It allows permanent storage of catalog data for BigData use cases. To find out more information about AWS Glue, please consult: * AWS Glue - [https://aws.amazon.com/glue/] * Using Glue as a Metastore catalog for Spark - [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html] Today, the integration of Glue and Spark is through the Hive layer. Glue implements the IMetaStore interface of Hive, and for installations of Spark that contain Hive, Glue can be used as the metastore. The feature set that Glue supports does not align 1-1 with the set of features that the latest version of Spark supports. For example, the Glue interface supports more advanced partition pruning than the latest version of Hive embedded in Spark. To enable a more natural integration with Spark and to allow leveraging the latest features of Glue without being coupled to Hive, a direct integration through Spark's own Catalog API is proposed. This Jira tracks that work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work
[ https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366252#comment-16366252 ] Apache Spark commented on SPARK-23413: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/20623 > Sorting tasks by Host / Executor ID on the Stage page does not work > --- > > Key: SPARK-23413 > URL: https://issues.apache.org/jira/browse/SPARK-23413 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Blocker > Fix For: 2.3.0 > > > Sorting tasks by Host / Executor ID throws exceptions: > {code} > java.lang.IllegalArgumentException: Invalid sort column: Executor ID at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > {code} > {code} > java.lang.IllegalArgumentException: Invalid sort column: Host at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > 
org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > {code} > !image-2018-02-13-16-50-32-600.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work
[ https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-23413: - Affects Version/s: (was: 2.4.0) > Sorting tasks by Host / Executor ID on the Stage page does not work > --- > > Key: SPARK-23413 > URL: https://issues.apache.org/jira/browse/SPARK-23413 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Blocker > Fix For: 2.3.0 > > > Sorting tasks by Host / Executor ID throws exceptions: > {code} > java.lang.IllegalArgumentException: Invalid sort column: Executor ID at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > {code} > {code} > java.lang.IllegalArgumentException: Invalid sort column: Host at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > 
org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > {code} > !image-2018-02-13-16-50-32-600.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work
[ https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366199#comment-16366199 ] Imran Rashid commented on SPARK-23413: -- This was fixed by https://github.com/apache/spark/pull/20601 in master, and https://github.com/apache/spark/pull/20623 in branch-2.3 (because of a mistake I made while merging) > Sorting tasks by Host / Executor ID on the Stage page does not work > --- > > Key: SPARK-23413 > URL: https://issues.apache.org/jira/browse/SPARK-23413 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Blocker > Fix For: 2.3.0 > > > Sorting tasks by Host / Executor ID throws exceptions: > {code} > java.lang.IllegalArgumentException: Invalid sort column: Executor ID at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > {code} > {code} > java.lang.IllegalArgumentException: Invalid sort column: Host at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > 
org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > {code} > !image-2018-02-13-16-50-32-600.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work
[ https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-23413. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20623 [https://github.com/apache/spark/pull/20623] > Sorting tasks by Host / Executor ID on the Stage page does not work > --- > > Key: SPARK-23413 > URL: https://issues.apache.org/jira/browse/SPARK-23413 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0, 2.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Blocker > Fix For: 2.3.0 > > > Sorting tasks by Host / Executor ID throws exceptions: > {code} > java.lang.IllegalArgumentException: Invalid sort column: Executor ID at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > {code} > {code} > java.lang.IllegalArgumentException: Invalid sort column: Host at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > 
org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > {code} > !image-2018-02-13-16-50-32-600.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work
[ https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-23413: Assignee: Attila Zsolt Piros > Sorting tasks by Host / Executor ID on the Stage page does not work > --- > > Key: SPARK-23413 > URL: https://issues.apache.org/jira/browse/SPARK-23413 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0, 2.4.0 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Blocker > > Sorting tasks by Host / Executor ID throws exceptions: > {code} > java.lang.IllegalArgumentException: Invalid sort column: Executor ID at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > {code} > {code} > java.lang.IllegalArgumentException: Invalid sort column: Host at > org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017) at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694) at > org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) at > org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) at > org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708) at > org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) at > org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) 
at > org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82) at > org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at > {code} > !image-2018-02-13-16-50-32-600.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
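The root cause behind the stack traces above is a header-name-to-index lookup table that lacked entries for the new columns. A minimal Python model of the failure mode and the shape of the fix follows; the mapping contents are hypothetical stand-ins, not Spark's actual table in ApiHelper.indexName:

```python
# Hypothetical stand-in for ApiHelper.indexName on the Stage page: it maps
# a UI column header to an internal index name. Before the fix, the table
# had no entries for "Executor ID" and "Host", so sorting on those columns
# raised IllegalArgumentException("Invalid sort column: ...").
COLUMN_TO_INDEX = {
    "Index": "index",
    "Duration": "duration",
    "Executor ID": "executorId",   # added by the fix
    "Host": "host",                # added by the fix
}

def index_name(sort_column):
    if sort_column not in COLUMN_TO_INDEX:
        # Mirrors the exception message seen in the reported stack traces.
        raise ValueError(f"Invalid sort column: {sort_column}")
    return COLUMN_TO_INDEX[sort_column]

print(index_name("Executor ID"))  # no longer raises once the entries exist
```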
[jira] [Updated] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable
[ https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-23173: - Labels: release-notes (was: ) > from_json can produce nulls for fields which are marked as non-nullable > --- > > Key: SPARK-23173 > URL: https://issues.apache.org/jira/browse/SPARK-23173 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Herman van Hovell >Priority: Major > Labels: release-notes > > The {{from_json}} function uses a schema to convert a string into a Spark SQL > struct. This schema can contain non-nullable fields. The underlying > {{JsonToStructs}} expression does not check if a resulting struct respects > the nullability of the schema. This leads to very weird problems in consuming > expressions. In our case, parquet writing would produce an illegal parquet > file. > There are roughly two solutions here: > # Assume that each field in the schema passed to {{from_json}} is nullable, and > ignore the nullability information set in the passed schema. > # Validate the object at runtime, and fail execution if the data is null > where we are not expecting this. > I am currently slightly in favor of option 1, since this is the more > performant option and a lot easier to do. > WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
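Option 1 amounts to recursively relaxing the user-supplied schema so that every field is treated as nullable before parsing. A toy Python sketch over a nested-dict schema representation follows; this is illustrative only, since Spark's actual change would operate on StructType, not dicts:

```python
# Toy schema model: {"type": "struct", "fields": [...]} where each field is
# {"name": ..., "nullable": ..., "dataType": ...}. as_nullable mirrors
# option 1: ignore the nullability info the caller passed to from_json by
# forcing every field to nullable, recursively into nested structs.
def as_nullable(schema):
    if isinstance(schema, dict) and schema.get("type") == "struct":
        return {
            "type": "struct",
            "fields": [
                {**field, "nullable": True, "dataType": as_nullable(field["dataType"])}
                for field in schema["fields"]
            ],
        }
    return schema  # leaf types carry no nullability of their own here

schema = {"type": "struct", "fields": [
    {"name": "a", "nullable": False, "dataType": {"type": "long"}},
    {"name": "b", "nullable": False, "dataType": {"type": "struct", "fields": [
        {"name": "c", "nullable": False, "dataType": {"type": "string"}},
    ]}},
]}
relaxed = as_nullable(schema)
print([f["nullable"] for f in relaxed["fields"]])  # [True, True]
```

With the relaxed schema, downstream consumers such as the parquet writer never see a non-nullable field that might actually contain null, which is the invariant violation the issue describes.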
[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366162#comment-16366162 ] Igor Berman edited comment on SPARK-23423 at 2/15/18 7:30 PM: -- [~skonto] I'll look at it on Sunday, probably. I'm not losing agents. was (Author: igor.berman): [~skonto] I'll look at it on Sunday, probably > Application declines any offers when killed+active executors reach > spark.dynamicAllocation.maxExecutors > -- > > Key: SPARK-23423 > URL: https://issues.apache.org/jira/browse/SPARK-23423 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.2.1 >Reporter: Igor Berman >Priority: Major > Labels: Mesos, dynamic_allocation > > Hi > Mesos Version: 1.1.0 > I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend > when running on Mesos with dynamic allocation on and limiting the maximum > number of executors via spark.dynamicAllocation.maxExecutors. > Suppose we have a long-running driver with a cyclic pattern of resource > consumption (with some idle times in between); due to dynamic allocation it > receives offers and then releases them after the current chunk of work is processed. > Since at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573] > the backend compares numExecutors < executorLimit and > numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves > holds all slaves ever "met", i.e. 
both active and killed (see comment > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]) > > On the other hand, the number of taskIDs should be updated via statusUpdate, > but suppose this update is lost (in fact, I don't see 'is now > TASK_KILLED' in the logs), so this count of executors might be wrong > > I've created a test that "reproduces" this behavior, not sure how good it is: > {code:java} > //MesosCoarseGrainedSchedulerBackendSuite > test("max executors registered stops to accept offers when dynamic allocation > enabled") { > setBackend(Map( > "spark.dynamicAllocation.maxExecutors" -> "1", > "spark.dynamicAllocation.enabled" -> "true", > "spark.dynamicAllocation.testing" -> "true")) > backend.doRequestTotalExecutors(1) > val (mem, cpu) = (backend.executorMemory(sc), 4) > val offer1 = createOffer("o1", "s1", mem, cpu) > backend.resourceOffers(driver, List(offer1).asJava) > verifyTaskLaunched(driver, "o1") > backend.doKillExecutors(List("0")) > verify(driver, times(1)).killTask(createTaskId("0")) > val offer2 = createOffer("o2", "s2", mem, cpu) > backend.resourceOffers(driver, List(offer2).asJava) > verify(driver, times(1)).declineOffer(offer2.getId) > }{code} > > > Workaround: Don't set maxExecutors with dynamicAllocation on > > Please advise > Igor > marking you, friends, since you were the last to touch this piece of code and > can probably advise something ([~vanzin], [~skonto], [~susanxhuynh]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366162#comment-16366162 ] Igor Berman commented on SPARK-23423: - [~skonto] I'll look at it on Sunday, probably > Application declines any offers when killed+active executors reach > spark.dynamicAllocation.maxExecutors > -- > > Key: SPARK-23423 > URL: https://issues.apache.org/jira/browse/SPARK-23423 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.2.1 >Reporter: Igor Berman >Priority: Major > Labels: Mesos, dynamic_allocation > > Hi > Mesos Version: 1.1.0 > I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend > when running on Mesos with dynamic allocation on and limiting the maximum > number of executors via spark.dynamicAllocation.maxExecutors. > Suppose we have a long-running driver with a cyclic pattern of resource > consumption (with some idle times in between); due to dynamic allocation it > receives offers and then releases them after the current chunk of work is processed. > Since at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573] > the backend compares numExecutors < executorLimit and > numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves > holds all slaves ever "met", i.e. 
both active and killed (see comment > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]) > > On the other hand, the number of taskIDs should be updated via statusUpdate, > but suppose this update is lost (in fact, I don't see 'is now > TASK_KILLED' in the logs), so this count of executors might be wrong > > I've created a test that "reproduces" this behavior, not sure how good it is: > {code:java} > //MesosCoarseGrainedSchedulerBackendSuite > test("max executors registered stops to accept offers when dynamic allocation > enabled") { > setBackend(Map( > "spark.dynamicAllocation.maxExecutors" -> "1", > "spark.dynamicAllocation.enabled" -> "true", > "spark.dynamicAllocation.testing" -> "true")) > backend.doRequestTotalExecutors(1) > val (mem, cpu) = (backend.executorMemory(sc), 4) > val offer1 = createOffer("o1", "s1", mem, cpu) > backend.resourceOffers(driver, List(offer1).asJava) > verifyTaskLaunched(driver, "o1") > backend.doKillExecutors(List("0")) > verify(driver, times(1)).killTask(createTaskId("0")) > val offer2 = createOffer("o2", "s2", mem, cpu) > backend.resourceOffers(driver, List(offer2).asJava) > verify(driver, times(1)).declineOffer(offer2.getId) > }{code} > > > Workaround: Don't set maxExecutors with dynamicAllocation on > > Please advise > Igor > marking you, friends, since you were the last to touch this piece of code and > can probably advise something ([~vanzin], [~skonto], [~susanxhuynh]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
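The suspect comparison sums task IDs over every agent the backend has ever seen, so executors on killed agents whose TASK_KILLED update was lost still count against the limit. A simplified Python model of that accounting, contrasting the reported behavior with the obvious correction (this is a toy model, not the actual backend code):

```python
# Simplified model of MesosCoarseGrainedSchedulerBackend's executor
# accounting. The reported problem: numExecutors sums taskIDs over all
# slaves ever seen, so executors on killed agents (whose status update was
# lost) still count against spark.dynamicAllocation.maxExecutors.
class Slave:
    def __init__(self, task_ids, active=True):
        self.task_ids = set(task_ids)
        self.active = active

def num_executors_reported(slaves):
    # slaves.values.map(_.taskIDs.size).sum - counts killed agents too
    return sum(len(s.task_ids) for s in slaves.values())

def num_executors_active_only(slaves):
    # Sketch of a fix: count only tasks on agents still marked active.
    return sum(len(s.task_ids) for s in slaves.values() if s.active)

slaves = {
    "s1": Slave({"0"}, active=False),  # executor killed, TASK_KILLED update lost
    "s2": Slave(set()),                # fresh agent offering resources
}
executor_limit = 1
print(num_executors_reported(slaves) < executor_limit)     # False: offer declined
print(num_executors_active_only(slaves) < executor_limit)  # True: offer accepted
```

This matches the test quoted above: after killing executor "0", the reported count stays at the limit, so the second offer is declined forever.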
[jira] [Resolved] (SPARK-23377) Bucketizer with multiple columns persistence bug
[ https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23377. --- Resolution: Fixed Fix Version/s: 2.4.0 2.3.1 Resolved in master and branch-2.3 with https://github.com/apache/spark/pull/20594 > Bucketizer with multiple columns persistence bug > > > Key: SPARK-23377 > URL: https://issues.apache.org/jira/browse/SPARK-23377 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Assignee: Liang-Chi Hsieh >Priority: Blocker > Fix For: 2.3.1, 2.4.0 > > > A Bucketizer with multiple input/output columns gets "inputCol" set to the > default value on write -> read, which causes it to throw an error on > transform. Here's an example. > {code:java} > import org.apache.spark.ml.feature._ > val splits = Array(Double.NegativeInfinity, 0, 10, 100, > Double.PositiveInfinity) > val bucketizer = new Bucketizer() > .setSplitsArray(Array(splits, splits)) > .setInputCols(Array("foo1", "foo2")) > .setOutputCols(Array("bar1", "bar2")) > val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") > bucketizer.transform(data) > val path = "/temp/bucketrizer-persist-test" > bucketizer.write.overwrite.save(path) > val bucketizerAfterRead = Bucketizer.read.load(path) > println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) > // This line throws an error because "outputCol" is set > bucketizerAfterRead.transform(data) > {code} > And the trace: > {code:java} > java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has > the inputCols Param set for multi-column transform. The following Params are > not applicable and should not be set: outputCol. 
> at > org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300) > at > org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314) > at > org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189) > at > org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141) > at > line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
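The failure above comes from single-column params being (re)set to defaults on load alongside the persisted multi-column ones, which then trips the exclusivity check. A sketch of the fix's shape over a toy param map follows; the function and key names are hypothetical, since the real change operates on Spark ML's param machinery:

```python
# Toy model of the Bucketizer persistence bug: on read, default
# single-column params ("inputCol"/"outputCol") end up set alongside the
# persisted multi-column ones, which later fails checkExclusiveParams.
# The sketch drops the single-column params whenever multi-column params
# are present, mirroring the intent of the fix.
def sanitize_loaded_params(params):
    if "inputCols" in params or "outputCols" in params:
        return {k: v for k, v in params.items()
                if k not in ("inputCol", "outputCol")}
    return params

loaded = {
    "inputCols": ["foo1", "foo2"],
    "outputCols": ["bar1", "bar2"],
    "outputCol": "bucketizer_output",  # spurious default set during read
}
print(sorted(sanitize_loaded_params(loaded)))
```

After sanitizing, only the exclusive multi-column params survive, so the transform-time validation in ParamValidators.checkExclusiveParams has nothing to complain about.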
[jira] [Resolved] (SPARK-23430) Cannot sort "Executor ID" or "Host" columns in the task table
[ https://issues.apache.org/jira/browse/SPARK-23430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-23430. -- Resolution: Duplicate > Cannot sort "Executor ID" or "Host" columns in the task table > - > > Key: SPARK-23430 > URL: https://issues.apache.org/jira/browse/SPARK-23430 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Blocker > Labels: regression > > Click the "Executor ID" or "Host" header in the task table and it will fail: > {code} > java.lang.IllegalArgumentException: Invalid sort column: Executor ID > at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1009) > at > org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:686) > at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61) > at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96) > at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:700) > at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293) > at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282) > at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) > at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:98) > at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at > org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > 
at > org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134) > at org.eclipse.jetty.server.Server.handle(Server.java:534) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) > at > org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148) > at > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases
[ https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pranav Rao updated SPARK-23442: --- Description: Through the DataFrameWriter[T] interface I have created an external Hive table with 5000 (horizontal) partitions and 50 buckets in each partition. Overall the dataset is 600GB and the provider is Parquet. Now this works great when joining with a similarly bucketed dataset - it's able to avoid a shuffle. But any action on this DataFrame (from _spark.table("tablename")_) works with only 50 RDD partitions. This is happening because of [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala]. So the 600GB dataset is only read through 50 tasks, which makes this partitioning + bucketing scheme not useful. I cannot expose the base directory of the parquet folder for reading the dataset, because the partition locations don't follow a (basePath + partSpec) format. Meanwhile, are there workarounds to use higher parallelism while reading such a table? Let me know if I can help in any way. was: Through the DataFrameWriter[T] interface I have created an external Hive table with 5000 (horizontal) partitions and 50 buckets in each partition. Overall the dataset is 600GB and the provider is Parquet. Now this works great when joining with a similarly bucketed dataset - it's able to avoid a shuffle. But any action on this DataFrame (from _spark.table("tablename")_) works with only 50 RDD partitions. This is happening because of [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala]. 
So the 600GB dataset is only read through 50 tasks, which makes this partitioning + bucketing scheme not useful at all. I cannot expose the base directory of the parquet folder for reading the dataset, because the partition locations don't follow a (basePath + partSpec) format. Meanwhile, are there workarounds to use higher parallelism while reading such a table? Let me know if I can help in any way. > Reading from partitioned and bucketed table uses only bucketSpec.numBuckets > partitions in all cases > --- > > Key: SPARK-23442 > URL: https://issues.apache.org/jira/browse/SPARK-23442 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Pranav Rao >Priority: Major > > Through the DataFrameWriter[T] interface I have created an external Hive table > with 5000 (horizontal) partitions and 50 buckets in each partition. Overall > the dataset is 600GB and the provider is Parquet. > Now this works great when joining with a similarly bucketed dataset - it's > able to avoid a shuffle. > But any action on this DataFrame (from _spark.table("tablename")_) works with > only 50 RDD partitions. This is happening because of > [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala]. > So the 600GB dataset is only read through 50 tasks, which makes this > partitioning + bucketing scheme not useful. > I cannot expose the base directory of the parquet folder for reading the > dataset, because the partition locations don't follow a (basePath + partSpec) > format. > Meanwhile, are there workarounds to use higher parallelism while reading such > a table? > Let me know if I can help in any way. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases
[ https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pranav Rao updated SPARK-23442: --- Description: Through the DataFrameWriter[T] interface I have created an external Hive table with 5000 (horizontal) partitions and 50 buckets in each partition. Overall the dataset is 600GB and the provider is Parquet. Now this works great when joining with a similarly bucketed dataset - it's able to avoid a shuffle. But any action on this DataFrame (from _spark.table("tablename")_) works with only 50 RDD partitions. This is happening because of [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala]. So the 600GB dataset is only read through 50 tasks, which makes this partitioning + bucketing scheme not useful at all. I cannot expose the base directory of the parquet folder for reading the dataset, because the partition locations don't follow a (basePath + partSpec) format. Meanwhile, are there workarounds to use higher parallelism while reading such a table? Let me know if I can help in any way. was: Through the DataFrameWriter[T] interface I have created an external Hive table with 5000 (horizontal) partitions and 50 buckets in each partition. Overall the dataset is 600GB and the provider is Parquet. Now this works great when joining with a similarly bucketed dataset - it's able to avoid a shuffle. But any action on this DataFrame (from _spark.table("tablename")_) works with only 50 RDD partitions. This is happening because of [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala]. 
So the 600GB dataset is only read through 50 tasks, which makes this partitioning + bucketing scheme not useful at all. I cannot expose the base directory of the parquet folder for reading the dataset, because the partition locations don't follow a (basePath + partSpec) format. Meanwhile, are there workarounds to use higher parallelism while reading such a table? Let me know if I can help in any way. > Reading from partitioned and bucketed table uses only bucketSpec.numBuckets > partitions in all cases > --- > > Key: SPARK-23442 > URL: https://issues.apache.org/jira/browse/SPARK-23442 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Pranav Rao >Priority: Major > > Through the DataFrameWriter[T] interface I have created an external Hive table > with 5000 (horizontal) partitions and 50 buckets in each partition. Overall > the dataset is 600GB and the provider is Parquet. > Now this works great when joining with a similarly bucketed dataset - it's > able to avoid a shuffle. > But any action on this DataFrame (from _spark.table("tablename")_) works with > only 50 RDD partitions. This is happening because of > [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala]. > So the 600GB dataset is only read through 50 tasks, which makes this > partitioning + bucketing scheme not useful at all. > I cannot expose the base directory of the parquet folder for reading the > dataset, because the partition locations don't follow a (basePath + partSpec) > format. > Meanwhile, are there workarounds to use higher parallelism while reading such > a table? > Let me know if I can help in any way. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases
[ https://issues.apache.org/jira/browse/SPARK-23442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pranav Rao updated SPARK-23442: --- Environment: (was: spark.sql("SET spark.default.parallelism=1000") {{spark.sql("set spark.sql.shuffle.partitions=500") }} {{spark.sql("set spark.sql.files.maxPartitionBytes=134217728")}} {{-}} {{$ hdfs getconf -confKey mapreduce.input.fileinputformat.split.minsize}} 0 $ hdfs getconf -confKey dfs.blocksize 134217728 $ hdfs getconf -confKey mapreduce.job.maps 32) > Reading from partitioned and bucketed table uses only bucketSpec.numBuckets > partitions in all cases > --- > > Key: SPARK-23442 > URL: https://issues.apache.org/jira/browse/SPARK-23442 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Pranav Rao >Priority: Major > > Through the DataFrameWriter[T] interface I have created an external Hive table > with 5000 (horizontal) partitions and 50 buckets in each partition. Overall > the dataset is 600GB and the provider is Parquet. > Now this works great when joining with a similarly bucketed dataset - it's > able to avoid a shuffle. > But any action on this DataFrame (from _spark.table("tablename")_) works with > only 50 RDD partitions. This is happening because of > [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala]. > So the 600GB dataset is only read through 50 tasks, which makes this > partitioning + bucketing scheme not useful at all. > I cannot expose the base directory of the parquet folder for reading the > dataset, because the partition locations don't follow a (basePath + partSpec) > format. > Meanwhile, are there workarounds to use higher parallelism while reading such > a table? 
Let me know if I can help in any way. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23442) Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases
Pranav Rao created SPARK-23442: -- Summary: Reading from partitioned and bucketed table uses only bucketSpec.numBuckets partitions in all cases Key: SPARK-23442 URL: https://issues.apache.org/jira/browse/SPARK-23442 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.2.1 Environment: spark.sql("SET spark.default.parallelism=1000") {{spark.sql("set spark.sql.shuffle.partitions=500") }} {{spark.sql("set spark.sql.files.maxPartitionBytes=134217728")}} {{-}} {{$ hdfs getconf -confKey mapreduce.input.fileinputformat.split.minsize}} 0 $ hdfs getconf -confKey dfs.blocksize 134217728 $ hdfs getconf -confKey mapreduce.job.maps 32 Reporter: Pranav Rao Through the DataFrameWriter[T] interface I have created an external Hive table with 5000 (horizontal) partitions and 50 buckets in each partition. Overall the dataset is 600GB and the provider is Parquet. Now this works great when joining with a similarly bucketed dataset - it's able to avoid a shuffle. But any action on this DataFrame (from _spark.table("tablename")_) works with only 50 RDD partitions. This is happening because of [createBucketedReadRDD|https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala]. So the 600GB dataset is only read through 50 tasks, which makes this partitioning + bucketing scheme not useful at all. I cannot expose the base directory of the parquet folder for reading the dataset, because the partition locations don't follow a (basePath + partSpec) format. Meanwhile, are there workarounds to use higher parallelism while reading such a table? Let me know if I can help in any way. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
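The read-parallelism collapse described in SPARK-23442 can be sketched outside Spark with a toy model of createBucketedReadRDD's grouping (file names and counts here are illustrative, not Spark's exact on-disk naming scheme):

```python
from collections import defaultdict

# Hypothetical listing: 5000 Hive partitions, each holding one file per
# bucket (50 buckets), as produced by DataFrameWriter.bucketBy.
num_hive_partitions = 5000
num_buckets = 50
files = [(f"pkey={p}/part-{b:05d}", b)
         for p in range(num_hive_partitions)
         for b in range(num_buckets)]

# createBucketedReadRDD (simplified): every file sharing a bucket id is
# folded into a single read partition, so the scan runs with exactly
# numBuckets tasks no matter how many Hive partitions are selected.
read_tasks = defaultdict(list)
for path, bucket in files:
    read_tasks[bucket].append(path)

print(len(read_tasks))      # 50 read tasks for the whole 600GB dataset
print(len(read_tasks[0]))   # each task must read 5000 files
```

This is why bumping spark.default.parallelism or spark.sql.files.maxPartitionBytes (as tried in the Environment field) has no effect: the grouping is keyed only on the bucket id.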
[jira] [Assigned] (SPARK-23377) Bucketizer with multiple columns persistence bug
[ https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-23377: - Assignee: Liang-Chi Hsieh > Bucketizer with multiple columns persistence bug > > > Key: SPARK-23377 > URL: https://issues.apache.org/jira/browse/SPARK-23377 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Assignee: Liang-Chi Hsieh >Priority: Blocker > > A Bucketizer with multiple input/output columns gets "inputCol" set to the > default value on write -> read, which causes it to throw an error on > transform. Here's an example. > {code:java} > import org.apache.spark.ml.feature._ > val splits = Array(Double.NegativeInfinity, 0, 10, 100, > Double.PositiveInfinity) > val bucketizer = new Bucketizer() > .setSplitsArray(Array(splits, splits)) > .setInputCols(Array("foo1", "foo2")) > .setOutputCols(Array("bar1", "bar2")) > val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") > bucketizer.transform(data) > val path = "/temp/bucketrizer-persist-test" > bucketizer.write.overwrite.save(path) > val bucketizerAfterRead = Bucketizer.read.load(path) > println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) > // This line throws an error because "outputCol" is set > bucketizerAfterRead.transform(data) > {code} > And the trace: > {code:java} > java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has > the inputCols Param set for multi-column transform. The following Params are > not applicable and should not be set: outputCol. 
> at > org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300) > at > org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314) > at > org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189) > at > org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141) > at > line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365991#comment-16365991 ] Piotr Kołaczkowski commented on SPARK-14540: Any progress on this? Are you planning to finalize this by the time Scala 2.13 is stable? > Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner > > > Key: SPARK-14540 > URL: https://issues.apache.org/jira/browse/SPARK-14540 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Josh Rosen >Priority: Major > > Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running > ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures: > {code} > [info] - toplevel return statements in closures are identified at cleaning > time *** FAILED *** (32 milliseconds) > [info] Expected exception > org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no > exception was thrown. (ClosureCleanerSuite.scala:57) > {code} > and > {code} > [info] - user provided closures are actually cleaned *** FAILED *** (56 > milliseconds) > [info] Expected ReturnStatementInClosureException, but got > org.apache.spark.SparkException: Job aborted due to stage failure: Task not > serializable: java.io.NotSerializableException: java.lang.Object > [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class > org.apache.spark.util.TestUserClosuresActuallyCleaned$, > functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I, > implementation=invokeStatic > org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I, > instantiatedMethodType=(I)I, numCaptured=1]) 
> [info]- element of array (index: 0) > [info]- array (class "[Ljava.lang.Object;", size: 1) > [info]- field (class "java.lang.invoke.SerializedLambda", name: > "capturedArgs", type: "class [Ljava.lang.Object;") > [info]- object (class "java.lang.invoke.SerializedLambda", > SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, > functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > > instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, > numCaptured=1]) > [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: > "f", type: "interface scala.Function3") > [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", > MapPartitionsRDD[2] at apply at Transformer.scala:22) > [info]- field (class "scala.Tuple2", name: "_1", type: "class > java.lang.Object") > [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at > apply at > Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)). > [info] This means the closure provided by user is not actually cleaned. > (ClosureCleanerSuite.scala:78) > {code} > We'll need to figure out a closure cleaning strategy which works for 2.12 > lambdas. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23441) Remove interrupts from ContinuousExecution
[ https://issues.apache.org/jira/browse/SPARK-23441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23441: Assignee: (was: Apache Spark) > Remove interrupts from ContinuousExecution > -- > > Key: SPARK-23441 > URL: https://issues.apache.org/jira/browse/SPARK-23441 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Minor > > The reason StreamExecution interrupts the query execution thread is that, for > the microbatch case, nontrivial work goes on in that thread to construct a > batch. In ContinuousExecution, this doesn't apply. Once the state is flipped > from ACTIVE and the underlying job is cancelled, the query execution thread > will immediately go to cleanup. So we don't need to call > queryExecutionThread.interrupt() at all there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23441) Remove interrupts from ContinuousExecution
[ https://issues.apache.org/jira/browse/SPARK-23441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365984#comment-16365984 ] Apache Spark commented on SPARK-23441: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20622 > Remove interrupts from ContinuousExecution > -- > > Key: SPARK-23441 > URL: https://issues.apache.org/jira/browse/SPARK-23441 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Minor > > The reason StreamExecution interrupts the query execution thread is that, for > the microbatch case, nontrivial work goes on in that thread to construct a > batch. In ContinuousExecution, this doesn't apply. Once the state is flipped > from ACTIVE and the underlying job is cancelled, the query execution thread > will immediately go to cleanup. So we don't need to call > queryExecutionThread.interrupt() at all there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23441) Remove interrupts from ContinuousExecution
[ https://issues.apache.org/jira/browse/SPARK-23441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23441: Assignee: Apache Spark > Remove interrupts from ContinuousExecution > -- > > Key: SPARK-23441 > URL: https://issues.apache.org/jira/browse/SPARK-23441 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Apache Spark >Priority: Minor > > The reason StreamExecution interrupts the query execution thread is that, for > the microbatch case, nontrivial work goes on in that thread to construct a > batch. In ContinuousExecution, this doesn't apply. Once the state is flipped > from ACTIVE and the underlying job is cancelled, the query execution thread > will immediately go to cleanup. So we don't need to call > queryExecutionThread.interrupt() at all there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23441) Remove interrupts from ContinuousExecution
Jose Torres created SPARK-23441: --- Summary: Remove interrupts from ContinuousExecution Key: SPARK-23441 URL: https://issues.apache.org/jira/browse/SPARK-23441 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres The reason StreamExecution interrupts the query execution thread is that, for the microbatch case, nontrivial work goes on in that thread to construct a batch. In ContinuousExecution, this doesn't apply. Once the state is flipped from ACTIVE and the underlying job is cancelled, the query execution thread will immediately go to cleanup. So we don't need to call queryExecutionThread.interrupt() at all there. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23440) Clean up StreamExecution interrupts
Jose Torres created SPARK-23440: --- Summary: Clean up StreamExecution interrupts Key: SPARK-23440 URL: https://issues.apache.org/jira/browse/SPARK-23440 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Jose Torres StreamExecution currently heavily leverages interrupt() to stop the query execution thread. But the query execution thread is sometimes in the middle of a context that will wrap or convert the InterruptedException, so we maintain a whitelist of exceptions that we think indicate an exception caused by stop rather than an error condition. This is awkward and probably fragile. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
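The interrupt-free shutdown proposed for ContinuousExecution in the two issues above can be sketched with a cooperative-cancellation loop (a minimal Python analogy, not Spark's actual StreamExecution code): the query thread polls a state flag, so stopping never needs interrupt() or an exception whitelist.

```python
import threading
import time

class QueryThread(threading.Thread):
    """Sketch of cooperative shutdown: the run loop polls a state flag
    instead of relying on interruption-style exceptions."""
    def __init__(self):
        super().__init__()
        self.active = threading.Event()
        self.active.set()          # analogous to the ACTIVE state
        self.cleaned_up = False

    def run(self):
        while self.active.is_set():
            time.sleep(0.01)       # stand-in for processing one epoch
        # state flipped away from ACTIVE: go straight to cleanup
        self.cleaned_up = True

    def stop(self):
        self.active.clear()        # no interrupt() needed
        self.join()

t = QueryThread()
t.start()
t.stop()
print(t.cleaned_up)  # True
```

Because the loop exits on its own, no InterruptedException can be wrapped or converted mid-flight, which is exactly the fragility SPARK-23440 describes.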
[jira] [Assigned] (SPARK-23436) Incorrect Date column Inference in partition discovery
[ https://issues.apache.org/jira/browse/SPARK-23436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23436: Assignee: Apache Spark > Incorrect Date column Inference in partition discovery > -- > > Key: SPARK-23436 > URL: https://issues.apache.org/jira/browse/SPARK-23436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Apoorva Sareen >Assignee: Apache Spark >Priority: Major > > If a partition column value looks like a partial date/timestamp, for > example 2018-01-01-23 > (truncated to the hour), then the data type of the > partitioning column is automatically inferred as date; however, the values > are loaded as null. > Here is example code to reproduce this behaviour: > > > {code:java} > val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", > "date_month", "data_hour", "data") > data.write.partitionBy("id","date_month","data_hour").parquet("output/test") > val input = spark.read.parquet("output/test") > input.printSchema() > input.show() > ## Result ### > root > |-- data: string (nullable = true) > |-- id: integer (nullable = true) > |-- date_month: string (nullable = true) > |-- data_hour: date (nullable = true) > ++---+--+-+ > |data| id|date_month|data_hour| > ++---+--+-+ > |test| 1| 2018-01| null| > ++---+--+-+{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23436) Incorrect Date column Inference in partition discovery
[ https://issues.apache.org/jira/browse/SPARK-23436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23436: Assignee: (was: Apache Spark) > Incorrect Date column Inference in partition discovery > -- > > Key: SPARK-23436 > URL: https://issues.apache.org/jira/browse/SPARK-23436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Apoorva Sareen >Priority: Major > > If a partition column value looks like a partial date/timestamp, for > example 2018-01-01-23 > (truncated to the hour), then the data type of the > partitioning column is automatically inferred as date; however, the values > are loaded as null. > Here is example code to reproduce this behaviour: > > > {code:java} > val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", > "date_month", "data_hour", "data") > data.write.partitionBy("id","date_month","data_hour").parquet("output/test") > val input = spark.read.parquet("output/test") > input.printSchema() > input.show() > ## Result ### > root > |-- data: string (nullable = true) > |-- id: integer (nullable = true) > |-- date_month: string (nullable = true) > |-- data_hour: date (nullable = true) > ++---+--+-+ > |data| id|date_month|data_hour| > ++---+--+-+ > |test| 1| 2018-01| null| > ++---+--+-+{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23436) Incorrect Date column Inference in partition discovery
[ https://issues.apache.org/jira/browse/SPARK-23436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365959#comment-16365959 ] Apache Spark commented on SPARK-23436: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20621 > Incorrect Date column Inference in partition discovery > -- > > Key: SPARK-23436 > URL: https://issues.apache.org/jira/browse/SPARK-23436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Apoorva Sareen >Priority: Major > > If a partition column value looks like a partial date/timestamp, for > example 2018-01-01-23 > (truncated to the hour), then the data type of the > partitioning column is automatically inferred as date; however, the values > are loaded as null. > Here is example code to reproduce this behaviour: > > > {code:java} > val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", > "date_month", "data_hour", "data") > data.write.partitionBy("id","date_month","data_hour").parquet("output/test") > val input = spark.read.parquet("output/test") > input.printSchema() > input.show() > ## Result ### > root > |-- data: string (nullable = true) > |-- id: integer (nullable = true) > |-- date_month: string (nullable = true) > |-- data_hour: date (nullable = true) > ++---+--+-+ > |data| id|date_month|data_hour| > ++---+--+-+ > |test| 1| 2018-01| null| > ++---+--+-+{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
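The mismatch behind SPARK-23436 can be illustrated outside Spark. The helper names below are hypothetical and only approximate Spark's real parsing (which lives in PartitioningUtils/DateTimeUtils): a lenient check accepts "2018-01-01-04" as a date for schema inference, while the strict cast used when loading the value fails and yields null.

```python
from datetime import datetime

def lenient_infer_is_date(value):
    """Hypothetical inference step: accepts any string with a valid
    yyyy-MM-dd prefix, tolerating trailing segments (an assumption,
    simplified from Spark's lenient stringToDate behaviour)."""
    try:
        datetime.strptime(value[:10], "%Y-%m-%d")
        return True
    except ValueError:
        return False

def strict_cast_to_date(value):
    """Hypothetical strict cast applied when materialising the value."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None  # surfaces as null in the DataFrame

v = "2018-01-01-04"
print(lenient_infer_is_date(v))  # True: column type inferred as date
print(strict_cast_to_date(v))    # None: value loaded as null
```

The fix is to make the two paths agree: if the strict cast cannot produce a value, inference should fall back to string.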
[jira] [Commented] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365931#comment-16365931 ] Kazuaki Ishizaki commented on SPARK-23415: -- I realized there is an issue in this test case. I think this test case expects to try memory allocation in the {{grow()}} method four times. However, since the allocated memory is reused, the test performs memory allocation only once, with {{roundToWord(ARRAY_MAX / 2)}}; no memory allocations occur for the other sizes. If I update this test to perform four memory allocations with a {{4g}} heap size, an exception occurs. > BufferHolderSparkSubmitSuite is flaky > - > > Key: SPARK-23415 > URL: https://issues.apache.org/jira/browse/SPARK-23415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > The test suite fails due to 60-second timeout sometimes. > {code} > Error Message > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > failAfter did not complete within 60 seconds. > Stacktrace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > failAfter did not complete within 60 seconds. > {code} > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365931#comment-16365931 ] Kazuaki Ishizaki edited comment on SPARK-23415 at 2/15/18 5:02 PM: --- I realized there is an issue in this test case. I think this test case expects to try memory allocation in the {{grow()}} method four times. However, since the allocated memory is reused, the test performs memory allocation only once, with {{roundToWord(ARRAY_MAX / 2)}}; no memory allocations occur for the other sizes. If I update this test to perform four memory allocations with a {{4g}} heap size, an exception occurs. was (Author: kiszk): I realized that an issue in this test case. I think that this test case expects to try memory allocation in {{grow()} method four times. However, since the allocated memory is reused, this test performed memory allocation only once with {{roundToWord(ARRAY_MAX / 2)}}. No memory allocations occur with other sizes. If I updated this test to perform four memory allocations with {{4g}} heap size, an exception occurs. > BufferHolderSparkSubmitSuite is flaky > - > > Key: SPARK-23415 > URL: https://issues.apache.org/jira/browse/SPARK-23415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > The test suite fails due to 60-second timeout sometimes. > {code} > Error Message > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > failAfter did not complete within 60 seconds. > Stacktrace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > failAfter did not complete within 60 seconds. 
> {code} > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
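The buffer-reuse behaviour described in the comment above can be sketched with a toy holder (an assumption-laden simplification, not Spark's actual BufferHolder, and the grow sizes are illustrative): once the largest request has been satisfied, every smaller request fits in the existing allocation.

```python
class BufferHolder:
    """Minimal sketch of a grow() that reuses its backing buffer:
    allocation happens only when the requested size exceeds the
    current capacity."""
    def __init__(self):
        self.capacity = 0
        self.allocations = 0

    def grow(self, needed):
        # Only allocate when the request exceeds current capacity;
        # otherwise the previously allocated buffer is reused.
        if needed > self.capacity:
            self.capacity = needed     # stand-in for a new byte[] allocation
            self.allocations += 1

ARRAY_MAX = 2**31 - 16  # illustrative stand-in for the JVM array-size limit

holder = BufferHolder()
# Four grow requests, largest first: only the first one allocates.
for size in (ARRAY_MAX // 2, ARRAY_MAX // 4, ARRAY_MAX // 8, ARRAY_MAX // 16):
    holder.grow(size)

print(holder.allocations)  # 1
```

A test that assumes four allocations will therefore see only one unless each request is strictly larger than the last (which is what triggers the OOM under a 4g heap).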
[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib
[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365929#comment-16365929 ] Simon Dirmeier commented on SPARK-23437: Great suggestion. If there is a way to contribute I'd love to. > [ML] Distributed Gaussian Process Regression for MLlib > -- > > Key: SPARK-23437 > URL: https://issues.apache.org/jira/browse/SPARK-23437 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 2.2.1 >Reporter: Valeriy Avanesov >Priority: Major > > Gaussian Process Regression (GP) is a well-known black-box non-linear > regression approach [1]. For years the approach remained inapplicable to > large samples due to its cubic computational complexity; however, more recent > techniques (Sparse GP) allow for only linear complexity. The field > continues to attract the interest of researchers – several papers devoted to > GP were presented at NIPS 2017. > Unfortunately, the non-parametric regression techniques shipped with MLlib are > restricted to tree-based approaches. > I propose to create and include an implementation (which I am going to work > on) of the so-called robust Bayesian Committee Machine proposed and investigated > in [2]. > [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian > Processes for Machine Learning (Adaptive Computation and Machine Learning)_. > The MIT Press. > [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian > processes. In _Proceedings of the 32nd International Conference on > International Conference on Machine Learning - Volume 37_ (ICML'15), Francis > Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23438) DStreams could lose blocks with WAL enabled when driver crashes
[ https://issues.apache.org/jira/browse/SPARK-23438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23438: Assignee: (was: Apache Spark) > DStreams could lose blocks with WAL enabled when driver crashes > --- > > Key: SPARK-23438 > URL: https://issues.apache.org/jira/browse/SPARK-23438 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Gabor Somogyi >Priority: Critical > > There is a race condition introduced in SPARK-11141 which could cause data > loss. > This affects all versions since 1.6.0. > Problematic situation: > # Start streaming job with 2 receivers with WAL enabled. > # Receiver 1 receives a block and does the following > ** Writes a BlockAdditionEvent into WAL > ** Puts the block into its received block queue with ID 1 > # Receiver 2 receives a block and does the following > ** Writes a BlockAdditionEvent into WAL > # Spark allocates all blocks from its received block queue and writes > AllocatedBlocks(IDs=(1)) into WAL > # Driver crashes > # New Driver recovers from WAL > # Realise that the block with ID 2 was never processed > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23438) DStreams could lose blocks with WAL enabled when driver crashes
[ https://issues.apache.org/jira/browse/SPARK-23438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365928#comment-16365928 ] Apache Spark commented on SPARK-23438: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/20620 > DStreams could lose blocks with WAL enabled when driver crashes > --- > > Key: SPARK-23438 > URL: https://issues.apache.org/jira/browse/SPARK-23438 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Gabor Somogyi >Priority: Critical > > There is a race condition introduced in SPARK-11141 which could cause data > loss. > This affects all versions since 1.6.0. > Problematic situation: > # Start streaming job with 2 receivers with WAL enabled. > # Receiver 1 receives a block and does the following > ** Writes a BlockAdditionEvent into WAL > ** Puts the block into its received block queue with ID 1 > # Receiver 2 receives a block and does the following > ** Writes a BlockAdditionEvent into WAL > # Spark allocates all blocks from its received block queue and writes > AllocatedBlocks(IDs=(1)) into WAL > # Driver crashes > # New Driver recovers from WAL > # Realise block with ID 2 never processed > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23438) DStreams could lose blocks with WAL enabled when driver crashes
[ https://issues.apache.org/jira/browse/SPARK-23438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23438: Assignee: Apache Spark > DStreams could lose blocks with WAL enabled when driver crashes > --- > > Key: SPARK-23438 > URL: https://issues.apache.org/jira/browse/SPARK-23438 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Critical > > There is a race condition introduced in SPARK-11141 which could cause data > loss. > This affects all versions since 1.6.0. > Problematic situation: > # Start streaming job with 2 receivers with WAL enabled. > # Receiver 1 receives a block and does the following > ** Writes a BlockAdditionEvent into WAL > ** Puts the block into its received block queue with ID 1 > # Receiver 2 receives a block and does the following > ** Writes a BlockAdditionEvent into WAL > # Spark allocates all blocks from its received block queue and writes > AllocatedBlocks(IDs=(1)) into WAL > # Driver crashes > # New Driver recovers from WAL > # Realise block with ID 2 never processed > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23390: Assignee: Wenchen Fan (was: Apache Spark) > Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7 > -- > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. > {code} > Here's the full history: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/ > From a very quick look, these failures seem to be correlated with > https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from > the following stack trace (full logs > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > > {code} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at 
org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} > Also, while this might be just a false correlation but the frequency of these > test failures have increased considerably in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/ > after https://github.com/apache/spark/pull/20562 (cc > [~feng...@databricks.com]) was merged. > The following is Parquet leakage. > {code} > Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365927#comment-16365927 ] Apache Spark commented on SPARK-23390: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/20619 > Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7 > -- > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. > {code} > Here's the full history: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/ > From a very quick look, these failures seem to be correlated with > https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from > the following stack trace (full logs > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > > {code} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at 
org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} > Also, while this might be just a false correlation but the frequency of these > test failures have increased considerably in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/ > after https://github.com/apache/spark/pull/20562 (cc > [~feng...@databricks.com]) was merged. > The following is Parquet leakage. > {code} > Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23390: Assignee: Apache Spark (was: Wenchen Fan) > Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7 > -- > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Apache Spark >Priority: Major > Fix For: 2.3.0 > > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. > {code} > Here's the full history: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/ > From a very quick look, these failures seem to be correlated with > https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from > the following stack trace (full logs > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > > {code} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at 
org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} > Also, while this might be just a false correlation but the frequency of these > test failures have increased considerably in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/ > after https://github.com/apache/spark/pull/20562 (cc > [~feng...@databricks.com]) was merged. > The following is Parquet leakage. > {code} > Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
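The "possibly leaked file streams" failures above come from Spark's test-only {{DebugFilesystem}}, which records a stack trace for every stream it opens and reports any stream still open at suite teardown. A rough plain-Python analogue of that idea, to make the mechanism concrete (illustrative only, not Spark's Scala implementation):

```python
import traceback

# Registry of live streams: id -> the stack that opened it, so a leak
# report can say where the stream was created (as in the logs above).
_open_streams = {}

class DebugStream:
    """Wraps a file object and remembers the stack that opened it."""
    def __init__(self, f):
        self._f = f
        _open_streams[id(self)] = "".join(traceback.format_stack(limit=5))
    def read(self, *args):
        return self._f.read(*args)
    def write(self, data):
        return self._f.write(data)
    def close(self):
        _open_streams.pop(id(self), None)  # closed streams are not leaks
        self._f.close()

def debug_open(path, mode="r"):
    return DebugStream(open(path, mode))

def assert_no_open_streams():
    # Mirrors the eventually-retried teardown check in the failing suite
    if _open_streams:
        raise AssertionError(
            f"There are {len(_open_streams)} possibly leaked file streams")
```

Under this scheme a reader that is cancelled mid-task without reaching its close() call, which is what the ORC and Parquet stack traces suggest, shows up as a leak with the creation site attached.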
[jira] [Updated] (SPARK-23340) Upgrade Apache ORC to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23340: Priority: Major (was: Blocker) > Upgrade Apache ORC to 1.4.3 > --- > > Key: SPARK-23340 > URL: https://issues.apache.org/jira/browse/SPARK-23340 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > > This issue updates Apache ORC dependencies to 1.4.3 released on February 9th. > Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 > more patches including bug fixes (https://s.apache.org/Fll8). > Especially, the following ORC-285 is fixed at 1.4.3. > {code} > scala> val df = Seq(Array.empty[Float]).toDF() > scala> df.write.format("orc").save("/tmp/floatarray") > scala> spark.read.orc("/tmp/floatarray") > res1: org.apache.spark.sql.DataFrame = [value: array] > scala> spark.read.orc("/tmp/floatarray").show() > 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.io.IOException: Error reading file: > file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78) > ... > Caused by: java.io.EOFException: Read past EOF for compressed stream Stream > for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23340) Upgrade Apache ORC to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23340. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.3.0 > Upgrade Apache ORC to 1.4.3 > --- > > Key: SPARK-23340 > URL: https://issues.apache.org/jira/browse/SPARK-23340 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > Fix For: 2.3.0 > > > This issue updates Apache ORC dependencies to 1.4.3 released on February 9th. > Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 > more patches including bug fixes (https://s.apache.org/Fll8). > Especially, the following ORC-285 is fixed at 1.4.3. > {code} > scala> val df = Seq(Array.empty[Float]).toDF() > scala> df.write.format("orc").save("/tmp/floatarray") > scala> spark.read.orc("/tmp/floatarray") > res1: org.apache.spark.sql.DataFrame = [value: array] > scala> spark.read.orc("/tmp/floatarray").show() > 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.io.IOException: Error reading file: > file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78) > ... > Caused by: java.io.EOFException: Read past EOF for compressed stream Stream > for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23426) Use `hive` ORC impl and disable PPD for Spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-23426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23426. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.3.0 > Use `hive` ORC impl and disable PPD for Spark 2.3.0 > --- > > Key: SPARK-23426 > URL: https://issues.apache.org/jira/browse/SPARK-23426 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365904#comment-16365904 ] Lucas Partridge commented on SPARK-17025: - What's the up-to-date status for this please? The previous comment implies testing is still needed whereas the article at [https://databricks.com/blog/2017/08/30/developing-custom-machine-learning-algorithms-in-pyspark.html] seems to imply everything's ready. Does anyone know which version of Spark this will make it into? > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. 
> {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
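The failure mode in the traceback above is mechanical: pipeline persistence walks every stage and calls {{_to_java()}} to obtain the JVM counterpart that the Scala writer serializes, so any Python-only stage breaks the chain. A toy model of that loop (class and function names here are illustrative, not PySpark's internals):

```python
# Toy model of why saving fails: persistence converts every stage to its
# JVM twin via _to_java(); a Python-only transformer has no such method.

class JavaBackedStage:
    """Stand-in for a stage with a JVM counterpart."""
    def _to_java(self):
        return "jvmStageHandle"  # stand-in for the py4j JavaObject

class PurePythonStage:
    """Stand-in for a custom Transformer with no JVM counterpart."""

def pipeline_to_java(stages):
    # Mirrors the shape of the loop behind pipeline.py's _to_java
    return [stage._to_java() for stage in stages]
```

Calling {{pipeline_to_java([JavaBackedStage(), PurePythonStage()])}} raises the same kind of {{AttributeError: ... has no attribute '_to_java'}} as the report. On the status question: Spark 2.3 adds the {{DefaultParamsReadable}}/{{DefaultParamsWritable}} mixins in {{pyspark.ml.util}}, which give Python-only stages a persistence path that does not go through the {{_to_java}} bridge.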
[jira] [Updated] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23390: -- Description: We're seeing multiple failures in {{FileBasedDataSourceSuite}} in {{spark-branch-2.3-test-sbt-hadoop-2.7}}: {code} org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 15 times over 10.01215805999 seconds. Last failure message: There are 1 possibly leaked file streams.. {code} Here's the full history: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/ >From a very quick look, these failures seem to be correlated with >https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from >the following stack trace (full logs >[here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > {code} [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem connection created at: java.lang.Throwable at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) at org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) {code} Also, while this might be just a false 
correlation but the frequency of these test failures have increased considerably in https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/ after https://github.com/apache/spark/pull/20562 (cc [~feng...@databricks.com]) was merged. The following is Parquet leakage. {code} Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) at org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538) at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) {code} was: We're seeing multiple failures in {{FileBasedDataSourceSuite}} in {{spark-branch-2.3-test-sbt-hadoop-2.7}}: {code} org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 15 times over 10.01215805999 seconds. Last failure message: There are 1 possibly leaked file streams.. 
{code} Here's the full history: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/ >From a very quick look, these failures seem to be correlated with >https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from >the following stack trace (full logs >[here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > {code} [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) 15:55:58
[jira] [Reopened] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-23390: --- I'm reopening this issue because Parquet leakage was detected. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/205/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ {code} Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null at org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) at org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538) at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) {code} > Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7 > -- > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sameer 
Agarwal >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. > {code} > Here's the full history: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/ > From a very quick look, these failures seem to be correlated with > https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from > the following stack trace (full logs > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > > {code} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} > Also, while this might be just a false correlation but the frequency of these > 
test failures has increased considerably in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/ > after https://github.com/apache/spark/pull/20562 (cc > [~feng...@databricks.com]) was merged. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23439) Ambiguous reference when selecting column inside StructType with the same name as an outer column
Alejandro Trujillo Caballero created SPARK-23439: Summary: Ambiguous reference when selecting column inside StructType with the same name as an outer column Key: SPARK-23439 URL: https://issues.apache.org/jira/browse/SPARK-23439 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Environment: Scala 2.11.8, Spark 2.2.0 Reporter: Alejandro Trujillo Caballero Hi. I've seen that when working with nested struct fields in a DataFrame, a select operation loses the nesting, which can result in collisions between column names. For example: {code:java} case class Foo(a: Int, b: Bar) case class Bar(a: Int) val items = List( Foo(1, Bar(1)), Foo(2, Bar(2)) ) val df = spark.createDataFrame(items) df.select($"a", $"b.a").show //+---+---+ //| a| a| //+---+---+ //| 1| 1| //| 2| 2| //+---+---+ df.select($"a", $"b.a").printSchema //root //|-- a: integer (nullable = false) //|-- a: integer (nullable = true) df.select($"a", $"b.a").select($"a") //org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#9, a#{code} Shouldn't the second column be named "b.a"? Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
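The collision above can be sketched without Spark. This is a minimal, hedged illustration in plain Python; the helper names (`flatten_leaf`, `flatten_qualified`) are hypothetical stand-ins for the two naming behaviors, not Spark internals.

```python
# Minimal sketch (plain Python, no Spark) of why flattening a nested
# field to its leaf name collides with a same-named top-level column.
# Helper names here are hypothetical, for illustration only.

def flatten_leaf(path):
    # Observed behavior: select($"b.a") names the output column "a".
    return path.split(".")[-1]

def flatten_qualified(path):
    # Behavior the report suggests: keep the full path "b.a".
    return path

selected = ["a", "b.a"]
leaf_names = [flatten_leaf(p) for p in selected]
qualified_names = [flatten_qualified(p) for p in selected]

assert leaf_names == ["a", "a"]          # both columns named "a": ambiguous
assert qualified_names == ["a", "b.a"]   # qualified names stay distinct
```

With leaf-only naming, a later `select($"a")` has two candidate columns, which matches the AnalysisException shown in the report.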
[jira] [Commented] (SPARK-23438) DStreams could lose blocks with WAL enabled when driver crashes
[ https://issues.apache.org/jira/browse/SPARK-23438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365861#comment-16365861 ] Gabor Somogyi commented on SPARK-23438: --- I'm working on that. > DStreams could lose blocks with WAL enabled when driver crashes > --- > > Key: SPARK-23438 > URL: https://issues.apache.org/jira/browse/SPARK-23438 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Gabor Somogyi >Priority: Critical > > There is a race condition introduced in SPARK-11141 which could cause data > loss. > This affects all versions since 1.6.0. > Problematic situation: > # Start a streaming job with 2 receivers and WAL enabled. > # Receiver 1 receives a block and does the following: > ** Writes a BlockAdditionEvent into the WAL > ** Puts the block into its received-block queue with ID 1 > # Receiver 2 receives a block and does the following: > ** Writes a BlockAdditionEvent into the WAL > # Spark allocates all blocks from its received-block queue and writes > AllocatedBlocks(IDs=(1)) into the WAL > # The driver crashes > # A new driver recovers from the WAL > # It realizes that the block with ID 2 was never processed > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23438) DStreams could lose blocks with WAL enabled when driver crashes
Gabor Somogyi created SPARK-23438: - Summary: DStreams could lose blocks with WAL enabled when driver crashes Key: SPARK-23438 URL: https://issues.apache.org/jira/browse/SPARK-23438 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 1.6.0 Reporter: Gabor Somogyi There is a race condition introduced in SPARK-11141 which could cause data loss. This affects all versions since 1.6.0. Problematic situation: # Start a streaming job with 2 receivers and WAL enabled. # Receiver 1 receives a block and does the following: ** Writes a BlockAdditionEvent into the WAL ** Puts the block into its received-block queue with ID 1 # Receiver 2 receives a block and does the following: ** Writes a BlockAdditionEvent into the WAL # Spark allocates all blocks from its received-block queue and writes AllocatedBlocks(IDs=(1)) into the WAL # The driver crashes # A new driver recovers from the WAL # It realizes that the block with ID 2 was never processed -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
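The race in the numbered steps above can be simulated in a few lines of plain Python. This is a hedged sketch only: `wal` and `received_queue` are simplified stand-ins for the driver's write-ahead log and received-block queue, not the actual DStreams data structures.

```python
# Minimal sketch (plain Python) of the race: the AllocatedBlocks entry is
# written after receiver 2's BlockAdditionEvent but before its block is
# enqueued, so recovery sees an allocation that omits block 2.

wal = []              # write-ahead log, replayed on recovery
received_queue = []   # driver-side queue of received block IDs

# Receiver 1: logs the addition AND enqueues its block (ID 1).
wal.append(("BlockAdditionEvent", 1))
received_queue.append(1)

# Receiver 2: logs the addition, but the enqueue has not happened yet...
wal.append(("BlockAdditionEvent", 2))

# ...when the driver allocates "all" blocks currently in the queue.
wal.append(("AllocatedBlocks", tuple(received_queue)))

# Driver crashes; a new driver replays the WAL.
processed = set()
for event, payload in wal:
    if event == "AllocatedBlocks":
        processed.update(payload)

added = {payload for event, payload in wal if event == "BlockAdditionEvent"}
lost = added - processed
assert lost == {2}   # block 2 was logged as added but never allocated
```

The sketch shows why the new driver can prove block 2 was added yet finds no allocation covering it, which is the data-loss window the report describes.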
[jira] [Updated] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib
[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valeriy Avanesov updated SPARK-23437: - Summary: [ML] Distributed Gaussian Process Regression for MLlib (was: Distributed Gaussian Process Regression for MLlib) > [ML] Distributed Gaussian Process Regression for MLlib > -- > > Key: SPARK-23437 > URL: https://issues.apache.org/jira/browse/SPARK-23437 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Affects Versions: 2.2.1 >Reporter: Valeriy Avanesov >Priority: Major > > Gaussian Process Regression (GP) is a well-known black-box non-linear > regression approach [1]. For years the approach remained inapplicable to > large samples due to its cubic computational complexity; however, more recent > techniques (Sparse GP) achieve linear complexity. The field > continues to attract the interest of researchers – several papers devoted to > GP were presented at NIPS 2017. > Unfortunately, the non-parametric regression techniques shipped with MLlib are > restricted to tree-based approaches. > I propose to create and include an implementation (which I am going to work > on) of the so-called robust Bayesian Committee Machine proposed and investigated > in [2]. > [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian > Processes for Machine Learning (Adaptive Computation and Machine Learning)_. > The MIT Press. > [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian > processes. In _Proceedings of the 32nd International Conference on > International Conference on Machine Learning - Volume 37_ (ICML'15), Francis > Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23437) Distributed Gaussian Process Regression for MLlib
Valeriy Avanesov created SPARK-23437: Summary: Distributed Gaussian Process Regression for MLlib Key: SPARK-23437 URL: https://issues.apache.org/jira/browse/SPARK-23437 Project: Spark Issue Type: New Feature Components: ML, MLlib Affects Versions: 2.2.1 Reporter: Valeriy Avanesov Gaussian Process Regression (GP) is a well-known black-box non-linear regression approach [1]. For years the approach remained inapplicable to large samples due to its cubic computational complexity; however, more recent techniques (Sparse GP) achieve linear complexity. The field continues to attract the interest of researchers – several papers devoted to GP were presented at NIPS 2017. Unfortunately, the non-parametric regression techniques shipped with MLlib are restricted to tree-based approaches. I propose to create and include an implementation (which I am going to work on) of the so-called robust Bayesian Committee Machine proposed and investigated in [2]. [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning)_. The MIT Press. [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian processes. In _Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37_ (ICML'15), Francis Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23436) Incorrect Date column Inference in partition discovery
[ https://issues.apache.org/jira/browse/SPARK-23436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365780#comment-16365780 ] Marco Gaido commented on SPARK-23436: - Thanks for reporting this. This also affects the current branch. I am working on it. > Incorrect Date column Inference in partition discovery > -- > > Key: SPARK-23436 > URL: https://issues.apache.org/jira/browse/SPARK-23436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Apoorva Sareen >Priority: Major > > If a partition column contains a partial date/timestamp, for example 2018-01-01-23, > truncated to the hour, then the data type of that partitioning column is > automatically inferred as date; however, the values are loaded as null. > Here is example code to reproduce this behaviour: > > > {code:java} > val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", > "date_month", "data_hour", "data") > data.write.partitionBy("id","date_month","data_hour").parquet("output/test") > val input = spark.read.parquet("output/test") > input.printSchema() > input.show() > ### Result ### > root > |-- data: string (nullable = true) > |-- id: integer (nullable = true) > |-- date_month: string (nullable = true) > |-- data_hour: date (nullable = true) > +----+---+----------+---------+ > |data| id|date_month|data_hour| > +----+---+----------+---------+ > |test| 1| 2018-01| null| > +----+---+----------+---------+{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365468#comment-16365468 ] Stavros Kontopoulos edited comment on SPARK-23423 at 2/15/18 3:37 PM: -- The delivery of mesos task updates is a responsibility of the mesos master AFAIK. Btw you should see the status of tasks in the agent log (assuming you are not losing agents, right?) where the mesos executor was running. [~susanxhuynh] is it possible that under heavy load the task updates are not delivered? [~igor.berman] could you attach any log available? was (Author: skonto): The delivery of mesos task updates is a responsibility of the mesos master AFAIK. Btw you should see the status of tasks in the agent log (assuming you are not losing agents, right?) where the mesos executor was running. [~susanxhuynh] is it possible that under heavy load the task updates are not delivered? > Application declines any offers when killed+active executors reach > spark.dynamicAllocation.maxExecutors > -- > > Key: SPARK-23423 > URL: https://issues.apache.org/jira/browse/SPARK-23423 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.2.1 >Reporter: Igor Berman >Priority: Major > Labels: Mesos, dynamic_allocation > > Hi > Mesos Version: 1.1.0 > I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend > when running on Mesos with dynamic allocation on and limiting the number of max > executors by spark.dynamicAllocation.maxExecutors. > Suppose we have a long-running driver with a cyclic pattern of resource > consumption (with some idle times in between); due to dynamic allocation it > receives offers and then releases them after the current chunk of work is processed. > Since at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573] > the backend compares numExecutors < executorLimit, and > numExecutors is defined as slaves.values.map(_.taskIDs.size).sum, and slaves > holds all slaves ever "met", i.e. both active and killed (see the comment at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)] > > On the other hand, the number of taskIds should be updated via statusUpdate, > but suppose this update is lost (indeed, I don't see 'is now > TASK_KILLED' in the logs), so this number of executors might be wrong. > > I've created a test that "reproduces" this behavior; not sure how good it is: > {code:java} > //MesosCoarseGrainedSchedulerBackendSuite > test("max executors registered stops to accept offers when dynamic allocation > enabled") { > setBackend(Map( > "spark.dynamicAllocation.maxExecutors" -> "1", > "spark.dynamicAllocation.enabled" -> "true", > "spark.dynamicAllocation.testing" -> "true")) > backend.doRequestTotalExecutors(1) > val (mem, cpu) = (backend.executorMemory(sc), 4) > val offer1 = createOffer("o1", "s1", mem, cpu) > backend.resourceOffers(driver, List(offer1).asJava) > verifyTaskLaunched(driver, "o1") > backend.doKillExecutors(List("0")) > verify(driver, times(1)).killTask(createTaskId("0")) > val offer2 = createOffer("o2", "s2", mem, cpu) > backend.resourceOffers(driver, List(offer2).asJava) > verify(driver, times(1)).declineOffer(offer2.getId) > }{code} > > > Workaround: don't set maxExecutors with dynamicAllocation on. > > Please advise, > Igor > tagging you since you were the last to touch this piece of code and can > probably advise something ([~vanzin], [~skonto], [~susanxhuynh]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail:
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23436) Incorrect Date column Inference in partition discovery
Apoorva Sareen created SPARK-23436: -- Summary: Incorrect Date column Inference in partition discovery Key: SPARK-23436 URL: https://issues.apache.org/jira/browse/SPARK-23436 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: Apoorva Sareen If a partition column contains a partial date/timestamp, for example 2018-01-01-23, truncated to the hour, then the data type of that partitioning column is automatically inferred as date; however, the values are loaded as null. Here is example code to reproduce this behaviour: {code:java} val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", "date_month", "data_hour", "data") data.write.partitionBy("id","date_month","data_hour").parquet("output/test") val input = spark.read.parquet("output/test") input.printSchema() input.show() ### Result ### root |-- data: string (nullable = true) |-- id: integer (nullable = true) |-- date_month: string (nullable = true) |-- data_hour: date (nullable = true) +----+---+----------+---------+ |data| id|date_month|data_hour| +----+---+----------+---------+ |test| 1| 2018-01| null| +----+---+----------+---------+{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
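The inference/read mismatch above can be illustrated without Spark. The two parsers below are hedged stand-ins, not Spark's actual inference code: the point is only that a lenient parse of a prefix can succeed where a strict parse of the whole value fails, which yields a date-typed column full of nulls.

```python
# Minimal sketch (plain Python) of the mismatch: a lenient parser accepts
# "2018-01-01-04" (date prefix parses), so the column type is inferred as
# date, while strict whole-value conversion at read time returns null.
from datetime import datetime

def lenient_parse_date(s):
    """Stand-in for type inference: accept a string whose 10-char prefix
    is a valid date."""
    try:
        return datetime.strptime(s[:10], "%Y-%m-%d").date()
    except ValueError:
        return None

def strict_parse_date(s):
    """Stand-in for read-time conversion: the entire value must be a date."""
    try:
        return datetime.strptime(s, "%Y-%m-%d").date()
    except ValueError:
        return None

value = "2018-01-01-04"  # hour-truncated partition value from the report
assert lenient_parse_date(value) is not None  # type inferred as date...
assert strict_parse_date(value) is None       # ...but the value loads as null
```

A full date such as "2018-01-01" passes both parsers, which is why ordinary date partitions are unaffected.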
[jira] [Commented] (SPARK-21302) history server WebUI show HTTP ERROR 500
[ https://issues.apache.org/jira/browse/SPARK-21302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365691#comment-16365691 ] bharath kumar commented on SPARK-21302: --- From what I have noticed, jobs are progressing even though we have the NPE. Once the jobs are completed, we can view the stats for the Application Master. > history server WebUI show HTTP ERROR 500 > > > Key: SPARK-21302 > URL: https://issues.apache.org/jira/browse/SPARK-21302 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.1 >Reporter: Jason Pan >Priority: Major > Attachments: npe.PNG, nullpointer.PNG > > > When navigating to the history server WebUI and checking incomplete applications, > it shows HTTP 500 > Error logs: > 17/07/05 20:17:44 INFO ApplicationCacheCheckFilter: Application Attempt > app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/None updated; > refreshing > 17/07/05 20:17:44 WARN ServletHandler: > /history/app-20170705201715-0005-0ce78623-38db-4d23-a2b2-8cb45bb3f505/executors/ > java.lang.NullPointerException > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at org.spark_project.jetty.server.Server.handle(Server.java:499) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at >
org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) > at > org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) > at java.lang.Thread.run(Thread.java:785) > 17/07/05 20:18:00 WARN ServletHandler: / > java.lang.NullPointerException > at > org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) > at > org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) > at > org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.spark_project.jetty.servlets.gzip.GzipHandler.handle(GzipHandler.java:479) > at > org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) > at > org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at org.spark_project.jetty.server.Server.handle(Server.java:499) > at > org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:257) > at > org.spark_project.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544) > at > org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) > at > org.spark_project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) > at java.lang.Thread.run(Thread.java:785) > 17/07/05 20:18:17 WARN ServletHandler: / > java.lang.NullPointerException > at > 
org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652) > at > org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) > at > org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23410) Unable to read jsons in charset different from UTF-8
[ https://issues.apache.org/jira/browse/SPARK-23410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365666#comment-16365666 ] Hyukjin Kwon commented on SPARK-23410: -- It's reverted in https://github.com/apache/spark/pull/20614 anyway. Shall we unset the blocker? > Unable to read jsons in charset different from UTF-8 > > > Key: SPARK-23410 > URL: https://issues.apache.org/jira/browse/SPARK-23410 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Blocker > Attachments: utf16WithBOM.json > > > Currently the JSON parser is forced to read JSON files in UTF-8. This > behavior breaks backward compatibility with Spark 2.2.1 and previous versions, > which could read JSON files in UTF-16, UTF-32 and other encodings thanks to > the auto-detection mechanism of the jackson library. We need to give users > back the ability to read JSON files in a specified charset and/or to detect the > charset automatically, as before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
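The difference between BOM-based auto-detection and forcing UTF-8 can be shown with a few lines of plain Python. This is a hedged stand-in for jackson's detection logic, not the actual Spark or jackson code, using the UTF-16 little-endian case as in the attached utf16WithBOM.json.

```python
# Minimal sketch: a UTF-16LE file with a BOM decodes fine when the BOM is
# honored, but raises when the reader is forced to treat it as UTF-8.
import codecs

text = '{"a": 1}'
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# BOM-based auto-detection (roughly what jackson did in Spark <= 2.2.1):
if data.startswith(codecs.BOM_UTF16_LE):
    decoded = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
else:
    decoded = data.decode("utf-8")
assert decoded == text

# Forcing UTF-8 (the 2.3.0 regression this issue describes): the BOM
# byte 0xFF is not valid UTF-8, so decoding fails outright.
try:
    data.decode("utf-8")
    forced_ok = True
except UnicodeDecodeError:
    forced_ok = False
assert forced_ok is False
```

In Spark the failure mode is garbled or unreadable records rather than an immediate exception, but the root cause is the same: the declared charset no longer matches the bytes.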
[jira] [Resolved] (SPARK-23405) The task hangs when a small table is left-semi-joined with a big table
[ https://issues.apache.org/jira/browse/SPARK-23405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23405. --- Resolution: Invalid > The task hangs when a small table is left-semi-joined with a big table > --- > > Key: SPARK-23405 > URL: https://issues.apache.org/jira/browse/SPARK-23405 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: KaiXinXIaoLei >Priority: Major > Attachments: SQL.png, taskhang up.png > > > I run a SQL query: `select ls.cs_order_number from ls left semi join catalog_sales > cs on ls.cs_order_number = cs.cs_order_number`. The `ls` table is a small > table with a single row. The `catalog_sales` table is a big table with 10 billion rows. The task hangs: > !taskhang up.png! > And the SQL page is: > !SQL.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database
[ https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365549#comment-16365549 ] Sean Owen commented on SPARK-23402: --- This is all just the master branch of github.com/apache/spark and yes, you need to build it yourself. > Dataset write method not working as expected for postgresql database > > > Key: SPARK-23402 > URL: https://issues.apache.org/jira/browse/SPARK-23402 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 > Environment: PostgreSQL: 9.5.8 (10 + Also same issue) > OS: Cent OS 7 & Windows 7,8 > JDBC: 9.4-1201-jdbc41 > > Spark: I executed in both 2.1.0 and 2.2.1 > Mode: Standalone > OS: Windows 7 >Reporter: Pallapothu Jyothi Swaroop >Priority: Major > Attachments: Emsku[1].jpg > > > I am using the Spark dataset write method to insert data into an existing PostgreSQL > table, using append mode. While doing so, I get an exception saying the > table already exists, even though I specified append mode. > Strangely, when I point the same code at SQL Server or Oracle, append mode works > as expected.
> > *Database Properties:* > {{destinationProps.put("driver", "org.postgresql.Driver"); > destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); > destinationProps.put("user", "dbmig");}} > {{destinationProps.put("password", "dbmig");}} > > *Dataset Write Code:* > {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"), > "dqvalue", destinationdbProperties);}} > > > {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: > relation "dqvalue" already exists at > org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412) > at > org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125) > at > org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) > at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at > org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at > org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at > org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at > org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at > org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at > org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at > com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25) > at > com.ads.dqam.action.AbstractValueAnalysis.persistAnalysis(AbstractValueAnalysis.java:81) > at com.ads.dqam.Analysis.doAnalysis(Analysis.java:32) at > com.ads.dqam.Client.main(Client.java:71)}} > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException
[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365495#comment-16365495 ] Steve Loughran commented on SPARK-23308: I'm going to recommend this is closed as a WONTFIX. It is not the place of Spark to determine what is recoverable; it would never get it right. > ignoreCorruptFiles should not ignore retryable IOException > -- > > Key: SPARK-23308 > URL: https://issues.apache.org/jira/browse/SPARK-23308 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Márcio Furlani Carmona >Priority: Minor > > When `spark.sql.files.ignoreCorruptFiles` is set, it totally ignores any kind > of RuntimeException or IOException, but some IOExceptions may happen > even if the file is not corrupted. > One example is SocketTimeoutException, which can be retried to possibly > fetch the data, without the data being corrupted. > > See: > https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
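For illustration only, here is the kind of retryable/non-retryable split the report asks for, sketched in plain Python. The comment above argues such a list can never be complete or correct in general, which is exactly why this is a hedged sketch and not a proposed implementation; the Java exception class names are modeled as strings.

```python
# Illustrative sketch of classifying exceptions under ignoreCorruptFiles:
# ignore errors that plausibly mean a corrupt file, but surface transient
# errors that a retry could recover from. The RETRYABLE set is an
# assumption for the example, not an exhaustive or authoritative list.

RETRYABLE = {"SocketTimeoutException", "ConnectException"}

def should_ignore(exc_name, ignore_corrupt_files=True):
    """Return True if the error should be swallowed as corruption."""
    if not ignore_corrupt_files:
        return False
    return exc_name not in RETRYABLE

assert should_ignore("EOFException") is True             # treated as corruption
assert should_ignore("SocketTimeoutException") is False  # surface it; retry
```

The sketch also makes the WONTFIX argument concrete: any fixed `RETRYABLE` set encodes a judgment about remote systems that Spark cannot reliably make.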
[jira] [Updated] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Berman updated SPARK-23423: Labels: Mesos dynamic_allocation (was: ) > Application declines any offers when killed+active executors reach > spark.dynamicAllocation.maxExecutors > -- > > Key: SPARK-23423 > URL: https://issues.apache.org/jira/browse/SPARK-23423 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.2.1 >Reporter: Igor Berman >Priority: Major > Labels: Mesos, dynamic_allocation > > Hi > Mesos Version: 1.1.0 > I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend > when running on Mesos with dynamic allocation on and limiting the number of max > executors by spark.dynamicAllocation.maxExecutors. > Suppose we have a long-running driver with a cyclic pattern of resource > consumption (with some idle times in between); due to dynamic allocation it > receives offers and then releases them after the current chunk of work is processed. > Since at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573] > the backend compares numExecutors < executorLimit, and > numExecutors is defined as slaves.values.map(_.taskIDs.size).sum, and slaves > holds all slaves ever "met", i.e. both active and killed (see the comment at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)] > > On the other hand, the number of taskIds should be updated via statusUpdate, > but suppose this update is lost (indeed, I don't see 'is now > TASK_KILLED' in the logs), so this number of executors might be wrong. > > I've created a test that "reproduces" this behavior; not sure how good it is: > {code:java} > //MesosCoarseGrainedSchedulerBackendSuite > test("max executors registered stops to accept offers when dynamic allocation > enabled") { > setBackend(Map( > "spark.dynamicAllocation.maxExecutors" -> "1", > "spark.dynamicAllocation.enabled" -> "true", > "spark.dynamicAllocation.testing" -> "true")) > backend.doRequestTotalExecutors(1) > val (mem, cpu) = (backend.executorMemory(sc), 4) > val offer1 = createOffer("o1", "s1", mem, cpu) > backend.resourceOffers(driver, List(offer1).asJava) > verifyTaskLaunched(driver, "o1") > backend.doKillExecutors(List("0")) > verify(driver, times(1)).killTask(createTaskId("0")) > val offer2 = createOffer("o2", "s2", mem, cpu) > backend.resourceOffers(driver, List(offer2).asJava) > verify(driver, times(1)).declineOffer(offer2.getId) > }{code} > > > Workaround: don't set maxExecutors with dynamicAllocation on. > > Please advise, > Igor > tagging you since you were the last to touch this piece of code and can > probably advise something ([~vanzin], [~skonto], [~susanxhuynh]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365468#comment-16365468 ] Stavros Kontopoulos edited comment on SPARK-23423 at 2/15/18 12:31 PM: --- The delivery of mesos task updates is a responsibility of the mesos master AFAIK. Btw you should see the status of tasks in the agent log (assuming you are not losing agents, right?) where the mesos executor was running. [~susanxhuynh] is it possible that under heavy load the task updates are not delivered? was (Author: skonto): The task updates delivery is a responsibility of the mesos master AFAIK. Btw you should see the status of tasks in the agent log (assuming you are not losing agents, right?) where the mesos executor was running. [~susanxhuynh] is it possible that under heavy load the task updates are not delivered? > Application declines any offers when killed+active executors reach > spark.dynamicAllocation.maxExecutors > -- > > Key: SPARK-23423 > URL: https://issues.apache.org/jira/browse/SPARK-23423 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.2.1 >Reporter: Igor Berman >Priority: Major > > Hi > Mesos Version: 1.1.0 > I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend > when running on Mesos with dynamic allocation on and limiting the number of max > executors by spark.dynamicAllocation.maxExecutors. > Suppose we have a long-running driver with a cyclic pattern of resource > consumption (with some idle times in between); due to dynamic allocation it > receives offers and then releases them after the current chunk of work is processed. > Since at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573] > the backend compares numExecutors < executorLimit, and > numExecutors is defined as slaves.values.map(_.taskIDs.size).sum, and slaves > holds all slaves ever "met", i.e. both active and killed (see the comment at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)] > > On the other hand, the number of taskIds should be updated via statusUpdate, > but suppose this update is lost (indeed, I don't see 'is now > TASK_KILLED' in the logs), so this number of executors might be wrong. > > I've created a test that "reproduces" this behavior; not sure how good it is: > {code:java} > //MesosCoarseGrainedSchedulerBackendSuite > test("max executors registered stops to accept offers when dynamic allocation > enabled") { > setBackend(Map( > "spark.dynamicAllocation.maxExecutors" -> "1", > "spark.dynamicAllocation.enabled" -> "true", > "spark.dynamicAllocation.testing" -> "true")) > backend.doRequestTotalExecutors(1) > val (mem, cpu) = (backend.executorMemory(sc), 4) > val offer1 = createOffer("o1", "s1", mem, cpu) > backend.resourceOffers(driver, List(offer1).asJava) > verifyTaskLaunched(driver, "o1") > backend.doKillExecutors(List("0")) > verify(driver, times(1)).killTask(createTaskId("0")) > val offer2 = createOffer("o2", "s2", mem, cpu) > backend.resourceOffers(driver, List(offer2).asJava) > verify(driver, times(1)).declineOffer(offer2.getId) > }{code} > > > Workaround: don't set maxExecutors with dynamicAllocation on. > > Please advise, > Igor > tagging you since you were the last to touch this piece of code and can > probably advise something ([~vanzin], [~skonto], [~susanxhuynh]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail:
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
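The accounting problem described in the quoted report can be sketched as follows. This is a minimal illustrative model (the names `Slave`, `numExecutors`, and `acceptsOffers` are simplified stand-ins, not the actual backend code): because the executor count sums task IDs across every slave ever seen, a lost TASK_KILLED update leaves a phantom task counted against the limit indefinitely.

```scala
// Hypothetical simplified model of the executor accounting in the report:
// numExecutors = slaves.values.map(_.taskIDs.size).sum, where `slaves` holds
// every agent ever offered a task, active or not. A TASK_KILLED update that
// never arrives means the task ID is never removed, so offers are declined
// forever once the stale count reaches the executor limit.
object ExecutorCountSketch {
  final case class Slave(taskIDs: Set[String])

  // Mirrors the sum from the report: counts task IDs over all slaves ever "met".
  def numExecutors(slaves: Map[String, Slave]): Int =
    slaves.values.map(_.taskIDs.size).sum

  // Offers are accepted only while numExecutors < executorLimit.
  def acceptsOffers(slaves: Map[String, Slave], executorLimit: Int): Boolean =
    numExecutors(slaves) < executorLimit

  // s1's executor was killed, but the TASK_KILLED update was lost, so task "0"
  // was never removed from its set; s2's executor "1" is genuinely running.
  val staleView: Map[String, Slave] = Map(
    "s1" -> Slave(Set("0")),
    "s2" -> Slave(Set("1"))
  )
}
```

With `maxExecutors = 2`, the stale view counts two executors even though only one is alive, so every new offer is declined; delivering the lost update (emptying s1's task set) would free the slot again.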
[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365468#comment-16365468 ] Stavros Kontopoulos commented on SPARK-23423: - The task updates delivery is a responsibility of the Mesos master, AFAIK. Btw, you should see the status of tasks in the log of the agent where the Mesos executor was running (assuming you are not losing agents, right?). [~susanxhuynh] is it possible that under heavy load the task updates are not delivered? > Application declines any offers when killed+active executors rich > spark.dynamicAllocation.maxExecutors > -- > > Key: SPARK-23423 > URL: https://issues.apache.org/jira/browse/SPARK-23423 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.2.1 >Reporter: Igor Berman >Priority: Major > > Hi > Mesos Version:1.1.0 > I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend > when running on Mesos with dynamic allocation on and limiting number of max > executors by spark.dynamicAllocation.maxExecutors. > Suppose we have long running driver that has cyclic pattern of resource > consumption(with some idle times in between), due to dyn.allocation it > receives offers and then releases them after current chunk of work processed. > Since at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573] > the backend compares numExecutors < executorLimit and > numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves > holds all slaves ever "met", i.e. 
both active and killed (see comment > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)] > > On the other hand, number of taskIds should be updated due to statusUpdate, > but suppose this update is lost(actually I don't see logs of 'is now > TASK_KILLED') so this number of executors might be wrong > > I've created test that "reproduces" this behavior, not sure how good it is: > {code:java} > //MesosCoarseGrainedSchedulerBackendSuite > test("max executors registered stops to accept offers when dynamic allocation > enabled") { > setBackend(Map( > "spark.dynamicAllocation.maxExecutors" -> "1", > "spark.dynamicAllocation.enabled" -> "true", > "spark.dynamicAllocation.testing" -> "true")) > backend.doRequestTotalExecutors(1) > val (mem, cpu) = (backend.executorMemory(sc), 4) > val offer1 = createOffer("o1", "s1", mem, cpu) > backend.resourceOffers(driver, List(offer1).asJava) > verifyTaskLaunched(driver, "o1") > backend.doKillExecutors(List("0")) > verify(driver, times(1)).killTask(createTaskId("0")) > val offer2 = createOffer("o2", "s2", mem, cpu) > backend.resourceOffers(driver, List(offer2).asJava) > verify(driver, times(1)).declineOffer(offer2.getId) > }{code} > > > Workaround: Don't set maxExecutors with dynamicAllocation on > > Please advice > Igor > marking you friends since you were last to touch this piece of code and > probably can advice something([~vanzin], [~skonto], [~susanxhuynh]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365438#comment-16365438 ] Igor Berman commented on SPARK-23423: - [~skonto] do you think it could be connected to https://issues.apache.org/jira/browse/SPARK-21460 ? That JIRA concerns YARN, but it seems like it might be the same cause > Application declines any offers when killed+active executors rich > spark.dynamicAllocation.maxExecutors > -- > > Key: SPARK-23423 > URL: https://issues.apache.org/jira/browse/SPARK-23423 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.2.1 >Reporter: Igor Berman >Priority: Major > > Hi > Mesos Version:1.1.0 > I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend > when running on Mesos with dynamic allocation on and limiting number of max > executors by spark.dynamicAllocation.maxExecutors. > Suppose we have long running driver that has cyclic pattern of resource > consumption(with some idle times in between), due to dyn.allocation it > receives offers and then releases them after current chunk of work processed. > Since at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573] > the backend compares numExecutors < executorLimit and > numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves > holds all slaves ever "met", i.e. 
both active and killed (see comment > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)] > > On the other hand, number of taskIds should be updated due to statusUpdate, > but suppose this update is lost(actually I don't see logs of 'is now > TASK_KILLED') so this number of executors might be wrong > > I've created test that "reproduces" this behavior, not sure how good it is: > {code:java} > //MesosCoarseGrainedSchedulerBackendSuite > test("max executors registered stops to accept offers when dynamic allocation > enabled") { > setBackend(Map( > "spark.dynamicAllocation.maxExecutors" -> "1", > "spark.dynamicAllocation.enabled" -> "true", > "spark.dynamicAllocation.testing" -> "true")) > backend.doRequestTotalExecutors(1) > val (mem, cpu) = (backend.executorMemory(sc), 4) > val offer1 = createOffer("o1", "s1", mem, cpu) > backend.resourceOffers(driver, List(offer1).asJava) > verifyTaskLaunched(driver, "o1") > backend.doKillExecutors(List("0")) > verify(driver, times(1)).killTask(createTaskId("0")) > val offer2 = createOffer("o2", "s2", mem, cpu) > backend.resourceOffers(driver, List(offer2).asJava) > verify(driver, times(1)).declineOffer(offer2.getId) > }{code} > > > Workaround: Don't set maxExecutors with dynamicAllocation on > > Please advice > Igor > marking you friends since you were last to touch this piece of code and > probably can advice something([~vanzin], [~skonto], [~susanxhuynh]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365436#comment-16365436 ] Igor Berman commented on SPARK-23423: - [~skonto], yes, that is correct. Besides the (many) TASK_RUNNING reports, there are only 2 TASK_FAILURE reports present (there are no FINISHED or LOST tasks). We have a cluster with approx 100 machines, and for this specific framework many executors were running (40 or so); many of them were killed during scale-down (>> 2). I can see in the Mesos UI that there are completed tasks with KILLED state for this application. The question is: where are all the other reports? Can you think of any reason they wouldn't be reported, or the driver wouldn't register them? > Application declines any offers when killed+active executors rich > spark.dynamicAllocation.maxExecutors > -- > > Key: SPARK-23423 > URL: https://issues.apache.org/jira/browse/SPARK-23423 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.2.1 >Reporter: Igor Berman >Priority: Major > > Hi > Mesos Version:1.1.0 > I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend > when running on Mesos with dynamic allocation on and limiting number of max > executors by spark.dynamicAllocation.maxExecutors. > Suppose we have long running driver that has cyclic pattern of resource > consumption(with some idle times in between), due to dyn.allocation it > receives offers and then releases them after current chunk of work processed. > Since at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573] > the backend compares numExecutors < executorLimit and > numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves > holds all slaves ever "met", i.e. 
both active and killed (see comment > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)] > > On the other hand, number of taskIds should be updated due to statusUpdate, > but suppose this update is lost(actually I don't see logs of 'is now > TASK_KILLED') so this number of executors might be wrong > > I've created test that "reproduces" this behavior, not sure how good it is: > {code:java} > //MesosCoarseGrainedSchedulerBackendSuite > test("max executors registered stops to accept offers when dynamic allocation > enabled") { > setBackend(Map( > "spark.dynamicAllocation.maxExecutors" -> "1", > "spark.dynamicAllocation.enabled" -> "true", > "spark.dynamicAllocation.testing" -> "true")) > backend.doRequestTotalExecutors(1) > val (mem, cpu) = (backend.executorMemory(sc), 4) > val offer1 = createOffer("o1", "s1", mem, cpu) > backend.resourceOffers(driver, List(offer1).asJava) > verifyTaskLaunched(driver, "o1") > backend.doKillExecutors(List("0")) > verify(driver, times(1)).killTask(createTaskId("0")) > val offer2 = createOffer("o2", "s2", mem, cpu) > backend.resourceOffers(driver, List(offer2).asJava) > verify(driver, times(1)).declineOffer(offer2.getId) > }{code} > > > Workaround: Don't set maxExecutors with dynamicAllocation on > > Please advice > Igor > marking you friends since you were last to touch this piece of code and > probably can advice something([~vanzin], [~skonto], [~susanxhuynh]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23422) YarnShuffleIntegrationSuite failure when SPARK_PREPEND_CLASSES set to 1
[ https://issues.apache.org/jira/browse/SPARK-23422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23422: -- Assignee: Gabor Somogyi > YarnShuffleIntegrationSuite failure when SPARK_PREPEND_CLASSES set to 1 > --- > > Key: SPARK-23422 > URL: https://issues.apache.org/jira/browse/SPARK-23422 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > Fix For: 2.3.1, 2.4.0 > > > YarnShuffleIntegrationSuite fails when SPARK_PREPEND_CLASSES set to 1. > Normally mllib built before yarn module. When SPARK_PREPEND_CLASSES used > mllib classes are on yarn test classpath. > Before 2.3 that did not cause issues. But 2.3 has SPARK-22450, which > registered some mllib classes with the kryo serializer. Now it dies with the > following error: > {code:java} > 18/02/13 07:33:29 INFO SparkContext: Starting job: collect at > YarnShuffleIntegrationSuite.scala:143 > Exception in thread "dag-scheduler-event-loop" > java.lang.NoClassDefFoundError: breeze/linalg/DenseMatrix > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23422) YarnShuffleIntegrationSuite failure when SPARK_PREPEND_CLASSES set to 1
[ https://issues.apache.org/jira/browse/SPARK-23422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23422. Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 20608 [https://github.com/apache/spark/pull/20608] > YarnShuffleIntegrationSuite failure when SPARK_PREPEND_CLASSES set to 1 > --- > > Key: SPARK-23422 > URL: https://issues.apache.org/jira/browse/SPARK-23422 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > Fix For: 2.4.0, 2.3.1 > > > YarnShuffleIntegrationSuite fails when SPARK_PREPEND_CLASSES set to 1. > Normally mllib built before yarn module. When SPARK_PREPEND_CLASSES used > mllib classes are on yarn test classpath. > Before 2.3 that did not cause issues. But 2.3 has SPARK-22450, which > registered some mllib classes with the kryo serializer. Now it dies with the > following error: > {code:java} > 18/02/13 07:33:29 INFO SparkContext: Starting job: collect at > YarnShuffleIntegrationSuite.scala:143 > Exception in thread "dag-scheduler-event-loop" > java.lang.NoClassDefFoundError: breeze/linalg/DenseMatrix > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365295#comment-16365295 ] Stavros Kontopoulos commented on SPARK-23423: - [~igor.berman] From the log I see that the scheduler received two task updates with state TASK_FAILURE, is that correct? The executor removal is triggered if the status of the task received belongs to any of the following states: FINISHED, FAILED, KILLED, LOST. [https://github.com/apache/spark/blob/ed8647609883fcef16be5d24c2cb4ebda25bd6f0/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L640] [https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/TaskState.scala#L22] > Application declines any offers when killed+active executors rich > spark.dynamicAllocation.maxExecutors > -- > > Key: SPARK-23423 > URL: https://issues.apache.org/jira/browse/SPARK-23423 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.2.1 >Reporter: Igor Berman >Priority: Major > > Hi > Mesos Version:1.1.0 > I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend > when running on Mesos with dynamic allocation on and limiting number of max > executors by spark.dynamicAllocation.maxExecutors. > Suppose we have long running driver that has cyclic pattern of resource > consumption(with some idle times in between), due to dyn.allocation it > receives offers and then releases them after current chunk of work processed. > Since at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573] > the backend compares numExecutors < executorLimit and > numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves > holds all slaves ever "met", i.e. 
both active and killed (see comment > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)] > > On the other hand, number of taskIds should be updated due to statusUpdate, > but suppose this update is lost(actually I don't see logs of 'is now > TASK_KILLED') so this number of executors might be wrong > > I've created test that "reproduces" this behavior, not sure how good it is: > {code:java} > //MesosCoarseGrainedSchedulerBackendSuite > test("max executors registered stops to accept offers when dynamic allocation > enabled") { > setBackend(Map( > "spark.dynamicAllocation.maxExecutors" -> "1", > "spark.dynamicAllocation.enabled" -> "true", > "spark.dynamicAllocation.testing" -> "true")) > backend.doRequestTotalExecutors(1) > val (mem, cpu) = (backend.executorMemory(sc), 4) > val offer1 = createOffer("o1", "s1", mem, cpu) > backend.resourceOffers(driver, List(offer1).asJava) > verifyTaskLaunched(driver, "o1") > backend.doKillExecutors(List("0")) > verify(driver, times(1)).killTask(createTaskId("0")) > val offer2 = createOffer("o2", "s2", mem, cpu) > backend.resourceOffers(driver, List(offer2).asJava) > verify(driver, times(1)).declineOffer(offer2.getId) > }{code} > > > Workaround: Don't set maxExecutors with dynamicAllocation on > > Please advice > Igor > marking you friends since you were last to touch this piece of code and > probably can advice something([~vanzin], [~skonto], [~susanxhuynh]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
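The terminal-state check Stavros points to can be sketched as follows. This is an illustrative model (the object and method names are simplified stand-ins, not the exact Spark internals): an executor slot is reclaimed only when the received task status is one of FINISHED, FAILED, KILLED, or LOST, so a dropped TASK_KILLED update means removal is never triggered.

```scala
// Illustrative sketch of the state check described in the linked code: only
// terminal task states trigger executor removal in the backend's statusUpdate
// handling. A status update that is never delivered therefore leaves the
// executor's task ID in place and the slot permanently occupied.
object TaskStateSketch {
  sealed trait TaskState
  case object Launching extends TaskState
  case object Running   extends TaskState
  case object Finished  extends TaskState
  case object Failed    extends TaskState
  case object Killed    extends TaskState
  case object Lost      extends TaskState

  // Terminal states, mirroring the FINISHED/FAILED/KILLED/LOST set cited above.
  private val terminal: Set[TaskState] = Set(Finished, Failed, Killed, Lost)

  // Removal fires only on a terminal state; RUNNING (or no update at all)
  // leaves the slot counted against the executor limit.
  def triggersExecutorRemoval(state: TaskState): Boolean =
    terminal.contains(state)
}
```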
[jira] [Created] (SPARK-23435) R tests should support latest testthat
Felix Cheung created SPARK-23435: Summary: R tests should support latest testthat Key: SPARK-23435 URL: https://issues.apache.org/jira/browse/SPARK-23435 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.3.1, 2.4.0 Reporter: Felix Cheung To follow up on SPARK-22817, the latest version of testthat, 2.0.0, was released in Dec 2017, and its API has changed. In order for our tests to keep working, we need to detect the installed version and call the appropriate method. Jenkins is running 1.0.1, though; we need to check whether that will still work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23359) Adds an alias 'names' of 'fieldNames' in Scala's StructType
[ https://issues.apache.org/jira/browse/SPARK-23359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23359. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20545 [https://github.com/apache/spark/pull/20545] > Adds an alias 'names' of 'fieldNames' in Scala's StructType > --- > > Key: SPARK-23359 > URL: https://issues.apache.org/jira/browse/SPARK-23359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.4.0 > > > See the discussion in SPARK-20090. It happened to deprecate {{names}} in > Python side but seems we better add an alias {{names}} in Scala side too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365263#comment-16365263 ] Felix Cheung edited comment on SPARK-22817 at 2/15/18 9:13 AM: --- I should have caught this - -we need to fix the test because it will fail in CRAN - another option is to fix the dependency version in DESCRIPTION file- scratch that. in CRAN we are calling test_package, which works fine. was (Author: felixcheung): I should have caught this - we need to fix the test because it will fail in CRAN - another option is to fix the dependency version in DESCRIPTION file > Use fixed testthat version for SparkR tests in AppVeyor > --- > > Key: SPARK-22817 > URL: https://issues.apache.org/jira/browse/SPARK-22817 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.2.2, 2.3.0 > > > We happened to access to the internal {{run_tests}} - > https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. > https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58 > This seems removed out in 2.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23359) Adds an alias 'names' of 'fieldNames' in Scala's StructType
[ https://issues.apache.org/jira/browse/SPARK-23359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23359: --- Assignee: Hyukjin Kwon > Adds an alias 'names' of 'fieldNames' in Scala's StructType > --- > > Key: SPARK-23359 > URL: https://issues.apache.org/jira/browse/SPARK-23359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.4.0 > > > See the discussion in SPARK-20090. It happened to deprecate {{names}} in > Python side but seems we better add an alias {{names}} in Scala side too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23366) Improve hot reading path in ReadAheadInputStream
[ https://issues.apache.org/jira/browse/SPARK-23366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23366. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20555 [https://github.com/apache/spark/pull/20555] > Improve hot reading path in ReadAheadInputStream > > > Key: SPARK-23366 > URL: https://issues.apache.org/jira/browse/SPARK-23366 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > Fix For: 2.4.0 > > > ReadAheadInputStream was introduced in > [apache/spark#18317|https://github.com/apache/spark/pull/18317] to optimize > reading spill files from disk. > However, investigating flamegraphs of profiles from investigating some > regressed workloads after switch to Spark 2.3, it seems that the hot path of > reading small amounts of data (like readInt) is inefficient - it involves > taking locks, and multiple checks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23366) Improve hot reading path in ReadAheadInputStream
[ https://issues.apache.org/jira/browse/SPARK-23366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23366: --- Assignee: Juliusz Sompolski > Improve hot reading path in ReadAheadInputStream > > > Key: SPARK-23366 > URL: https://issues.apache.org/jira/browse/SPARK-23366 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > Fix For: 2.4.0 > > > ReadAheadInputStream was introduced in > [apache/spark#18317|https://github.com/apache/spark/pull/18317] to optimize > reading spill files from disk. > However, investigating flamegraphs of profiles from investigating some > regressed workloads after switch to Spark 2.3, it seems that the hot path of > reading small amounts of data (like readInt) is inefficient - it involves > taking locks, and multiple checks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22817) Use fixed testthat version for SparkR tests in AppVeyor
[ https://issues.apache.org/jira/browse/SPARK-22817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365263#comment-16365263 ] Felix Cheung commented on SPARK-22817: -- I should have caught this - we need to fix the test because it will fail in CRAN - another option is to fix the dependency version in DESCRIPTION file > Use fixed testthat version for SparkR tests in AppVeyor > --- > > Key: SPARK-22817 > URL: https://issues.apache.org/jira/browse/SPARK-22817 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.2.2, 2.3.0 > > > We happened to access to the internal {{run_tests}} - > https://github.com/r-lib/testthat/blob/v1.0.2/R/test-package.R#L62-L75. > https://github.com/apache/spark/blob/master/R/pkg/tests/run-all.R#L58 > This seems removed out in 2.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions
[ https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23329: Assignee: Apache Spark > Update the function descriptions with the arguments and returned values of > the trigonometric functions > -- > > Key: SPARK-23329 > URL: https://issues.apache.org/jira/browse/SPARK-23329 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Minor > Labels: starter > > We need an update on the function descriptions for all the trigonometric > functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the > implementation is based on the java.lang.Math. We need a clear description > about the units of the input arguments and the returned values. > For example, the following descriptions are lacking such info. > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555 > https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions
[ https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23329: Assignee: (was: Apache Spark) > Update the function descriptions with the arguments and returned values of > the trigonometric functions > -- > > Key: SPARK-23329 > URL: https://issues.apache.org/jira/browse/SPARK-23329 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Minor > Labels: starter > > We need an update on the function descriptions for all the trigonometric > functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the > implementation is based on the java.lang.Math. We need a clear description > about the units of the input arguments and the returned values. > For example, the following descriptions are lacking such info. > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555 > https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions
[ https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365261#comment-16365261 ] Apache Spark commented on SPARK-23329: -- User 'misutoth' has created a pull request for this issue: https://github.com/apache/spark/pull/20618 > Update the function descriptions with the arguments and returned values of > the trigonometric functions > -- > > Key: SPARK-23329 > URL: https://issues.apache.org/jira/browse/SPARK-23329 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Minor > Labels: starter > > We need an update on the function descriptions for all the trigonometric > functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the > implementation is based on the java.lang.Math. We need a clear description > about the units of the input arguments and the returned values. > For example, the following descriptions are lacking such info. > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555 > https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23416) flaky test: org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress test for failOnDataLoss=false
[ https://issues.apache.org/jira/browse/SPARK-23416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-23416.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.1

Issue resolved by pull request 20605
[https://github.com/apache/spark/pull/20605]

> flaky test:
> org.apache.spark.sql.kafka010.KafkaSourceStressForDontFailOnDataLossSuite.stress
> test for failOnDataLoss=false
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-23416
>                 URL: https://issues.apache.org/jira/browse/SPARK-23416
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>            Reporter: Jose Torres
>            Priority: Minor
>             Fix For: 2.3.1
>
> I suspect this is a race condition latent in the DataSourceV2 write path, or
> at least the interaction of that write path with StreamTest.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87241/testReport/org.apache.spark.sql.kafka010/KafkaSourceStressForDontFailOnDataLossSuite/stress_test_for_failOnDataLoss_false/]
> h3. Error Message
> org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 16b2a2b1-acdd-44ec-902f-531169193169, runId = 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job aborted.
> h3. Stacktrace
> sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: Query [id = 16b2a2b1-acdd-44ec-902f-531169193169, runId = 9567facb-e305-4554-8622-830519002edb] terminated with exception: Writing job aborted.
>   at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
>   at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: sbt.ForkMain$ForkError: org.apache.spark.SparkException: Writing job aborted.
>   at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:108)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
>   at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:294)
>   at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
>   at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
>   at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
>   at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
>   at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$15.apply(MicroBatchExecution.scala:488)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:483)
>   at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:482)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>   at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>   at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at
[jira] [Resolved] (SPARK-23419) data source v2 write path should re-throw interruption exceptions directly
[ https://issues.apache.org/jira/browse/SPARK-23419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-23419.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.1

Issue resolved by pull request 20605
[https://github.com/apache/spark/pull/20605]

> data source v2 write path should re-throw interruption exceptions directly
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-23419
>                 URL: https://issues.apache.org/jira/browse/SPARK-23419
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>             Fix For: 2.3.1
>
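[Editor's note] The notification does not include the patch itself, but the pattern named in the summary can be sketched as follows. This is a hypothetical, standalone illustration (the class and method names are invented, not Spark's): a write path that wraps every failure in a generic "job aborted" exception will also swallow interruptions, so cancellation-related exceptions should be detected and re-thrown directly instead of being wrapped.

```java
public class RethrowInterrupts {
    /**
     * Runs a task, wrapping ordinary failures in a generic error while
     * letting interruption-caused exceptions propagate unwrapped.
     */
    static void runJob(Runnable task) {
        try {
            task.run();
        } catch (RuntimeException e) {
            if (e.getCause() instanceof InterruptedException) {
                // An interruption means the caller is cancelling the query;
                // restore the interrupt flag and re-throw the original
                // exception directly rather than masking it.
                Thread.currentThread().interrupt();
                throw e;
            }
            // Only genuine task failures get the generic wrapper.
            throw new RuntimeException("Writing job aborted.", e);
        }
    }

    public static void main(String[] args) {
        try {
            runJob(() -> { throw new IllegalStateException("task failed"); });
        } catch (RuntimeException e) {
            System.out.println(e.getMessage()); // prints the generic wrapper message
        }
    }
}
```

Without the instanceof check, a cancelled streaming query would surface as the misleading "Writing job aborted." error seen in SPARK-23416 rather than as a clean interruption.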
[jira] [Updated] (SPARK-23413) Sorting tasks by Host / Executor ID on the Stage page does not work
[ https://issues.apache.org/jira/browse/SPARK-23413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin updated SPARK-23413:
-----------------------------------
    Priority: Blocker (was: Major)

> Sorting tasks by Host / Executor ID on the Stage page does not work
> --------------------------------------------------------------------
>
>                 Key: SPARK-23413
>                 URL: https://issues.apache.org/jira/browse/SPARK-23413
>             Project: Spark
>          Issue Type: Bug
>          Components: Web UI
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Attila Zsolt Piros
>            Priority: Blocker
>
> Sorting tasks by Host / Executor ID throws exceptions:
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Executor ID
>   at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017)
>   at org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694)
>   at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61)
>   at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96)
>   at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708)
>   at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293)
>   at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
>   at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
> {code}
> {code}
> java.lang.IllegalArgumentException: Invalid sort column: Host
>   at org.apache.spark.ui.jobs.ApiHelper$.indexName(StagePage.scala:1017)
>   at org.apache.spark.ui.jobs.TaskDataSource.sliceData(StagePage.scala:694)
>   at org.apache.spark.ui.PagedDataSource.pageData(PagedTable.scala:61)
>   at org.apache.spark.ui.PagedTable$class.table(PagedTable.scala:96)
>   at org.apache.spark.ui.jobs.TaskPagedTable.table(StagePage.scala:708)
>   at org.apache.spark.ui.jobs.StagePage.liftedTree1$1(StagePage.scala:293)
>   at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:282)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
>   at org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:82)
>   at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)
>   at
> {code}
> !image-2018-02-13-16-50-32-600.png!