[jira] [Resolved] (SPARK-28204) Make separate two test cases for column pruning in binary files
[ https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28204. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25003 [https://github.com/apache/spark/pull/25003] > Make separate two test cases for column pruning in binary files > --- > > Key: SPARK-28204 > URL: https://issues.apache.org/jira/browse/SPARK-28204 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 3.0.0 > > > SPARK-27534 failed to address my own comments at > https://github.com/WeichenXu123/spark/pull/8 > It's better to push this in since the code is already cleaned up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28204) Make separate two test cases for column pruning in binary files
[ https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28204: Assignee: Hyukjin Kwon > Make separate two test cases for column pruning in binary files > --- > > Key: SPARK-28204 > URL: https://issues.apache.org/jira/browse/SPARK-28204 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > > SPARK-27534 failed to address my own comments at > https://github.com/WeichenXu123/spark/pull/8 > It's better to push this in since the code is already cleaned up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28208) When upgrading to ORC 1.5.6, the reader needs to be closed.
[ https://issues.apache.org/jira/browse/SPARK-28208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875335#comment-16875335 ] Dongjoon Hyun commented on SPARK-28208: --- As I commented on ORC-525, this is an unexpected behavior change for users in a bug-fix release. > Why do we enforce such a behavior change in a bug-fix release from 1.5.5 to > 1.5.6? > When upgrading to ORC 1.5.6, the reader needs to be closed. > --- > > Key: SPARK-28208 > URL: https://issues.apache.org/jira/browse/SPARK-28208 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Owen O'Malley >Priority: Major > > As part of the ORC 1.5.6 release, we optimized the common pattern of: > {code:java} > Reader reader = OrcFile.createReader(...); > RecordReader rows = reader.rows(...);{code} > which used to open one file handle in the Reader and a second one in the > RecordReader. Users were seeing this as a regression when moving from the old > Spark ORC reader via Hive to the new native reader, because it opened twice > as many files on the NameNode. > In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in > the Reader until it is either closed or a RecordReader is created from it. > This has cut down the number of file open requests on the NameNode by half in > typical Spark applications. (Hive's ORC code avoided this problem by putting > the file footer into the input splits, but that has other problems.) > To get the new optimization without leaking file handles, Spark needs to > close the readers that aren't used to create RecordReaders. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
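The close discipline the description asks for can be sketched as follows. This is a minimal illustration, not Spark's actual patch: it assumes ORC 1.5.6, where the Reader holds the file handle and supports close(), and the helper name is hypothetical.
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

// Hypothetical helper: a Reader used only for its footer metadata never hands
// its file handle off to a RecordReader, so it must be closed explicitly.
def readSchemaOnly(path: Path, conf: Configuration): String = {
  val reader = OrcFile.createReader(path, OrcFile.readerOptions(conf))
  try {
    reader.getSchema.toString // footer-only access; no RecordReader is created
  } finally {
    reader.close()            // without this, the file handle leaks on ORC 1.5.6
  }
}
{code}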
[jira] [Created] (SPARK-28214) Flaky test: org.apache.spark.streaming.CheckpointSuite.basic rdd checkpoints + dstream graph checkpoint recovery
Marcelo Vanzin created SPARK-28214: -- Summary: Flaky test: org.apache.spark.streaming.CheckpointSuite.basic rdd checkpoints + dstream graph checkpoint recovery Key: SPARK-28214 URL: https://issues.apache.org/jira/browse/SPARK-28214 Project: Spark Issue Type: Bug Components: DStreams, Tests Affects Versions: 3.0.0 Reporter: Marcelo Vanzin This test has failed a few times in some PRs. Example of a failure: {noformat} Error Message org.scalatest.exceptions.TestFailedException: Map() was empty No checkpointed RDDs in state stream before first failure Stacktrace sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: Map() was empty No checkpointed RDDs in state stream before first failure at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528) at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) at org.apache.spark.streaming.CheckpointSuite.$anonfun$new$3(CheckpointSuite.scala:266) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149) at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56) at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) at org.apache.spark.streaming.CheckpointSuite.org$scalatest$BeforeAndAfter$$super$runTest(CheckpointSuite.scala:209) {noformat} On top of that, when this failure happens, the test leaves a running {{SparkContext}} behind, which makes every single unit test run after it on that project fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28206: Target Version/s: 3.0.0 > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > Attachments: Screen Shot 2019-06-28 at 9.55.13 AM.png > > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28200) Decimal overflow handling in ExpressionEncoder
[ https://issues.apache.org/jira/browse/SPARK-28200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-28200: --- Summary: Decimal overflow handling in ExpressionEncoder (was: Overflow handling in `ExpressionEncoder`) > Decimal overflow handling in ExpressionEncoder > -- > > Key: SPARK-28200 > URL: https://issues.apache.org/jira/browse/SPARK-28200 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Marco Gaido >Priority: Major > > As pointed out in https://github.com/apache/spark/pull/20350, we are > currently not checking for overflow when serializing a Java/Scala > `BigDecimal` in `ExpressionEncoder` / `ScalaReflection`. > We should add this check there too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
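To make the reported gap concrete, here is an illustrative sketch of the kind of guard the encoder path is missing (a hypothetical helper, not the eventual patch), using Spark's Decimal.changePrecision, which returns false when a value does not fit the target precision and scale:
{code:scala}
import org.apache.spark.sql.types.Decimal

// Hypothetical guard: fail loudly instead of silently accepting a BigDecimal
// that cannot be represented as Decimal(precision, scale).
def toSparkDecimal(d: BigDecimal, precision: Int = 38, scale: Int = 18): Decimal = {
  val dec = Decimal(d)
  if (!dec.changePrecision(precision, scale)) {
    throw new ArithmeticException(s"$d cannot be represented as Decimal($precision, $scale)")
  }
  dec
}
{code}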
[jira] [Assigned] (SPARK-28213) Remove duplication between columnar and ColumnarBatchScan
[ https://issues.apache.org/jira/browse/SPARK-28213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28213: Assignee: Apache Spark > Remove duplication between columnar and ColumnarBatchScan > - > > Key: SPARK-28213 > URL: https://issues.apache.org/jira/browse/SPARK-28213 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Assignee: Apache Spark >Priority: Major > > There is a lot of duplicate code between Columnar.scala and > ColumnarBatchScan. This should fix that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28213) Remove duplication between columnar and ColumnarBatchScan
[ https://issues.apache.org/jira/browse/SPARK-28213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28213: Assignee: (was: Apache Spark) > Remove duplication between columnar and ColumnarBatchScan > - > > Key: SPARK-28213 > URL: https://issues.apache.org/jira/browse/SPARK-28213 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Priority: Major > > There is a lot of duplicate code between Columnar.scala and > ColumnarBatchScan. This should fix that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28213) Remove duplication between columnar and ColumnarBatchScan
Robert Joseph Evans created SPARK-28213: --- Summary: Remove duplication between columnar and ColumnarBatchScan Key: SPARK-28213 URL: https://issues.apache.org/jira/browse/SPARK-28213 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Robert Joseph Evans There is a lot of duplicate code between Columnar.scala and ColumnarBatchScan. This should fix that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28192) Data Source - State - Write side
[ https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875198#comment-16875198 ] Jungtaek Lim commented on SPARK-28192: -- Totally interested! Once I decide to implement this with DSv2, SPARK-23889 is a blocker for this issue, so the sooner the better. If SPARK-23889 requires huge effort and could become a blocker for Spark 3.0, then I could wait or try to go ahead with DSv1. SPARK-23889 was filed one year ago (with discussion prior to filing the issue), so it may be better to take it sooner. > Data Source - State - Write side > > > Key: SPARK-28192 > URL: https://issues.apache.org/jira/browse/SPARK-28192 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the efforts on addressing batch write on the state data source. > It could include "state repartition" if it doesn't require huge effort for the > new DSv2, but it can also be moved out to a separate issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27945) Make minimal changes to support columnar processing
[ https://issues.apache.org/jira/browse/SPARK-27945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-27945: - Assignee: Robert Joseph Evans > Make minimal changes to support columnar processing > --- > > Key: SPARK-27945 > URL: https://issues.apache.org/jira/browse/SPARK-27945 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Assignee: Robert Joseph Evans >Priority: Major > > As the first step for SPARK-27396 this is to put in the minimum changes > needed to allow a plugin to support columnar processing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27945) Make minimal changes to support columnar processing
[ https://issues.apache.org/jira/browse/SPARK-27945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-27945. --- Resolution: Fixed Fix Version/s: 3.0.0 > Make minimal changes to support columnar processing > --- > > Key: SPARK-27945 > URL: https://issues.apache.org/jira/browse/SPARK-27945 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Robert Joseph Evans >Assignee: Robert Joseph Evans >Priority: Major > Fix For: 3.0.0 > > > As the first step for SPARK-27396 this is to put in the minimum changes > needed to allow a plugin to support columnar processing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28209) Shuffle Storage API: Writes
[ https://issues.apache.org/jira/browse/SPARK-28209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28209: Assignee: Apache Spark > Shuffle Storage API: Writes > --- > > Key: SPARK-28209 > URL: https://issues.apache.org/jira/browse/SPARK-28209 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Assignee: Apache Spark >Priority: Major > > Adds the write-side API for storing shuffle data in arbitrary storage > systems. Also refactor the existing shuffle write code so that it uses this > API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28209) Shuffle Storage API: Writes
[ https://issues.apache.org/jira/browse/SPARK-28209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28209: Assignee: (was: Apache Spark) > Shuffle Storage API: Writes > --- > > Key: SPARK-28209 > URL: https://issues.apache.org/jira/browse/SPARK-28209 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Matt Cheah >Priority: Major > > Adds the write-side API for storing shuffle data in arbitrary storage > systems. Also refactor the existing shuffle write code so that it uses this > API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28212) Shuffle Storage API: Shuffle Cleanup
Matt Cheah created SPARK-28212: -- Summary: Shuffle Storage API: Shuffle Cleanup Key: SPARK-28212 URL: https://issues.apache.org/jira/browse/SPARK-28212 Project: Spark Issue Type: Sub-task Components: Shuffle Affects Versions: 3.0.0 Reporter: Matt Cheah In the shuffle storage API, there should be a plugin point for removing shuffles that are no longer used. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28211) Shuffle Storage API: Driver Lifecycle
Matt Cheah created SPARK-28211: -- Summary: Shuffle Storage API: Driver Lifecycle Key: SPARK-28211 URL: https://issues.apache.org/jira/browse/SPARK-28211 Project: Spark Issue Type: Sub-task Components: Shuffle Affects Versions: 3.0.0 Reporter: Matt Cheah As part of the shuffle storage API, allow users to hook in application-wide startup and shutdown methods. This can do things like create tables in the shuffle storage database, or register / unregister against file servers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
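One plausible shape for these hooks, sketched for illustration only (the trait and method names are hypothetical; the API was not settled at the time of filing):
{code:scala}
// Driver-side lifecycle plugin point for a shuffle storage backend.
trait ShuffleDriverComponents {
  /** Called once at application startup, e.g. to create tables in the
   *  shuffle storage database or register against a file server. */
  def initializeApplication(): Unit

  /** Called once at application shutdown, e.g. to drop per-app state or
   *  unregister from the file server. */
  def cleanupApplication(): Unit
}
{code}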
[jira] [Created] (SPARK-28210) Shuffle Storage API: Reads
Matt Cheah created SPARK-28210: -- Summary: Shuffle Storage API: Reads Key: SPARK-28210 URL: https://issues.apache.org/jira/browse/SPARK-28210 Project: Spark Issue Type: Sub-task Components: Shuffle Affects Versions: 3.0.0 Reporter: Matt Cheah As part of the effort to store shuffle data in arbitrary places, this issue tracks implementing an API for reading the shuffle data stored by the write API. Also ensure that the existing shuffle implementation is refactored to use the API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah updated SPARK-25299: --- Description: In Spark, the shuffle primitive requires Spark executors to persist data to the local disk of the worker nodes. If executors crash, the external shuffle service can continue to serve the shuffle data that was written beyond the lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the external shuffle service is deployed on every worker node. The shuffle service shares local disk with the executors that run on its node. There are some shortcomings with the way shuffle is fundamentally implemented right now. Particularly: * If any external shuffle service process or node becomes unavailable, all applications that had an executor that ran on that node must recompute the shuffle blocks that were lost. * Similarly to the above, the external shuffle service must be kept running at all times, which may waste resources when no applications are using that shuffle service node. * Mounting local storage can prevent users from taking advantage of desirable isolation benefits from using containerized environments, like Kubernetes. We had an external shuffle service implementation in an early prototype of the Kubernetes backend, but it was rejected due to its strict requirement to be able to mount hostPath volumes or other persistent volume setups. In the following [architecture discussion document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] (note: _not_ an SPIP), we brainstorm various high level architectures for improving the external shuffle service in a way that addresses the above problems. The purpose of this umbrella JIRA is to promote additional discussion on how we can approach these problems, both at the architecture level and the implementation level. We anticipate filing sub-issues that break down the tasks that must be completed to achieve this goal. Edit June 28 2019: Our SPIP is here: [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit] was: In Spark, the shuffle primitive requires Spark executors to persist data to the local disk of the worker nodes. If executors crash, the external shuffle service can continue to serve the shuffle data that was written beyond the lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the external shuffle service is deployed on every worker node. The shuffle service shares local disk with the executors that run on its node. There are some shortcomings with the way shuffle is fundamentally implemented right now. Particularly: * If any external shuffle service process or node becomes unavailable, all applications that had an executor that ran on that node must recompute the shuffle blocks that were lost. * Similarly to the above, the external shuffle service must be kept running at all times, which may waste resources when no applications are using that shuffle service node. * Mounting local storage can prevent users from taking advantage of desirable isolation benefits from using containerized environments, like Kubernetes. We had an external shuffle service implementation in an early prototype of the Kubernetes backend, but it was rejected due to its strict requirement to be able to mount hostPath volumes or other persistent volume setups. 
In the following [architecture discussion document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] (note: _not_ an SPIP), we brainstorm various high level architectures for improving the external shuffle service in a way that addresses the above problems. The purpose of this umbrella JIRA is to promote additional discussion on how we can approach these problems, both at the architecture level and the implementation level. We anticipate filing sub-issues that break down the tasks that must be completed to achieve this goal. > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Major > Labels: SPIP > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its
[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875106#comment-16875106 ] Matt Cheah commented on SPARK-25299: I also noticed the SPIP document wasn't ever posted on this ticket, so sorry about that! Here's the link for everyone who wasn't following along on the mailing list: [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit] > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Major > Labels: SPIP > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. > * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. > In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28208) When upgrading to ORC 1.5.6, the reader needs to be closed.
[ https://issues.apache.org/jira/browse/SPARK-28208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28208: Assignee: (was: Apache Spark) > When upgrading to ORC 1.5.6, the reader needs to be closed. > --- > > Key: SPARK-28208 > URL: https://issues.apache.org/jira/browse/SPARK-28208 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Owen O'Malley >Priority: Major > > As part of the ORC 1.5.6 release, we optimized the common pattern of: > {code:java} > Reader reader = OrcFile.createReader(...); > RecordReader rows = reader.rows(...);{code} > which used to open one file handle in the Reader and a second one in the > RecordReader. Users were seeing this as a regression when moving from the old > Spark ORC reader via Hive to the new native reader, because it opened twice > as many files on the NameNode. > In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in > the Reader until it is either closed or a RecordReader is created from it. > This has cut down the number of file open requests on the NameNode by half in > typical Spark applications. (Hive's ORC code avoided this problem by putting > the file footer into the input splits, but that has other problems.) > To get the new optimization without leaking file handles, Spark needs to > close the readers that aren't used to create RecordReaders. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28208) When upgrading to ORC 1.5.6, the reader needs to be closed.
[ https://issues.apache.org/jira/browse/SPARK-28208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28208: Assignee: Apache Spark > When upgrading to ORC 1.5.6, the reader needs to be closed. > --- > > Key: SPARK-28208 > URL: https://issues.apache.org/jira/browse/SPARK-28208 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Owen O'Malley >Assignee: Apache Spark >Priority: Major > > As part of the ORC 1.5.6 release, we optimized the common pattern of: > {code:java} > Reader reader = OrcFile.createReader(...); > RecordReader rows = reader.rows(...);{code} > which used to open one file handle in the Reader and a second one in the > RecordReader. Users were seeing this as a regression when moving from the old > Spark ORC reader via Hive to the new native reader, because it opened twice > as many files on the NameNode. > In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in > the Reader until it is either closed or a RecordReader is created from it. > This has cut down the number of file open requests on the NameNode by half in > typical Spark applications. (Hive's ORC code avoided this problem by putting > the file footer into the input splits, but that has other problems.) > To get the new optimization without leaking file handles, Spark needs to > close the readers that aren't used to create RecordReaders. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28209) Shuffle Storage API: Writes
Matt Cheah created SPARK-28209: -- Summary: Shuffle Storage API: Writes Key: SPARK-28209 URL: https://issues.apache.org/jira/browse/SPARK-28209 Project: Spark Issue Type: Sub-task Components: Shuffle Affects Versions: 3.0.0 Reporter: Matt Cheah Adds the write-side API for storing shuffle data in arbitrary storage systems. Also refactor the existing shuffle write code so that it uses this API. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875101#comment-16875101 ] Matt Cheah commented on SPARK-25299: Let's start by making sub-issues. I have a patch staged for master I intend to post by end of day. > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Major > Labels: SPIP > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. > * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. > In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28208) When upgrading to ORC 1.5.6, the reader needs to be closed.
Owen O'Malley created SPARK-28208: - Summary: When upgrading to ORC 1.5.6, the reader needs to be closed. Key: SPARK-28208 URL: https://issues.apache.org/jira/browse/SPARK-28208 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Owen O'Malley As part of the ORC 1.5.6 release, we optimized the common pattern of: {code:java} Reader reader = OrcFile.createReader(...); RecordReader rows = reader.rows(...);{code} which used to open one file handle in the Reader and a second one in the RecordReader. Users were seeing this as a regression when moving from the old Spark ORC reader via Hive to the new native reader, because it opened twice as many files on the NameNode. In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in the Reader until it is either closed or a RecordReader is created from it. This has cut down the number of file open requests on the NameNode by half in typical Spark applications. (Hive's ORC code avoided this problem by putting the file footer into the input splits, but that has other problems.) To get the new optimization without leaking file handles, Spark needs to close the readers that aren't used to create RecordReaders. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-28114) Add Jenkins job for `Hadoop-3.2` profile
[ https://issues.apache.org/jira/browse/SPARK-28114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shane knapp closed SPARK-28114. --- > Add Jenkins job for `Hadoop-3.2` profile > > > Key: SPARK-28114 > URL: https://issues.apache.org/jira/browse/SPARK-28114 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: shane knapp >Priority: Major > > Spark 3.0 is a major version change. We want to have the following new Jobs. > 1. SBT with hadoop-3.2 > 2. Maven with hadoop-3.2 (on JDK8 and JDK11) > Also, shall we have a limit on concurrent runs for the following existing > job? Currently, it invokes multiple jobs concurrently. We can save > resources by limiting it to 1 like the other jobs. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-jdk-11-ubuntu-testing > We will drop four `branch-2.3` jobs at the end of August, 2019. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-22207) High memory usage when converting relational data to Hierarchical data
[ https://issues.apache.org/jira/browse/SPARK-22207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kanika dhuria reopened SPARK-22207: --- The same issue is seen in Spark 2.4. > High memory usage when converting relational data to Hierarchical data > -- > > Key: SPARK-22207 > URL: https://issues.apache.org/jira/browse/SPARK-22207 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: kanika dhuria >Priority: Major > Labels: bulk-closed > > Have 4 tables: > lineitems ~1.4GB, > orders ~330MB > customer ~47MB > nations ~2.2K > These tables are related as follows: > There are multiple lineitems per order (pk, fk:orderkey) > There are multiple orders per customer (pk, fk:cust_key) > There are multiple customers per nation (pk, fk:nation key) > Data is almost evenly distributed. > Building the hierarchy up to 3 levels, i.e. joining lineitems, orders, and customers, > works well with executor memory of 4GB/2 cores. > Adding nations requires 8GB/2 cores or 4GB/1 core memory. > == > {noformat} > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.functions._ > val sqlContext = SparkSession.builder() .enableHiveSupport() > .config("spark.sql.retainGroupColumns", false) > .config("spark.sql.crossJoin.enabled", true) .getOrCreate() > > val orders = sqlContext.sql("select * from orders") > val lineItem = sqlContext.sql("select * from lineitems") > > val customer = sqlContext.sql("select * from customers") > > val nation = sqlContext.sql("select * from nations") > > val lineitemOrders = > lineItem.groupBy(col("l_orderkey")).agg(col("l_orderkey"), > collect_list(struct(col("l_partkey"), > col("l_suppkey"),col("l_linenumber"),col("l_quantity"),col("l_extendedprice"),col("l_discount"),col("l_tax"),col("l_returnflag"),col("l_linestatus"),col("l_shipdate"),col("l_commitdate"),col("l_receiptdate"),col("l_shipinstruct"),col("l_shipmode"))).as("lineitem")).join(orders, > orders("O_ORDERKEY")=== lineItem("l_orderkey")).select(col("O_ORDERKEY"), > col("O_CUSTKEY"), col("O_ORDERSTATUS"), col("O_TOTALPRICE"), > col("O_ORDERDATE"), col("O_ORDERPRIORITY"), col("O_CLERK"), > col("O_SHIPPRIORITY"), col("O_COMMENT"), col("lineitem")) > > val customerList = > lineitemOrders.groupBy(col("o_custkey")).agg(col("o_custkey"),collect_list(struct(col("O_ORDERKEY"), > col("O_CUSTKEY"), col("O_ORDERSTATUS"), col("O_TOTALPRICE"), > col("O_ORDERDATE"), col("O_ORDERPRIORITY"), col("O_CLERK"), > col("O_SHIPPRIORITY"), > col("O_COMMENT"),col("lineitem"))).as("items")).join(customer,customer("c_custkey")=== > > lineitemOrders("o_custkey")).select(col("c_custkey"),col("c_name"),col("c_nationkey"),col("items")) > val nationList = > customerList.groupBy(col("c_nationkey")).agg(col("c_nationkey"),collect_list(struct(col("c_custkey"),col("c_name"),col("c_nationkey"),col("items"))).as("custList")).join(nation,nation("n_nationkey")===customerList("c_nationkey")).select(col("n_nationkey"),col("n_name"),col("custList")) > > nationList.write.mode("overwrite").json("filePath") > {noformat} > > If the customerList is saved to a file and then the last agg/join is run > separately, it does run fine with 4GB/2 cores. > I can provide the data if needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-28207) https://rtatdotblog.wordpress.com/2019/05/30/rohit-travels-tours-rohit
[ https://issues.apache.org/jira/browse/SPARK-28207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin deleted SPARK-28207: --- > https://rtatdotblog.wordpress.com/2019/05/30/rohit-travels-tours-rohit > -- > > Key: SPARK-28207 > URL: https://issues.apache.org/jira/browse/SPARK-28207 > Project: Spark > Issue Type: Bug >Reporter: Roufique Hossain >Priority: Minor > Labels: http://schemas.xmlsoap.org/ws/2004/09/policy > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28207) https://rtatdotblog.wordpress.com/2019/05/30/rohit-travels-tours-rohit
Roufique Hossain created SPARK-28207: Summary: https://rtatdotblog.wordpress.com/2019/05/30/rohit-travels-tours-rohit Key: SPARK-28207 URL: https://issues.apache.org/jira/browse/SPARK-28207 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 2.4.3 Reporter: Roufique Hossain -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28206: -- Attachment: Screen Shot 2019-06-28 at 9.55.13 AM.png > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > Attachments: Screen Shot 2019-06-28 at 9.55.13 AM.png > > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28206: -- Issue Type: Bug (was: Documentation) > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Bug > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28206: -- Summary: "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc (was: "@pandas_udf" in doctest is rendered as ":pandas_udf" in html) > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html API doc > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28206) "@pandas_udf" in doctest is rendered as ":pandas_udf" in html
[ https://issues.apache.org/jira/browse/SPARK-28206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-28206: -- Summary: "@pandas_udf" in doctest is rendered as ":pandas_udf" in html (was: "@" is rendered as ":" in doctest) > "@pandas_udf" in doctest is rendered as ":pandas_udf" in html > - > > Key: SPARK-28206 > URL: https://issues.apache.org/jira/browse/SPARK-28206 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 2.4.1 >Reporter: Xiangrui Meng >Priority: Major > > Just noticed that in [pandas_udf API doc > |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], > "@pandas_udf" is rendered as ":pandas_udf". > cc: [~hyukjin.kwon] [~smilegator] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28206) "@" is rendered as ":" in doctest
Xiangrui Meng created SPARK-28206: - Summary: "@" is rendered as ":" in doctest Key: SPARK-28206 URL: https://issues.apache.org/jira/browse/SPARK-28206 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 2.4.1 Reporter: Xiangrui Meng Just noticed that in [pandas_udf API doc |https://spark.apache.org/docs/2.4.1/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf], "@pandas_udf" is rendered as ":pandas_udf". cc: [~hyukjin.kwon] [~smilegator] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28192) Data Source - State - Write side
[ https://issues.apache.org/jira/browse/SPARK-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875046#comment-16875046 ] Ryan Blue commented on SPARK-28192: --- It sounds like what you want is for a source to be able to communicate the required clustering and sort order for a write, is that correct? I opened an issue for this a while ago, but it probably won't be on the roadmap for Spark 3.0: SPARK-23889. We can do that sooner if you're interested in it! > Data Source - State - Write side > > > Key: SPARK-28192 > URL: https://issues.apache.org/jira/browse/SPARK-28192 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > This issue tracks the efforts on addressing batch write on the state data source. > It could include "state repartition" if it doesn't require huge effort for the > new DSv2, but it can also be moved out to a separate issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
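For illustration, the capability SPARK-23889 describes could look roughly like the following (hypothetical trait and member names, not an actual Spark API): a write-side source declares the clustering and per-partition ordering Spark must arrange before handing it rows, which is what a state store sink would need.
{code:scala}
// Hypothetical write-side contract: the source tells Spark how to shuffle
// and sort incoming rows (e.g. cluster by the state key) before the write.
trait RequiresDistributionAndOrdering {
  /** Column names the incoming data must be clustered (hash-partitioned) by. */
  def requiredClustering: Seq[String]

  /** Column names giving the sort order within each partition before writing. */
  def requiredOrdering: Seq[String]
}
{code}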
[jira] [Resolved] (SPARK-28145) Executor pods polling source can fail to replace dead executors
[ https://issues.apache.org/jira/browse/SPARK-28145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-28145. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24952 [https://github.com/apache/spark/pull/24952] > Executor pods polling source can fail to replace dead executors > --- > > Key: SPARK-28145 > URL: https://issues.apache.org/jira/browse/SPARK-28145 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Onur Satici >Assignee: Onur Satici >Priority: Minor > Fix For: 3.0.0 > > > The scheduled task responsible for reporting executor snapshots to the executor > allocator in Kubernetes will die on any error, killing subsequent runs of the > same task. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
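The failure mode described here matches the documented semantics of java.util.concurrent's scheduled executors: if one run of a periodic task throws, all subsequent runs are suppressed. A minimal sketch of the defensive pattern follows; the polling call is a hypothetical stand-in for the real snapshot reporting.
{code:scala}
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical stand-in for querying the Kubernetes API for executor pod state.
def pollExecutorPods(): Unit = { /* report a snapshot to the executor allocator */ }

val scheduler = Executors.newSingleThreadScheduledExecutor()
scheduler.scheduleWithFixedDelay(new Runnable {
  override def run(): Unit = {
    try {
      pollExecutorPods()
    } catch {
      // Catch everything: an uncaught Throwable would cancel the periodic task.
      case t: Throwable => System.err.println(s"Polling failed, will retry: $t")
    }
  }
}, 0, 30, TimeUnit.SECONDS)
{code}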
[jira] [Assigned] (SPARK-28145) Executor pods polling source can fail to replace dead executors
[ https://issues.apache.org/jira/browse/SPARK-28145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-28145: - Assignee: Onur Satici > Executor pods polling source can fail to replace dead executors > --- > > Key: SPARK-28145 > URL: https://issues.apache.org/jira/browse/SPARK-28145 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0, 2.4.3 >Reporter: Onur Satici >Assignee: Onur Satici >Priority: Minor > > The scheduled task responsible for reporting executor snapshots to the executor > allocator in Kubernetes will die on any error, killing subsequent runs of the > same task. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28205) useV1SourceList configuration should be for all data sources
[ https://issues.apache.org/jira/browse/SPARK-28205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28205: Assignee: Apache Spark > useV1SourceList configuration should be for all data sources > > > Key: SPARK-28205 > URL: https://issues.apache.org/jira/browse/SPARK-28205 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > > In the migration PR of Kafka V2: > https://github.com/apache/spark/pull/24738/files/ac16c9a9ef1c68db5aeda6c7001ae9abe96a358a#r298470645 > We find that the useV1SourceList > configurations (spark.sql.sources.read.useV1SourceList and > spark.sql.sources.write.useV1SourceList) should apply to all data sources, > instead of only file source V2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28205) useV1SourceList configuration should be for all data sources
[ https://issues.apache.org/jira/browse/SPARK-28205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28205: Assignee: (was: Apache Spark) > useV1SourceList configuration should be for all data sources > > > Key: SPARK-28205 > URL: https://issues.apache.org/jira/browse/SPARK-28205 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Minor > > In the migration PR of Kafka V2: > https://github.com/apache/spark/pull/24738/files/ac16c9a9ef1c68db5aeda6c7001ae9abe96a358a#r298470645 > We find that the useV1SourceList > configurations (spark.sql.sources.read.useV1SourceList and > spark.sql.sources.write.useV1SourceList) should apply to all data sources, > instead of only file source V2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28205) useV1SourceList configuration should be for all data sources
Gengliang Wang created SPARK-28205: -- Summary: useV1SourceList configuration should be for all data sources Key: SPARK-28205 URL: https://issues.apache.org/jira/browse/SPARK-28205 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang In the migration PR of Kafka V2: https://github.com/apache/spark/pull/24738/files/ac16c9a9ef1c68db5aeda6c7001ae9abe96a358a#r298470645 We find that the useV1SourceList configurations (spark.sql.sources.read.useV1SourceList and spark.sql.sources.write.useV1SourceList) should apply to all data sources, instead of only file source V2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
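Assuming the two listed configs take a comma-separated list of source short names (the value format here is an illustration, not confirmed by the ticket), usage would look like:
{code:scala}
// Given an active SparkSession `spark`, force the named sources back to V1
// on the read and write paths respectively.
spark.conf.set("spark.sql.sources.read.useV1SourceList", "csv,json,orc")
spark.conf.set("spark.sql.sources.write.useV1SourceList", "kafka")
{code}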
[jira] [Assigned] (SPARK-28204) Make separate two test cases for column pruning in binary files
[ https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28204: Assignee: (was: Apache Spark) > Make separate two test cases for column pruning in binary files > --- > > Key: SPARK-28204 > URL: https://issues.apache.org/jira/browse/SPARK-28204 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > SPARK-27534 failed to address my own comments at > https://github.com/WeichenXu123/spark/pull/8 > It's better to push this in since the code is already cleaned up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28204) Make separate two test cases for column pruning in binary files
[ https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28204: Assignee: Apache Spark > Make separate two test cases for column pruning in binary files > --- > > Key: SPARK-28204 > URL: https://issues.apache.org/jira/browse/SPARK-28204 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Trivial > > SPARK-27534 failed to address my own comments at > https://github.com/WeichenXu123/spark/pull/8 > It's better to push this in since the code is already cleaned up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28204) Make separate two test cases for column pruning in binary files
[ https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28204: - Description: SPARK-27534 failed to address my own comments at https://github.com/WeichenXu123/spark/pull/8 It's better to push this in since the code is already cleaned up. was: SPARK-27534 failed to address my own comments at https://github.com/HyukjinKwon?tab=overview&from=2019-04-01&to=2019-04-30 It's better to push this in since the code is already cleaned up. > Make separate two test cases for column pruning in binary files > --- > > Key: SPARK-28204 > URL: https://issues.apache.org/jira/browse/SPARK-28204 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > SPARK-27534 failed to address my own comments at > https://github.com/WeichenXu123/spark/pull/8 > It's better to push this in since the code is already cleaned up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28204) Make separate two test cases for column pruning in binary files
[ https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28204: - Issue Type: Test (was: New Feature) > Make separate two test cases for column pruning in binary files > --- > > Key: SPARK-28204 > URL: https://issues.apache.org/jira/browse/SPARK-28204 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > SPARK-27534 failed to address my own comments at > https://github.com/HyukjinKwon?tab=overview&from=2019-04-01&to=2019-04-30 > It's better to push this in since the code is already cleaned up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28204) Make separate two test cases for column pruning in binary files
[ https://issues.apache.org/jira/browse/SPARK-28204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28204: - Component/s: Tests > Make separate two test cases for column pruning in binary files > --- > > Key: SPARK-28204 > URL: https://issues.apache.org/jira/browse/SPARK-28204 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Trivial > > SPARK-27534 failed to address my own comments at > https://github.com/HyukjinKwon?tab=overview&from=2019-04-01&to=2019-04-30 > It's better to push this in since the code is already cleaned up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28204) Make separate two test cases for column pruning in binary files
Hyukjin Kwon created SPARK-28204: Summary: Make separate two test cases for column pruning in binary files Key: SPARK-28204 URL: https://issues.apache.org/jira/browse/SPARK-28204 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon SPARK-27534 failed to address my own comments at https://github.com/HyukjinKwon?tab=overview&from=2019-04-01&to=2019-04-30 It's better to push this in since the code is already cleaned up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28203) PythonRDD should respect SparkContext's conf when passing user confMap
[ https://issues.apache.org/jira/browse/SPARK-28203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28203: Assignee: (was: Apache Spark) > PythonRDD should respect SparkContext's conf when passing user confMap > -- > > Key: SPARK-28203 > URL: https://issues.apache.org/jira/browse/SPARK-28203 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.3 >Reporter: Xianjin YE >Priority: Minor > > PythonRDD has several APIs that accept user configs from the Python side. The > parameter is called confAsMap, and it is intended to be merged into the RDD's > Hadoop configuration. > However, confAsMap is first mapped to a fresh Configuration and then merged into > SparkContext's Hadoop configuration. The fresh Configuration loads default key > values from core-default.xml etc., and those keys may already have been updated > in SparkContext's Hadoop configuration, so the defaults override the updated > values during the merge. > I will submit a PR to fix this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
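A minimal Scala sketch of the merge problem described above; the helper name mergeUserConf is illustrative, not the actual PythonRDD code:
{code:scala}
import org.apache.hadoop.conf.Configuration

// Illustrative sketch only, not the actual PythonRDD code. Building a brand-new
// Configuration from confAsMap re-loads core-default.xml etc., so stale defaults
// ride along and can clobber keys already updated on the SparkContext's Hadoop
// configuration. Starting from the context's configuration and applying only the
// explicit user entries avoids that.
def mergeUserConf(base: Configuration, confAsMap: Map[String, String]): Configuration = {
  val merged = new Configuration(base)                   // keep the context's updated values
  confAsMap.foreach { case (k, v) => merged.set(k, v) }  // user entries take precedence
  merged
}
{code}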
[jira] [Assigned] (SPARK-28203) PythonRDD should respect SparkContext's conf when passing user confMap
[ https://issues.apache.org/jira/browse/SPARK-28203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28203: Assignee: Apache Spark > PythonRDD should respect SparkContext's conf when passing user confMap > -- > > Key: SPARK-28203 > URL: https://issues.apache.org/jira/browse/SPARK-28203 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.3 >Reporter: Xianjin YE >Assignee: Apache Spark >Priority: Minor > > PythonRDD has several APIs that accept user configs from the Python side. The > parameter is called confAsMap, and it is intended to be merged into the RDD's > Hadoop configuration. > However, confAsMap is first mapped to a fresh Configuration and then merged into > SparkContext's Hadoop configuration. The fresh Configuration loads default key > values from core-default.xml etc., and those keys may already have been updated > in SparkContext's Hadoop configuration, so the defaults override the updated > values during the merge. > I will submit a PR to fix this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28107) Interval type conversion syntax support
[ https://issues.apache.org/jira/browse/SPARK-28107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28107: Assignee: (was: Apache Spark) > Interval type conversion syntax support > --- > > Key: SPARK-28107 > URL: https://issues.apache.org/jira/browse/SPARK-28107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > According to the 2003 ANSI SQL standard, for interval type conversion Spark SQL > currently supports only > * Interval year to month > * Interval day to second > * Interval hour to second > There are some other syntaxes that are supported in both PostgreSQL and the > 2003 ANSI SQL standard: > * Interval day to hour > * Interval day to minute > * Interval hour to minute > * Interval minute to second -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
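A short Scala sketch of the qualifiers in question; the DAY TO HOUR and HOUR TO MINUTE statements are the proposed additions, so treat them as illustrative rather than guaranteed to parse on current builds:
{code:scala}
import org.apache.spark.sql.SparkSession

object IntervalSyntaxDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("interval-syntax").getOrCreate()

  // Qualifiers Spark SQL already accepts:
  spark.sql("SELECT INTERVAL '1-2' YEAR TO MONTH").show()
  spark.sql("SELECT INTERVAL '3 04:05:06' DAY TO SECOND").show()

  // Proposed additions (PostgreSQL / SQL:2003 forms; illustrative only):
  spark.sql("SELECT INTERVAL '3 04' DAY TO HOUR").show()
  spark.sql("SELECT INTERVAL '04:05' HOUR TO MINUTE").show()

  spark.stop()
}
{code}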
[jira] [Assigned] (SPARK-28107) Interval type conversion syntax support
[ https://issues.apache.org/jira/browse/SPARK-28107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28107: Assignee: Apache Spark > Interval type conversion syntax support > --- > > Key: SPARK-28107 > URL: https://issues.apache.org/jira/browse/SPARK-28107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Assignee: Apache Spark >Priority: Major > > According to the 2003 ANSI SQL standard, for interval type conversion Spark SQL > currently supports only > * Interval year to month > * Interval day to second > * Interval hour to second > There are some other syntaxes that are supported in both PostgreSQL and the > 2003 ANSI SQL standard: > * Interval day to hour > * Interval day to minute > * Interval hour to minute > * Interval minute to second -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28107) Interval type conversion syntax support
[ https://issues.apache.org/jira/browse/SPARK-28107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874842#comment-16874842 ] Apache Spark commented on SPARK-28107: -- User 'lipzhu' has created a pull request for this issue: https://github.com/apache/spark/pull/25000 > Interval type conversion syntax support > --- > > Key: SPARK-28107 > URL: https://issues.apache.org/jira/browse/SPARK-28107 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhu, Lipeng >Priority: Major > > According to the 2003 ANSI SQL standard, for interval type conversion Spark SQL > currently supports only > * Interval year to month > * Interval day to second > * Interval hour to second > There are some other syntaxes that are supported in both PostgreSQL and the > 2003 ANSI SQL standard: > * Interval day to hour > * Interval day to minute > * Interval hour to minute > * Interval minute to second -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28203) PythonRDD should respect SparkContext's conf when passing user confMap
Xianjin YE created SPARK-28203: -- Summary: PythonRDD should respect SparkContext's conf when passing user confMap Key: SPARK-28203 URL: https://issues.apache.org/jira/browse/SPARK-28203 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 2.4.3 Reporter: Xianjin YE PythonRDD has several APIs that accept user configs from the Python side. The parameter is called confAsMap, and it is intended to be merged into the RDD's Hadoop configuration. However, confAsMap is first mapped to a fresh Configuration and then merged into SparkContext's Hadoop configuration. The fresh Configuration loads default key values from core-default.xml etc., and those keys may already have been updated in SparkContext's Hadoop configuration, so the defaults override the updated values during the merge. I will submit a PR to fix this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28077) ANSI SQL: OVERLAY function(T312)
[ https://issues.apache.org/jira/browse/SPARK-28077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-28077. --- Resolution: Fixed Assignee: jiaan.geng Fix Version/s: 3.0.0 Issue resolved by pull request 24918 https://github.com/apache/spark/pull/24918 > ANSI SQL: OVERLAY function(T312) > > > Key: SPARK-28077 > URL: https://issues.apache.org/jira/browse/SPARK-28077 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: jiaan.geng >Priority: Major > Fix For: 3.0.0 > > > ||Function||Return Type||Description||Example||Result|| > |{{overlay(string placing string from int [for int])}}|{{text}}|Replace substring|{{overlay('Txxxxas' placing 'hom' from 2 for 4)}}|{{Thomas}}| > For example: > {code:sql} > SELECT OVERLAY('abcdef' PLACING '45' FROM 4) AS "abc45f"; > SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5) AS "yabadaba"; > SELECT OVERLAY('yabadoo' PLACING 'daba' FROM 5 FOR 0) AS "yabadabadoo"; > SELECT OVERLAY('babosa' PLACING 'ubb' FROM 2 FOR 4) AS "bubba"; > {code} > https://www.postgresql.org/docs/11/functions-string.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
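With the function now resolved for 3.0.0, the PostgreSQL statements above should carry over directly; a hedged Spark sketch, assuming the resolved syntax matches the examples:
{code:scala}
import org.apache.spark.sql.SparkSession

object OverlayDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("overlay-demo").getOrCreate()

  // Replaces 4 characters of the input starting at position 2 with 'hom'.
  spark.sql("SELECT OVERLAY('Txxxxas' PLACING 'hom' FROM 2 FOR 4)").show()  // Thomas

  spark.stop()
}
{code}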
[jira] [Assigned] (SPARK-28083) ANSI SQL: LIKE predicate: ESCAPE clause
[ https://issues.apache.org/jira/browse/SPARK-28083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28083: Assignee: Apache Spark > ANSI SQL: LIKE predicate: ESCAPE clause > --- > > Key: SPARK-28083 > URL: https://issues.apache.org/jira/browse/SPARK-28083 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Format: > {noformat} > <like predicate> ::= <character like predicate> | <octet like predicate> > <character like predicate> ::= <row value predicand> <character like predicate part 2> > <character like predicate part 2> ::= [ NOT ] LIKE <character pattern> [ ESCAPE <escape character> ] > <character pattern> ::= <character value expression> > <escape character> ::= <character value expression> > <octet like predicate> ::= <row value predicand> <octet like predicate part 2> > <octet like predicate part 2> ::= [ NOT ] LIKE <octet pattern> [ ESCAPE <escape octet> ] > <octet pattern> ::= <octet value expression> > <escape octet> ::= <octet value expression> > {noformat} > > [https://www.postgresql.org/docs/11/functions-matching.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
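A hedged Scala sketch of how the clause would read from Spark SQL once supported; '#' is an arbitrary user-chosen escape character, and the statements are illustrative since this ticket is precisely about adding the syntax:
{code:scala}
import org.apache.spark.sql.SparkSession

object LikeEscapeDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("like-escape").getOrCreate()

  // '#%' matches a literal percent sign because '#' is declared as the escape
  // character; without ESCAPE, '%' would be a wildcard matching any suffix.
  spark.sql("SELECT '10%' LIKE '10#%' ESCAPE '#'").show()  // expected: true
  spark.sql("SELECT '100' LIKE '10#%' ESCAPE '#'").show()  // expected: false

  spark.stop()
}
{code}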
[jira] [Assigned] (SPARK-28083) ANSI SQL: LIKE predicate: ESCAPE clause
[ https://issues.apache.org/jira/browse/SPARK-28083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28083: Assignee: (was: Apache Spark) > ANSI SQL: LIKE predicate: ESCAPE clause > --- > > Key: SPARK-28083 > URL: https://issues.apache.org/jira/browse/SPARK-28083 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Format: > {noformat} > <like predicate> ::= <character like predicate> | <octet like predicate> > <character like predicate> ::= <row value predicand> <character like predicate part 2> > <character like predicate part 2> ::= [ NOT ] LIKE <character pattern> [ ESCAPE <escape character> ] > <character pattern> ::= <character value expression> > <escape character> ::= <character value expression> > <octet like predicate> ::= <row value predicand> <octet like predicate part 2> > <octet like predicate part 2> ::= [ NOT ] LIKE <octet pattern> [ ESCAPE <escape octet> ] > <octet pattern> ::= <octet value expression> > <escape octet> ::= <octet value expression> > {noformat} > > [https://www.postgresql.org/docs/11/functions-matching.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28202) [Core] [Test] Avoid noises of system props in SparkConfSuite
[ https://issues.apache.org/jira/browse/SPARK-28202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28202: Assignee: Apache Spark > [Core] [Test] Avoid noises of system props in SparkConfSuite > > > Key: SPARK-28202 > URL: https://issues.apache.org/jira/browse/SPARK-28202 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: ShuMing Li >Assignee: Apache Spark >Priority: Trivial > > When the SPARK_HOME environment variable is set and a `spark-defaults.conf` is > present, the `org.apache.spark.util.loadDefaultSparkProperties` method may pollute > the system props. So when the `core/test` module runs, `SparkConfSuite` may fail. > > It's easy to fix by setting `loadDefaults` in `SparkConf` to false. > ``` > [info] - accumulators (5 seconds, 565 milliseconds) > [info] - deprecated configs *** FAILED *** (79 milliseconds) > [info] 7 did not equal 4 (SparkConfSuite.scala:266) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.SparkConfSuite.$anonfun$new$26(SparkConfSuite.scala:266) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149) > [info] at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > ``` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
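A minimal sketch of the proposed remedy; the config key and the count assertion are illustrative, not the actual SparkConfSuite code:
{code:scala}
import org.apache.spark.SparkConf

// With loadDefaults = false, the conf under test ignores spark-defaults.conf
// (found via SPARK_HOME), so count-based assertions stay deterministic.
val conf = new SparkConf(loadDefaults = false)
conf.set("spark.executor.memory", "2g")
assert(conf.getAll.length == 1)  // holds regardless of the developer's environment
{code}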
[jira] [Assigned] (SPARK-28202) [Core] [Test] Avoid noises of system props in SparkConfSuite
[ https://issues.apache.org/jira/browse/SPARK-28202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28202: Assignee: (was: Apache Spark) > [Core] [Test] Avoid noises of system props in SparkConfSuite > > > Key: SPARK-28202 > URL: https://issues.apache.org/jira/browse/SPARK-28202 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: ShuMing Li >Priority: Trivial > > When the SPARK_HOME environment variable is set and a `spark-defaults.conf` is > present, the `org.apache.spark.util.loadDefaultSparkProperties` method may pollute > the system props. So when the `core/test` module runs, `SparkConfSuite` may fail. > > It's easy to fix by setting `loadDefaults` in `SparkConf` to false. > ``` > [info] - accumulators (5 seconds, 565 milliseconds) > [info] - deprecated configs *** FAILED *** (79 milliseconds) > [info] 7 did not equal 4 (SparkConfSuite.scala:266) > [info] org.scalatest.exceptions.TestFailedException: > [info] at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528) > [info] at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527) > [info] at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > [info] at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > [info] at > org.apache.spark.SparkConfSuite.$anonfun$new$26(SparkConfSuite.scala:266) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149) > [info] at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > [info] at > org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) > ``` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28202) [Core] [Test] Avoid noises of system props in SparkConfSuite
ShuMing Li created SPARK-28202: -- Summary: [Core] [Test] Avoid noises of system props in SparkConfSuite Key: SPARK-28202 URL: https://issues.apache.org/jira/browse/SPARK-28202 Project: Spark Issue Type: Test Components: Spark Core Affects Versions: 3.0.0 Reporter: ShuMing Li When the SPARK_HOME environment variable is set and a `spark-defaults.conf` is present, the `org.apache.spark.util.loadDefaultSparkProperties` method may pollute the system props. So when the `core/test` module runs, `SparkConfSuite` may fail. It's easy to fix by setting `loadDefaults` in `SparkConf` to false. ``` [info] - accumulators (5 seconds, 565 milliseconds) [info] - deprecated configs *** FAILED *** (79 milliseconds) [info] 7 did not equal 4 (SparkConfSuite.scala:266) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527) [info] at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) [info] at org.apache.spark.SparkConfSuite.$anonfun$new$26(SparkConfSuite.scala:266) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) [info] at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149) [info] at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) [info] at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) ``` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28201) Revisit MakeDecimal behavior on overflow
[ https://issues.apache.org/jira/browse/SPARK-28201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874765#comment-16874765 ] Marco Gaido commented on SPARK-28201: - I'll create a PR for this ASAP. > Revisit MakeDecimal behavior on overflow > > > Key: SPARK-28201 > URL: https://issues.apache.org/jira/browse/SPARK-28201 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Marco Gaido >Priority: Major > > As pointed out in > https://github.com/apache/spark/pull/20350#issuecomment-505997469, in special > cases of decimal aggregation we use the `MakeDecimal` operator. > This operator's behavior on overflow is not well defined; currently: > - if codegen is enabled, it returns null; > - in interpreted mode, it throws an `IllegalArgumentException`. > So we should make its behavior uniform with other similar cases, and in > particular we should honor the value of the conf introduced in SPARK-23179 > and behave accordingly, i.e.: > - return null if the flag is true; > - throw an `ArithmeticException` if the flag is false. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
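A sketch of the proposed uniform semantics; the helper is illustrative, not the actual MakeDecimal code, though Decimal.setOrNull is the existing utility that returns null when a value does not fit:
{code:scala}
import org.apache.spark.sql.types.Decimal

// Illustrative sketch: both the codegen and the interpreted paths would consult
// the same null-on-overflow flag instead of diverging in behavior.
def makeDecimal(unscaled: Long, precision: Int, scale: Int, nullOnOverflow: Boolean): Decimal = {
  val result = new Decimal().setOrNull(unscaled, precision, scale)  // null on overflow
  if (result == null && !nullOnOverflow) {
    throw new ArithmeticException(
      s"Unscaled value $unscaled cannot be represented as Decimal($precision, $scale)")
  }
  result
}
{code}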
[jira] [Created] (SPARK-28201) Revisit MakeDecimal behavior on overflow
Marco Gaido created SPARK-28201: --- Summary: Revisit MakeDecimal behavior on overflow Key: SPARK-28201 URL: https://issues.apache.org/jira/browse/SPARK-28201 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Marco Gaido As pointed out in https://github.com/apache/spark/pull/20350#issuecomment-505997469, in special cases of decimal aggregation we use the `MakeDecimal` operator. This operator's behavior on overflow is not well defined; currently: - if codegen is enabled, it returns null; - in interpreted mode, it throws an `IllegalArgumentException`. So we should make its behavior uniform with other similar cases, and in particular we should honor the value of the conf introduced in SPARK-23179 and behave accordingly, i.e.: - return null if the flag is true; - throw an `ArithmeticException` if the flag is false. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28198) Add mapPartitionsInPandas to allow an iterator of DataFrames
[ https://issues.apache.org/jira/browse/SPARK-28198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28198: Assignee: (was: Apache Spark) > Add mapPartitionsInPandas to allow an iterator of DataFrames > > > Key: SPARK-28198 > URL: https://issues.apache.org/jira/browse/SPARK-28198 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-26412 added a new type of Pandas UDF called Scalar Iter. It should be > good to use this without the limitation on length. > This JIRA targets adding {{mapPartitionsInPandas}}, which leverages this Pandas > UDF and the Arrow / Pandas integration in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28198) Add mapPartitionsInPandas to allow an iterator of DataFrames
[ https://issues.apache.org/jira/browse/SPARK-28198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28198: Assignee: Apache Spark > Add mapPartitionsInPandas to allow an iterator of DataFrames > > > Key: SPARK-28198 > URL: https://issues.apache.org/jira/browse/SPARK-28198 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > SPARK-26412 added a new type of Pandas UDF called Scalar Iter. It should be > good to use this without the limitation on length. > This JIRA targets adding {{mapPartitionsInPandas}}, which leverages this Pandas > UDF and the Arrow / Pandas integration in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28185) Trigger pandas iterator UDF closing stuff when iterator stop early
[ https://issues.apache.org/jira/browse/SPARK-28185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28185. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24986 [https://github.com/apache/spark/pull/24986] > Trigger pandas iterator UDF closing stuff when iterator stop early > -- > > Key: SPARK-28185 > URL: https://issues.apache.org/jira/browse/SPARK-28185 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Fix the issue that a Pandas UDF's cleanup logic is not triggered when the > iterator stops early. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28185) Trigger pandas iterator UDF closing stuff when iterator stop early
[ https://issues.apache.org/jira/browse/SPARK-28185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28185: Assignee: Weichen Xu > Trigger pandas iterator UDF closing stuff when iterator stop early > -- > > Key: SPARK-28185 > URL: https://issues.apache.org/jira/browse/SPARK-28185 > Project: Spark > Issue Type: Bug > Components: ML, SQL >Affects Versions: 2.4.3 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Fix the issue that a Pandas UDF's cleanup logic is not triggered when the > iterator stops early. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28200) Overflow handling in `ExpressionEncoder`
Marco Gaido created SPARK-28200: --- Summary: Overflow handling in `ExpressionEncoder` Key: SPARK-28200 URL: https://issues.apache.org/jira/browse/SPARK-28200 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Marco Gaido As pointed out in https://github.com/apache/spark/pull/20350, we currently do not check for overflow when serializing a Java/Scala `BigDecimal` in `ExpressionEncoder` / `ScalaReflection`. We should add this check there too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
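A minimal sketch of the behavior in question, assuming the default DecimalType(38, 18) mapping for BigDecimal fields; the case class and the value are illustrative:
{code:scala}
import org.apache.spark.sql.SparkSession

// Encoded as DecimalType(38, 18) by default, which leaves 20 integral digits.
case class Entry(d: BigDecimal)

object EncoderOverflowDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("encoder-overflow").getOrCreate()
  import spark.implicits._

  // The 25-digit value below cannot fit 20 integral digits. Without an overflow
  // check in the serializer, it is not rejected up front; it surfaces later as
  // a null instead of a clear error.
  Seq(Entry(BigDecimal("1" * 25))).toDS().show()

  spark.stop()
}
{code}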
[jira] [Assigned] (SPARK-28199) Remove usage of ProcessingTime in Spark codebase
[ https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28199: Assignee: Apache Spark > Remove usage of ProcessingTime in Spark codebase > > > Key: SPARK-28199 > URL: https://issues.apache.org/jira/browse/SPARK-28199 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Minor > > Even though ProcessingTime has been deprecated since 2.2.0, it is still used in > the Spark codebase, and the alternative Spark proposes actually uses deprecated > methods, which feels circular: the usage could never be removed. > This issue targets removing the usage of ProcessingTime in the Spark codebase by > adding a new class to replace it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
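For context, a short sketch of the stable user-facing factory that replaces the deprecated class; the rate source and the trigger interval are illustrative:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object TriggerDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("trigger-demo").getOrCreate()
  val df = spark.readStream.format("rate").load()  // illustrative streaming source

  // Trigger.ProcessingTime is the non-deprecated entry point for users; this
  // issue is about giving the internals an equivalent non-deprecated class too.
  val query = df.writeStream
    .format("console")
    .trigger(Trigger.ProcessingTime("10 seconds"))
    .start()

  query.awaitTermination()
}
{code}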
[jira] [Assigned] (SPARK-28199) Remove usage of ProcessingTime in Spark codebase
[ https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-28199: Assignee: (was: Apache Spark) > Remove usage of ProcessingTime in Spark codebase > > > Key: SPARK-28199 > URL: https://issues.apache.org/jira/browse/SPARK-28199 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Minor > > Even though ProcessingTime has been deprecated since 2.2.0, it is still used in > the Spark codebase, and the alternative Spark proposes actually uses deprecated > methods, which feels circular: the usage could never be removed. > This issue targets removing the usage of ProcessingTime in the Spark codebase by > adding a new class to replace it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28199) Remove usage of ProcessingTime in Spark codebase
[ https://issues.apache.org/jira/browse/SPARK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874741#comment-16874741 ] Jungtaek Lim commented on SPARK-28199: -- Working on this. I initially treated this as a minor change, but realized I had to introduce a new class, so I filed an issue. > Remove usage of ProcessingTime in Spark codebase > > > Key: SPARK-28199 > URL: https://issues.apache.org/jira/browse/SPARK-28199 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Minor > > Even though ProcessingTime has been deprecated since 2.2.0, it is still used in > the Spark codebase, and the alternative Spark proposes actually uses deprecated > methods, which feels circular: the usage could never be removed. > This issue targets removing the usage of ProcessingTime in the Spark codebase by > adding a new class to replace it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28199) Remove usage of ProcessingTime in Spark codebase
Jungtaek Lim created SPARK-28199: Summary: Remove usage of ProcessingTime in Spark codebase Key: SPARK-28199 URL: https://issues.apache.org/jira/browse/SPARK-28199 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Jungtaek Lim Even though ProcessingTime has been deprecated since 2.2.0, it is still used in the Spark codebase, and the alternative Spark proposes actually uses deprecated methods, which feels circular: the usage could never be removed. This issue targets removing the usage of ProcessingTime in the Spark codebase by adding a new class to replace it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28198) Add mapPartitionsInPandas to allow an iterator of DataFrames
Hyukjin Kwon created SPARK-28198: Summary: Add mapPartitionsInPandas to allow an iterator of DataFrames Key: SPARK-28198 URL: https://issues.apache.org/jira/browse/SPARK-28198 Project: Spark Issue Type: New Feature Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon SPARK-26412 added a new type of Pandas UDF called Scalar Iter. It should be good to use this without the limitation on length. This JIRA targets adding {{mapPartitionsInPandas}}, which leverages this Pandas UDF and the Arrow / Pandas integration in Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org