[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729357#comment-16729357 ] Saisai Shao commented on SPARK-25299:
-------------------------------------
[~jealous] Can we have a doc about this proposed solution for us to review?

> Use remote storage for persisting shuffle data
> ----------------------------------------------
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
> Issue Type: New Feature
> Components: Shuffle
> Affects Versions: 2.4.0
> Reporter: Matt Cheah
> Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to the local disk of the worker nodes. If executors crash, the external shuffle service can continue to serve the shuffle data that was written beyond the lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the external shuffle service is deployed on every worker node. The shuffle service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented right now. Particularly:
> * If any external shuffle service process or node becomes unavailable, all applications that had an executor that ran on that node must recompute the shuffle blocks that were lost.
> * Similarly to the above, the external shuffle service must be kept running at all times, which may waste resources when no applications are using that shuffle service node.
> * Mounting local storage can prevent users from taking advantage of desirable isolation benefits from using containerized environments, like Kubernetes. We had an external shuffle service implementation in an early prototype of the Kubernetes backend, but it was rejected due to its strict requirement to be able to mount hostPath volumes or other persistent volume setups.
> In the following [architecture discussion document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] (note: _not_ an SPIP), we brainstorm various high-level architectures for improving the external shuffle service in a way that addresses the above problems. The purpose of this umbrella JIRA is to promote additional discussion on how we can approach these problems, both at the architecture level and the implementation level. We anticipate filing sub-issues that break down the tasks that must be completed to achieve this goal.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26449) Missing Dataframe.transform API in Python API
[ https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26449:
---------------------------------
Summary: Missing Dataframe.transform API in Python API (was: Dataframe.transform)

> Missing Dataframe.transform API in Python API
> ---------------------------------------------
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
> Issue Type: New Feature
> Components: PySpark, SQL
> Affects Versions: 2.4.0
> Reporter: Hanan Shteingart
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I would like to chain custom transformations, as is suggested in this [blog post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55]. This would allow writing something like the following:
> {code:python}
> from pyspark.sql.functions import lit
>
> def with_greeting(df):
>     return df.withColumn("greeting", lit("hi"))
>
> def with_something(df, something):
>     return df.withColumn("something", lit(something))
>
> data = [("jose", 1), ("li", 2), ("liz", 3)]
> source_df = spark.createDataFrame(data, ["name", "age"])
> actual_df = (source_df
>     .transform(with_greeting)
>     .transform(lambda df: with_something(df, "crazy")))
> actual_df.show()
> +----+---+--------+---------+
> |name|age|greeting|something|
> +----+---+--------+---------+
> |jose|  1|      hi|    crazy|
> |  li|  2|      hi|    crazy|
> | liz|  3|      hi|    crazy|
> +----+---+--------+---------+
> {code}
> The only thing needed to accomplish this is the following simple method on DataFrame:
> {code:python}
> from pyspark.sql.dataframe import DataFrame
>
> def transform(self, f):
>     return f(self)
>
> DataFrame.transform = transform
> {code}
> I volunteer to do the pull request if approved (at least the Python part).
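The proposed `transform` is nothing more than `f(self)`, which is what lets free functions be chained in method-call style. A minimal sketch of the same pattern on a plain Python class, runnable without Spark (the `Frame` class below is a hypothetical stand-in for `DataFrame`, not a Spark API):

```python
# Sketch of the transform-chaining pattern from the ticket, on a toy class.
# `Frame` is a hypothetical stand-in for pyspark's DataFrame; only the
# one-line `transform` method is the actual proposal.
class Frame:
    def __init__(self, rows):
        self.rows = rows

    def with_column(self, name, value):
        # Return a new Frame with a constant column added (immutable style).
        return Frame([{**row, name: value} for row in self.rows])

    def transform(self, f):
        # The whole proposal: apply f to self, so free functions chain.
        return f(self)

def with_greeting(df):
    return df.with_column("greeting", "hi")

def with_something(df, something):
    return df.with_column("something", something)

source = Frame([{"name": "jose", "age": 1}, {"name": "li", "age": 2}])
actual = (source
          .transform(with_greeting)
          .transform(lambda df: with_something(df, "crazy")))
```

Because `transform` just forwards `self`, a lambda can adapt multi-argument functions like `with_something` without any extra machinery.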
[jira] [Resolved] (SPARK-26452) Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: Java heap space
[ https://issues.apache.org/jira/browse/SPARK-26452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26452.
----------------------------------
Resolution: Invalid

Please don't just copy and paste the error message. Include information about expected input and output, reproducible code if possible, and your analysis.

> Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: Java heap space
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-26452
> URL: https://issues.apache.org/jira/browse/SPARK-26452
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 2.2.0
> Reporter: tommy duan
> Priority: Major
>
> [Stage 1852:===>(896 + 3) / 900] [Stage 1852:===>(897 + 3) / 900] [Stage 1852:===>(899 + 1) / 900] [Stage 1853:> (0 + 0) / 900]
> 18/12/27 06:03:45 WARN util.Utils: Suppressing exception in finally: Java heap space
> java.lang.OutOfMemoryError: Java heap space
> at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
> at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
> at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
> at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
> at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
> at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
> at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
> at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
> at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
> at java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
> at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
> at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
> at org.apache.spark.serializer.JavaSerializationStream.close(JavaSerializer.scala:57)
> at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$1.apply$mcV$sp(TorrentBroadcast.scala:278)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346)
> at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:277)
> at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126)
> at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
> at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1488)
> at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1006)
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:776)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:775)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:775)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1259)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
> Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space
> at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
> at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
> at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
> at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
> at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
> at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
> at
[jira] [Commented] (SPARK-26449) Dataframe.transform
[ https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729301#comment-16729301 ] Hyukjin Kwon commented on SPARK-26449:
--------------------------------------
Seems like Scala side has it but Python doesn't. Can you open a PR with a regression test?

> Dataframe.transform
[jira] [Updated] (SPARK-26449) Missing Dataframe.transform API in Python API
[ https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26449:
---------------------------------
Issue Type: Improvement (was: New Feature)

> Missing Dataframe.transform API in Python API
[jira] [Updated] (SPARK-26449) Missing Dataframe.transform API in Python API
[ https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26449:
---------------------------------
Labels: (was: patch)

> Missing Dataframe.transform API in Python API
[jira] [Updated] (SPARK-26449) Missing Dataframe.transform API in Python API
[ https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26449:
---------------------------------
Component/s: PySpark

> Missing Dataframe.transform API in Python API
[jira] [Commented] (SPARK-26438) Driver waits to spark.sql.broadcastTimeout before throwing OutOfMemoryError - is this by design?
[ https://issues.apache.org/jira/browse/SPARK-26438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729300#comment-16729300 ] Hyukjin Kwon commented on SPARK-26438:
--------------------------------------
Let's ask questions like this on the Spark mailing list before filing an issue. You could get a better answer there.

> Driver waits to spark.sql.broadcastTimeout before throwing OutOfMemoryError - is this by design?
> ------------------------------------------------------------------------------------------------
>
> Key: SPARK-26438
> URL: https://issues.apache.org/jira/browse/SPARK-26438
> Project: Spark
> Issue Type: Question
> Components: Spark Core, SQL
> Affects Versions: 2.3.0
> Reporter: Shay Elbaz
> Priority: Trivial
>
> When broadcasting a too-large DataFrame, the driver does not fail immediately when the broadcast thread throws OutOfMemoryError. Instead it waits for `spark.sql.broadcastTimeout` to elapse. Is that by design or a bug?
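For context, the timeout in question is an ordinary SQL configuration. A sketch of where the relevant knobs are set (the values below are illustrative examples, not recommendations):

```python
# Illustrative PySpark session config around broadcast behavior.
# The values here are examples only.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # How long the driver waits for a broadcast to complete, in seconds.
         .config("spark.sql.broadcastTimeout", "300")
         # Tables below this size (in bytes) may be broadcast automatically;
         # set to -1 to disable automatic broadcast joins entirely.
         .config("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
         .getOrCreate())
```

If the broadcast thread dies with an OutOfMemoryError, the question in the ticket is whether the driver should fail fast rather than wait out the configured timeout.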
[jira] [Resolved] (SPARK-26438) Driver waits to spark.sql.broadcastTimeout before throwing OutOfMemoryError - is this by design?
[ https://issues.apache.org/jira/browse/SPARK-26438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26438.
----------------------------------
Resolution: Invalid

> Driver waits to spark.sql.broadcastTimeout before throwing OutOfMemoryError - is this by design?
[jira] [Commented] (SPARK-26268) Decouple shuffle data from Spark deployment
[ https://issues.apache.org/jira/browse/SPARK-26268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729297#comment-16729297 ] Peiyu Zhuang commented on SPARK-26268:
--------------------------------------
See [SPARK-25299|https://issues.apache.org/jira/browse/SPARK-25299]; we are trying to implement a shuffle manager with a storage plugin that can support different kinds of external/local storage. The work will be open-sourced soon.

> Decouple shuffle data from Spark deployment
> -------------------------------------------
>
> Key: SPARK-26268
> URL: https://issues.apache.org/jira/browse/SPARK-26268
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.4.0
> Reporter: Ben Sidhom
> Priority: Major
>
> Right now the batch scheduler assumes that shuffle data is tied to executors. As a result, when an executor is lost, any map tasks that ran on that executor are rescheduled unless the "external" shuffle service is being used. Note that this service is only external in the sense that it does not live within executors themselves; its implementation cannot be swapped out, and it is assumed to speak the BlockManager language.
> The following changes would facilitate external shuffle (see SPARK-25299 for motivation):
> * Do not rerun map tasks on lost executors when shuffle data is stored externally. For example, this could be determined by a property or by an additional method that all ShuffleManagers implement.
> * Do not assume that shuffle data is stored in the standard BlockManager format or that a BlockManager is or must be available to ShuffleManagers.
> Note that only the first change is actually required to realize the benefits of remote shuffle implementations, as a phony (or null) BlockManager can be used by shuffle implementations.
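The two changes proposed above can be sketched abstractly. The following is a hypothetical sketch in Python (Spark's shuffle managers are Scala, and none of these names are actual Spark APIs): a swappable storage interface that exposes the one property the scheduler needs, namely whether shuffle blocks survive executor loss.

```python
# Hypothetical sketch of a pluggable shuffle-storage interface; none of these
# names are actual Spark APIs. The key idea from the ticket: on executor loss,
# the scheduler reruns map tasks only if blocks die with the executor.
from abc import ABC, abstractmethod

class ShuffleStorage(ABC):
    @property
    @abstractmethod
    def blocks_survive_executor_loss(self) -> bool:
        """True if shuffle blocks outlive the executor that wrote them."""

    @abstractmethod
    def write_block(self, shuffle_id: int, map_id: int, data: bytes) -> None:
        """Persist one map output block."""

    @abstractmethod
    def read_block(self, shuffle_id: int, map_id: int) -> bytes:
        """Fetch one map output block."""

class InMemoryRemoteStorage(ShuffleStorage):
    """Stand-in for a remote store (e.g. HDFS/S3) to exercise the interface."""
    def __init__(self):
        self._blocks = {}

    @property
    def blocks_survive_executor_loss(self) -> bool:
        return True  # remote storage is decoupled from executor lifetime

    def write_block(self, shuffle_id, map_id, data):
        self._blocks[(shuffle_id, map_id)] = data

    def read_block(self, shuffle_id, map_id):
        return self._blocks[(shuffle_id, map_id)]

def must_rerun_map_tasks(storage: ShuffleStorage) -> bool:
    # What the scheduler change proposed in the ticket would consult
    # when an executor is lost.
    return not storage.blocks_survive_executor_loss
```

This also illustrates the ticket's closing note: the second change (dropping the BlockManager format assumption) is independent, since a storage implementation need not involve a BlockManager at all.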
[jira] [Comment Edited] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729292#comment-16729292 ] Peiyu Zhuang edited comment on SPARK-25299 at 12/27/18 3:31 AM:
----------------------------------------------------------------
We are currently working on a solution that is similar to option 3 mentioned in this [architecture discussion document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]. The idea is to refactor the current shuffle manager and extract a common storage interface. Users could supply different storage implementations for shuffle data and spill data. We have some preliminary test results. Since the shuffle manager is critical to Spark, we want to make sure it functions just as the original shuffle manager does. And it will be open-source in the near future.

was (Author: jealous): We are currently working on a solution that is similar to option 3 mentioned in this [architecture discussion document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]. The idea is to refactor the current shuffle manager and extract a common storage interface. Users could supply different storage implementations for shuffle data and spill data. We have some preliminary test results. Since the shuffle manager is critical to Spark, we want to make sure it functions just as the original shuffle manager does. And it will be open-sourced in the near future.

> Use remote storage for persisting shuffle data
[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729294#comment-16729294 ] Carson Wang commented on SPARK-25299:
-------------------------------------
I am on a vacation and will be back on January 2, 2019. Please expect delayed response.

> Use remote storage for persisting shuffle data
[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729292#comment-16729292 ] Peiyu Zhuang commented on SPARK-25299:
--------------------------------------
We are currently working on a solution that is similar to option 3 mentioned in this [architecture discussion document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]. The idea is to refactor the current shuffle manager and extract a common storage interface. Users could supply different storage implementations for shuffle data and spill data. We have some preliminary test results. Since the shuffle manager is critical to Spark, we want to make sure it functions just as the original shuffle manager does. And it will be open-sourced in the near future.

> Use remote storage for persisting shuffle data
[jira] [Resolved] (SPARK-26424) Use java.time API in timestamp/date functions
[ https://issues.apache.org/jira/browse/SPARK-26424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26424.
---------------------------------
Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23358: [https://github.com/apache/spark/pull/23358]

> Use java.time API in timestamp/date functions
> ---------------------------------------------
>
> Key: SPARK-26424
> URL: https://issues.apache.org/jira/browse/SPARK-26424
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Maxim Gekk
> Assignee: Maxim Gekk
> Priority: Minor
> Fix For: 3.0.0
>
> Currently, date/time expressions like DateFormatClass, UnixTimestamp, and FromUnixTime use SimpleDateFormat to parse/format dates and timestamps. SimpleDateFormat cannot parse timestamps with microsecond precision. This ticket aims to switch these expressions to TimestampFormatter, which is able to parse timestamps with microsecond precision.
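The precision gap driving this ticket is easy to see by analogy in Python (the actual change is in Spark's Scala code, replacing SimpleDateFormat-based parsing with a java.time-backed TimestampFormatter): a microsecond-aware parser round-trips the full fraction, while a millisecond-only pipeline silently drops the last three digits.

```python
# Analogy in Python: %f keeps microsecond precision, the way java.time's
# DateTimeFormatter can and the millisecond-only SimpleDateFormat cannot.
from datetime import datetime

raw = "2018-12-27 03:31:00.123456"
ts = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")

# A millisecond-only pipeline keeps just the first three fractional digits:
millis_only = ts.replace(microsecond=(ts.microsecond // 1000) * 1000)
```

Here `ts` retains all six fractional digits (so formatting it back reproduces `raw`), while `millis_only` has been truncated to `.123000`.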
[jira] [Assigned] (SPARK-26424) Use java.time API in timestamp/date functions
[ https://issues.apache.org/jira/browse/SPARK-26424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26424: --- Assignee: Maxim Gekk > Use java.time API in timestamp/date functions > -- > > Key: SPARK-26424 > URL: https://issues.apache.org/jira/browse/SPARK-26424 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > > Currently, date/time expressions like DateFormatClass, UnixTimestamp, and > FromUnixTime use SimpleDateFormat to parse/format dates and timestamps. > SimpleDateFormat cannot parse timestamps with microsecond precision. This > ticket aims to switch these expressions to TimestampFormatter, which is able to > parse timestamps with microsecond precision. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
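The precision gap that motivates SPARK-26424 can be illustrated with a small, hedged sketch in plain Python (not Spark code): a parser with a true microseconds directive keeps all six fractional digits, which is the capability java.time's formatters add over SimpleDateFormat, whose `S` pattern only understands milliseconds.

```python
from datetime import datetime

# Parse a timestamp carrying six fractional digits. Python's %f, like a
# java.time DateTimeFormatter fraction-of-second field, preserves full
# microsecond precision; SimpleDateFormat has no equivalent directive,
# which is why the old SimpleDateFormat-based expressions mishandled
# such input.
ts = datetime.strptime("2018-12-28 10:20:30.123456", "%Y-%m-%d %H:%M:%S.%f")
print(ts.microsecond)  # 123456
```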
[jira] [Created] (SPARK-26452) Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: Java heap space
tommy duan created SPARK-26452: -- Summary: Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: Java heap space Key: SPARK-26452 URL: https://issues.apache.org/jira/browse/SPARK-26452 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.2.0 Reporter: tommy duan [Stage 1852:===>(896 + 3) / 900] [Stage 1852:===>(897 + 3) / 900] [Stage 1852:===>(899 + 1) / 900] [Stage 1853:> (0 + 0) / 900]18/12/27 06:03:45 WARN util.Utils: Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271) at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205) at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) at java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822) at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719) at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740) at org.apache.spark.serializer.JavaSerializationStream.close(JavaSerializer.scala:57) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$1.apply$mcV$sp(TorrentBroadcast.scala:278) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346) at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:277) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1488) at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1006) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930) at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:776) at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:775) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:775) at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1259) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1711) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271) at org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) at org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) at
net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205) at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43) at
[jira] [Updated] (SPARK-26439) Introduce WorkerOffer reservation mechanism for Barrier TaskSet
[ https://issues.apache.org/jira/browse/SPARK-26439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-26439: - Description: Currently, Barrier TaskSet has a hard requirement that tasks can only be launched in a single resourceOffers round with enough slots (or sufficient resources), but this cannot be guaranteed even with enough slots due to task locality delay scheduling. So, it is very likely that a Barrier TaskSet gets a chunk of sufficient resources after all the trouble, but gives it up easily just because one of its pending tasks cannot be scheduled. Furthermore, it causes severe resource competition between TaskSets and jobs and introduces unclear semantics for DynamicAllocation. This JIRA tries to introduce a WorkerOffer reservation mechanism for Barrier TaskSet, which allows a Barrier TaskSet to reserve WorkerOffers in each resourceOffers round, and launch tasks at the same time once it accumulates sufficient resources. In this way, we relax the requirement of resources for the Barrier TaskSet. To avoid the deadlock which may be introduced by several Barrier TaskSets holding the reserved WorkerOffers for a long time, we'll ask Barrier TaskSets to force-release part of the reserved WorkerOffers on demand. So, it is highly possible that each Barrier TaskSet would be launched in the end. To integrate with DynamicAllocation, the most effective way I can imagine is adding new events, e.g. ExecutorReservedEvent and ExecutorReleasedEvent, which behave like a busy executor with running tasks or an idle executor without running tasks. Thus, ExecutorAllocationManager would not let the executor go if it knows there are reserved resources on that executor. was: Currently, Barrier TaskSet has a hard requirement that tasks can only be launched in a single resourceOffers round with enough slots (or sufficient resources), but this cannot be guaranteed even with enough slots due to task locality delay scheduling.
So, it is very likely that a Barrier TaskSet gets a chunk of sufficient resources after all the trouble, but gives it up easily just because one of its pending tasks cannot be scheduled. Furthermore, it causes severe resource competition between TaskSets and jobs and introduces unclear semantics for DynamicAllocation. This JIRA tries to introduce a WorkOffer reservation mechanism for Barrier TaskSet, which allows a Barrier TaskSet to reserve WorkOffers in each resourceOffers round, and launch tasks at the same time once it accumulates sufficient resources. In this way, we relax the requirement of resources for the Barrier TaskSet. To avoid the deadlock which may be introduced by several Barrier TaskSets holding the reserved WorkOffers for a long time, we'll ask Barrier TaskSets to force-release part of the reserved WorkOffers on demand. So, it is highly possible that each Barrier TaskSet would be launched in the end. To integrate with DynamicAllocation, the most effective way I can imagine is adding new events, e.g. ExecutorReservedEvent and ExecutorReleasedEvent, which behave like a busy executor with running tasks or an idle executor without running tasks. Thus, ExecutorAllocationManager would not let the executor go if it knows there are reserved resources on that executor. > Introduce WorkerOffer reservation mechanism for Barrier TaskSet > --- > > Key: SPARK-26439 > URL: https://issues.apache.org/jira/browse/SPARK-26439 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wuyi >Priority: Major > Labels: performance > Fix For: 2.4.0 > > > Currently, Barrier TaskSet has a hard requirement that tasks can only be > launched in a single resourceOffers round with enough slots (or sufficient resources), > but this cannot be guaranteed even with enough slots due to task locality delay > scheduling.
> So, it is very likely that a Barrier TaskSet gets a chunk of sufficient resources after > all the trouble, but gives it up easily just because one of its pending tasks cannot be > scheduled. Furthermore, it causes severe resource competition between TaskSets and jobs > and introduces unclear semantics for DynamicAllocation. > This JIRA tries to introduce a WorkerOffer reservation mechanism for Barrier TaskSet, which > allows a Barrier TaskSet to reserve WorkerOffers in each resourceOffers round, and launch > tasks at the same time once it accumulates sufficient resources. In this way, we > relax the requirement of resources for the Barrier TaskSet. To avoid the deadlock which > may be introduced by several Barrier TaskSets holding the reserved WorkerOffers for a > long time, we'll ask Barrier TaskSets to force releasing
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729261#comment-16729261 ] Jacky Li commented on SPARK-24630: -- Actually, I encountered this scenario earlier, so we have implemented some commands for using Spark Streaming (or Structured Streaming) on Apache CarbonData; you can refer to the StreamSQL section in this [doc|http://carbondata.apache.org/streaming-guide.html] for more detail. It would be good if the Spark community can use similar or the same syntax; if possible, future versions of CarbonData can then migrate to Spark's syntax. > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP V2.pdf > > > At present, KafkaSQL, Flink SQL (which is actually based on Calcite), > SQLStream, and StormSQL all provide a stream-type SQL interface, with which users > with little knowledge of streaming can easily develop a stream processing model. > In Spark, we can also support a SQL API based on Structured Streaming. > To support SQL Streaming, there are two key points: > 1. The analysis phase should be able to parse streaming-type SQL. > 2. The Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26439) Introduce WorkerOffer reservation mechanism for Barrier TaskSet
[ https://issues.apache.org/jira/browse/SPARK-26439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-26439: - Summary: Introduce WorkerOffer reservation mechanism for Barrier TaskSet (was: Introduce WorkOffer reservation mechanism for Barrier TaskSet) > Introduce WorkerOffer reservation mechanism for Barrier TaskSet > --- > > Key: SPARK-26439 > URL: https://issues.apache.org/jira/browse/SPARK-26439 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wuyi >Priority: Major > Labels: performance > Fix For: 2.4.0 > > > Currently, Barrier TaskSet has a hard requirement that tasks can only be > launched in a single resourceOffers round with enough slots (or sufficient resources), > but this cannot be guaranteed even with enough slots due to task locality delay > scheduling. > So, it is very likely that a Barrier TaskSet gets a chunk of sufficient resources after > all the trouble, but gives it up easily just because one of its pending tasks cannot be > scheduled. Furthermore, it causes severe resource competition between TaskSets and jobs > and introduces unclear semantics for DynamicAllocation. > This JIRA tries to introduce a WorkOffer reservation mechanism for Barrier TaskSet, which > allows a Barrier TaskSet to reserve WorkOffers in each resourceOffers round, and launch > tasks at the same time once it accumulates sufficient resources. In this way, we > relax the requirement of resources for the Barrier TaskSet. To avoid the deadlock which > may be introduced by several Barrier TaskSets holding the reserved WorkOffers for a > long time, we'll ask Barrier TaskSets to force-release part of the reserved WorkOffers > on demand. So, it is highly possible that each Barrier TaskSet would be launched in the > end. > To integrate with DynamicAllocation, > the most effective way I can imagine is adding new events, e.g.
> ExecutorReservedEvent and ExecutorReleasedEvent, which behave like a busy > executor with running tasks or an idle executor without running tasks. Thus, > ExecutorAllocationManager > would not let the executor go if it knows there are reserved > resources on that > executor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
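The mechanism proposed in SPARK-26439 can be sketched outside Spark (hypothetical class and method names; the real implementation would live in the Scala scheduler): offers accumulate across resourceOffers rounds, tasks launch together only once enough slots are reserved, and a force-release hook models the deadlock-avoidance path.

```python
class BarrierReservation:
    """Illustrative stand-in for the proposed WorkerOffer reservation."""

    def __init__(self, required_slots):
        self.required_slots = required_slots
        self.reserved = []  # offers held between scheduling rounds

    def offer(self, worker_offers):
        """Reserve offers; return the full set once enough have accumulated."""
        self.reserved.extend(worker_offers)
        if len(self.reserved) >= self.required_slots:
            launched, self.reserved = self.reserved, []
            return launched  # launch all barrier tasks at the same time
        return None  # keep waiting while holding the reservation

    def force_release(self, n):
        """Give back n reserved offers on demand to avoid deadlock."""
        released, self.reserved = self.reserved[:n], self.reserved[n:]
        return released


res = BarrierReservation(required_slots=3)
assert res.offer(["exec-1"]) is None   # round 1: not enough slots yet
assert res.offer(["exec-2"]) is None   # round 2: still reserving
launched = res.offer(["exec-3"])       # round 3: enough slots accumulated
print(launched)                        # ['exec-1', 'exec-2', 'exec-3']
```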
[jira] [Created] (SPARK-26451) Change lead/lag argument name from count to offset
Deepyaman Datta created SPARK-26451: --- Summary: Change lead/lag argument name from count to offset Key: SPARK-26451 URL: https://issues.apache.org/jira/browse/SPARK-26451 Project: Spark Issue Type: Documentation Components: PySpark, SQL Affects Versions: 2.4.0 Reporter: Deepyaman Datta -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26378) Queries of wide CSV/JSON data slowed after SPARK-26151
[ https://issues.apache.org/jira/browse/SPARK-26378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-26378: -- Description: A recent change significantly slowed the queries of wide CSV tables. For example, queries against a 6000 column table slowed by 45-48% when queried with a single executor. The [PR for SPARK-26151|https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2] changed FailureSafeParser#toResultRow such that the returned function recreates every row, even when the associated input record has no parsing issues and the user specified no corrupt record field in his/her schema. This extra processing is responsible for the slowdown. The change to FailureSafeParser also impacted queries of wide JSON tables as well. I propose that a row should be recreated only if there is a parsing error or columns need to be shifted due to the existence of a corrupt column field in the user-supplied schema. Otherwise, the row should be used as-is. was: A recent change significantly slowed the queries of wide CSV tables. For example, queries against a 6000 column table slowed by 45-48% when queried with a single executor. The [PR for SPARK-26151|https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2] changed FailureSafeParser#toResultRow such that the returned function recreates every row, even when the associated input record has no parsing issues and the user specified no corrupt record field in his/her schema. This extra processing is responsible for the slowdown. I propose that a row should be recreated only if there is a parsing error or columns need to be shifted due to the existence of a corrupt column field in the user-supplied schema. Otherwise, the row should be used as-is. 
> Queries of wide CSV/JSON data slowed after SPARK-26151 > -- > > Key: SPARK-26378 > URL: https://issues.apache.org/jira/browse/SPARK-26378 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Priority: Major > > A recent change significantly slowed the queries of wide CSV tables. For > example, queries against a 6000 column table slowed by 45-48% when queried > with a single executor. > > The [PR for > SPARK-26151|https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2] > changed FailureSafeParser#toResultRow such that the returned function > recreates every row, even when the associated input record has no parsing > issues and the user specified no corrupt record field in his/her schema. This > extra processing is responsible for the slowdown. > The change to FailureSafeParser also impacted queries of wide JSON tables as > well. > I propose that a row should be recreated only if there is a parsing error or > columns need to be shifted due to the existence of a corrupt column field in > the user-supplied schema. Otherwise, the row should be used as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26378) Queries of wide CSV/JSON data slowed after SPARK-26151
[ https://issues.apache.org/jira/browse/SPARK-26378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-26378: -- Summary: Queries of wide CSV/JSON data slowed after SPARK-26151 (was: Queries of wide CSV data slowed after SPARK-26151) > Queries of wide CSV/JSON data slowed after SPARK-26151 > -- > > Key: SPARK-26378 > URL: https://issues.apache.org/jira/browse/SPARK-26378 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Priority: Major > > A recent change significantly slowed the queries of wide CSV tables. For > example, queries against a 6000 column table slowed by 45-48% when queried > with a single executor. > > The [PR for > SPARK-26151|https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2] > changed FailureSafeParser#toResultRow such that the returned function > recreates every row, even when the associated input record has no parsing > issues and the user specified no corrupt record field in his/her schema. This > extra processing is responsible for the slowdown. > > I propose that a row should be recreated only if there is a parsing error or > columns need to be shifted due to the existence of a corrupt column field in > the user-supplied schema. Otherwise, the row should be used as-is. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
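The fix proposed for SPARK-26378 is easy to sketch in miniature (a hypothetical Python stand-in, not Spark's actual FailureSafeParser): build the row converter once per schema, and on the happy path, with no parsing error and no corrupt-record column, return the parsed row as-is instead of recreating it.

```python
def make_result_row_fn(schema, corrupt_field=None):
    """Build a toResultRow-style converter once per schema (illustrative)."""
    width = len(schema)
    if corrupt_field is None:
        # Happy path: no corrupt-record column requested, so a successfully
        # parsed row is returned untouched, avoiding the per-row rebuild
        # that caused the reported 45-48% slowdown.
        def convert(parsed, raw, error):
            if error is None:
                return parsed
            return [None] * width  # bad record, no corrupt column to fill
    else:
        corrupt_idx = schema.index(corrupt_field)
        def convert(parsed, raw, error):
            # Columns must be shifted to make room for the corrupt-record
            # column, so a new row is built even for good records.
            row = list(parsed) if error is None else [None] * (width - 1)
            row.insert(corrupt_idx, raw if error is not None else None)
            return row
    return convert


convert = make_result_row_fn(schema=["a", "b"])
good = ["1", "2"]
assert convert(good, "1,2", None) is good  # same object: no rebuild
```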
[jira] [Assigned] (SPARK-26451) Change lead/lag argument name from count to offset
[ https://issues.apache.org/jira/browse/SPARK-26451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26451: Assignee: (was: Apache Spark) > Change lead/lag argument name from count to offset > -- > > Key: SPARK-26451 > URL: https://issues.apache.org/jira/browse/SPARK-26451 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Deepyaman Datta >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26451) Change lead/lag argument name from count to offset
[ https://issues.apache.org/jira/browse/SPARK-26451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26451: Assignee: Apache Spark > Change lead/lag argument name from count to offset > -- > > Key: SPARK-26451 > URL: https://issues.apache.org/jira/browse/SPARK-26451 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Deepyaman Datta >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26450) Map of schema is built too frequently in some wide queries
Bruce Robbins created SPARK-26450: - Summary: Map of schema is built too frequently in some wide queries Key: SPARK-26450 URL: https://issues.apache.org/jira/browse/SPARK-26450 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Bruce Robbins When executing queries with wide projections and wide schemas, Spark rebuilds an attribute map for the same schema many times. For example: {noformat} select * from orctbl where id1 = 1 {noformat} Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq instantiation builds a map of the entire list of 6000 attributes (but not until lazy val exprIdToOrdinal is referenced). Whenever OrcFileFormat reads a new file, it generates a new unsafe projection. That results in this [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319] getting called: {code:java} protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = in.map(BindReferences.bindReference(_, inputSchema)) {code} For each column in the projection, this line calls bindReference. Each call passes inputSchema, a Sequence of Attributes, to a parameter position expecting an AttributeSeq. The compiler implicitly calls the constructor for AttributeSeq, which (lazily) builds a map for every attribute in the schema. Therefore, this function builds a map of the entire schema once for each column in the projection, and it does this for each input file. For the above example query, this accounts for 204K instantiations of AttributeSeq. Readers for CSV and JSON tables do something similar. In addition, ProjectExec also creates an unsafe projection for each task. 
As a result, this [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91] gets called, which has the same issue: {code:java} def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = { exprs.map(BindReferences.bindReference(_, inputSchema)) } {code} The above affects all wide queries that have a projection node, regardless of the file reader. For the example query, ProjectExec accounts for the additional 66K instantiations of the AttributeSeq. Spark can save time by pre-building the AttributeSeq right before the map operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size of schema, size of projection, number of input files (for Orc), number of file splits (for CSV, and JSON tables), and number of tasks. For a 6000 column CSV table with 500K records and 34 input files, the time savings is only 6%[1] because Spark doesn't create as many unsafe projections as compared to Orc tables. On the other hand, for a 6000 column Orc table with 500K records and 34 input files, the time savings is about 16%[1]. [1] based on queries run in local mode with 8 executor threads on my laptop. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
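The pre-building suggested in SPARK-26450 is an instance of a general hoisting pattern, sketched here with plain Python dictionaries (hypothetical helper names, not the Catalyst API): build the attribute-to-ordinal map once, right before the map operation, rather than implicitly once per bound column.

```python
def build_ordinal_map(schema):
    """Analogous to AttributeSeq's lazy exprIdToOrdinal: name -> position."""
    return {name: i for i, name in enumerate(schema)}

def bind_reference(column, ordinal_map):
    """Stand-in for BindReferences.bindReference resolving one column."""
    return ordinal_map[column]

def bind_all_slow(columns, schema):
    # One map rebuild per column: O(columns * schema), the reported problem,
    # because each call implicitly constructs a fresh AttributeSeq.
    return [bind_reference(c, build_ordinal_map(schema)) for c in columns]

def bind_all_fast(columns, schema):
    # Build the map once, right before the map operation, as proposed.
    ordinal_map = build_ordinal_map(schema)
    return [bind_reference(c, ordinal_map) for c in columns]


schema = ["col%d" % i for i in range(6000)]
assert bind_all_fast(["col0", "col5999"], schema) == [0, 5999]
```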
[jira] [Updated] (SPARK-26449) Dataframe.transform
[ https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hanan Shteingart updated SPARK-26449: - Shepherd: Maciej Szymkiewicz (was: Lazy Developer) > Dataframe.transform > --- > > Key: SPARK-26449 > URL: https://issues.apache.org/jira/browse/SPARK-26449 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hanan Shteingart >Priority: Minor > Labels: patch > Original Estimate: 24h > Remaining Estimate: 24h > > I would like to chain custom transformations as is suggested in this [blog > post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55] > This will allow to write something like the following: > > > {code:java} > > def with_greeting(df): > return df.withColumn("greeting", lit("hi")) > def with_something(df, something): > return df.withColumn("something", lit(something)) > data = [("jose", 1), ("li", 2), ("liz", 3)] > source_df = spark.createDataFrame(data, ["name", "age"]) > actual_df = (source_df > .transform(with_greeting) > .transform(lambda df: with_something(df, "crazy"))) > print(actual_df.show()) > ++---++-+ > |name|age|greeting|something| > ++---++-+ > |jose| 1| hi|crazy| > | li| 2| hi|crazy| > | liz| 3| hi|crazy| > ++---++-+ > {code} > The only thing needed to accomplish this is the following simple method for > DataFrame: > {code:java} > from pyspark.sql.dataframe import DataFrame > def transform(self, f): > return f(self) > DataFrame.transform = transform > {code} > I volunteer to do the pull request if approved (at least the python part) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26449) Dataframe.transform
[ https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hanan Shteingart updated SPARK-26449: - Shepherd: Lazy Developer (was: yc) > Dataframe.transform > --- > > Key: SPARK-26449 > URL: https://issues.apache.org/jira/browse/SPARK-26449 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.0 >Reporter: Hanan Shteingart >Priority: Minor > Labels: patch > Original Estimate: 24h > Remaining Estimate: 24h > > I would like to chain custom transformations as is suggested in this [blog > post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55] > This will allow to write something like the following: > > > {code:java} > > def with_greeting(df): > return df.withColumn("greeting", lit("hi")) > def with_something(df, something): > return df.withColumn("something", lit(something)) > data = [("jose", 1), ("li", 2), ("liz", 3)] > source_df = spark.createDataFrame(data, ["name", "age"]) > actual_df = (source_df > .transform(with_greeting) > .transform(lambda df: with_something(df, "crazy"))) > print(actual_df.show()) > ++---++-+ > |name|age|greeting|something| > ++---++-+ > |jose| 1| hi|crazy| > | li| 2| hi|crazy| > | liz| 3| hi|crazy| > ++---++-+ > {code} > The only thing needed to accomplish this is the following simple method for > DataFrame: > {code:java} > from pyspark.sql.dataframe import DataFrame > def transform(self, f): > return f(self) > DataFrame.transform = transform > {code} > I volunteer to do the pull request if approved (at least the python part) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26449) Dataframe.transform
Hanan Shteingart created SPARK-26449: Summary: Dataframe.transform Key: SPARK-26449 URL: https://issues.apache.org/jira/browse/SPARK-26449 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.0 Reporter: Hanan Shteingart I would like to chain custom transformations as is suggested in this [blog post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55] This will allow to write something like the following: {code:java} def with_greeting(df): return df.withColumn("greeting", lit("hi")) def with_something(df, something): return df.withColumn("something", lit(something)) data = [("jose", 1), ("li", 2), ("liz", 3)] source_df = spark.createDataFrame(data, ["name", "age"]) actual_df = (source_df .transform(with_greeting) .transform(lambda df: with_something(df, "crazy"))) print(actual_df.show()) ++---++-+ |name|age|greeting|something| ++---++-+ |jose| 1| hi|crazy| | li| 2| hi|crazy| | liz| 3| hi|crazy| ++---++-+ {code} The only thing needed to accomplish this is the following simple method for DataFrame: {code:java} from pyspark.sql.dataframe import DataFrame def transform(self, f): return f(self) DataFrame.transform = transform {code} I volunteer to do the pull request if approved (at least the python part) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
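The proposal above boils down to one line of self-application; stripped of Spark entirely, the same chaining works on any fluent class. A dependency-free sketch (hypothetical Frame class, not a Spark API) shows why a single transform method is enough to chain named functions and lambdas alike.

```python
class Frame:
    """Tiny stand-in for a DataFrame, just enough to show the chaining."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def with_column(self, name, value):
        return Frame({**self.columns, name: value})

    def transform(self, f):
        return f(self)  # the whole proposal: apply f to self


def with_greeting(df):
    return df.with_column("greeting", "hi")

def with_something(df, something):
    return df.with_column("something", something)

result = (Frame({"name": "jose"})
          .transform(with_greeting)
          .transform(lambda df: with_something(df, "crazy")))
print(result.columns)  # {'name': 'jose', 'greeting': 'hi', 'something': 'crazy'}
```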
[jira] [Commented] (SPARK-23959) UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-23959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729148#comment-16729148 ] Sam hendley commented on SPARK-23959: - I am working on upgrading a medium sized project from spark 2.0.2 to spark 2.3.0 and ran into this bug in a few of my unit tests. The reproduction case included in this ticket fails in my environment. Adding a `.cache()` to `zs` seems to fix the issue as expected. It looks like the code in question is working in my production environment when all of the input datasets are populated and loaded from parquet files. In my tests I was using createDataset() calls to store intermediate results. If I check for empty input data and call .cache() on the resulting frame it works in my unit tests. Do you have any guesses on what might be different in my environment that would make this fail? I tried changing hadoop versions (2.6.5 and 2.8.3) and spark versions (2.3.0 and 2.3.2) but was still able to reproduce this issue. Is there anything I can do to help you debug this issue? 
> UnresolvedException with DataSet created from Seq.empty since Spark 2.3.0 > - > > Key: SPARK-23959 > URL: https://issues.apache.org/jira/browse/SPARK-23959 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sam De Backer >Priority: Major > > The following snippet works fine in Spark 2.2.1 but gives a rather cryptic > runtime exception in Spark 2.3.0: > {code:java} > import sparkSession.implicits._ > import org.apache.spark.sql.functions._ > case class X(xid: Long, yid: Int) > case class Y(yid: Int, zid: Long) > case class Z(zid: Long, b: Boolean) > val xs = Seq(X(1L, 10)).toDS() > val ys = Seq(Y(10, 100L)).toDS() > val zs = Seq.empty[Z].toDS() > val j = xs > .join(ys, "yid") > .join(zs, Seq("zid"), "left") > .withColumn("BAM", when('b, "B").otherwise("NB")) > j.show(){code} > In Spark 2.2.1 it prints to the console > {noformat} > +---+---+---++---+ > |zid|yid|xid| b|BAM| > +---+---+---++---+ > |100| 10| 1|null| NB| > +---+---+---++---+{noformat} > In Spark 2.3.0 it results in: > {noformat} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > dataType on unresolved object, tree: 'BAM > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105) > at > org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435) > at > org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:435) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:296) > at org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:435) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:157) > ...{noformat} > The culprit really seems to be DataSet being created from an empty Seq[Z]. > When you change that to something that will also result in an empty > DataSet[Z] it works as in Spark 2.2.1, e.g. > {code:java} > val zs = Seq(Z(10L, true)).toDS().filter('zid < Long.MinValue){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26448) retain the difference between 0.0 and -0.0
[ https://issues.apache.org/jira/browse/SPARK-26448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26448: Assignee: Wenchen Fan (was: Apache Spark) > retain the difference between 0.0 and -0.0 > -- > > Key: SPARK-26448 > URL: https://issues.apache.org/jira/browse/SPARK-26448 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major
[jira] [Assigned] (SPARK-26448) retain the difference between 0.0 and -0.0
[ https://issues.apache.org/jira/browse/SPARK-26448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26448: Assignee: Apache Spark (was: Wenchen Fan) > retain the difference between 0.0 and -0.0 > -- > > Key: SPARK-26448 > URL: https://issues.apache.org/jira/browse/SPARK-26448 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major
[jira] [Created] (SPARK-26448) retain the difference between 0.0 and -0.0
Wenchen Fan created SPARK-26448: --- Summary: retain the difference between 0.0 and -0.0 Key: SPARK-26448 URL: https://issues.apache.org/jira/browse/SPARK-26448 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
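For context on why SPARK-26448 matters: under IEEE 754, 0.0 and -0.0 compare equal yet are distinct bit patterns, so a canonicalization step can silently lose the sign. A small, self-contained Python sketch (illustrative only, not Spark code) of the behavior being discussed:

```python
import math
import struct

# IEEE 754 doubles: 0.0 and -0.0 compare equal...
assert 0.0 == -0.0
# ...but they are distinct bit patterns (-0.0 has the sign bit set).
assert struct.pack(">d", 0.0) != struct.pack(">d", -0.0)
assert math.copysign(1.0, -0.0) == -1.0
assert str(-0.0) == "-0.0"

# The sign is observable downstream; e.g. atan2 distinguishes the two.
assert math.atan2(0.0, -0.0) == math.pi
assert math.atan2(0.0, 0.0) == 0.0

# Normalizing with `x + 0.0` collapses -0.0 into +0.0 (round-to-nearest),
# so a pipeline that canonicalizes values this way loses the distinction.
assert math.copysign(1.0, -0.0 + 0.0) == 1.0
```

This is the trade-off behind the issue title: equality semantics treat the two zeros as one value, while the stored representation (and some math functions) keep them apart.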
[jira] [Assigned] (SPARK-26447) Allow OrcColumnarBatchReader to return fewer partition columns
[ https://issues.apache.org/jira/browse/SPARK-26447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26447: Assignee: (was: Apache Spark) > Allow OrcColumnarBatchReader to return fewer partition columns > - > > Key: SPARK-26447 > URL: https://issues.apache.org/jira/browse/SPARK-26447 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Currently OrcColumnarBatchReader returns all the partition column values in > the batch read. > In data source V2, we can improve it by returning the required partition > column values only.
[jira] [Assigned] (SPARK-26447) Allow OrcColumnarBatchReader to return fewer partition columns
[ https://issues.apache.org/jira/browse/SPARK-26447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26447: Assignee: Apache Spark > Allow OrcColumnarBatchReader to return fewer partition columns > - > > Key: SPARK-26447 > URL: https://issues.apache.org/jira/browse/SPARK-26447 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Currently OrcColumnarBatchReader returns all the partition column values in > the batch read. > In data source V2, we can improve it by returning the required partition > column values only.
[jira] [Created] (SPARK-26447) Allow OrcColumnarBatchReader to return fewer partition columns
Gengliang Wang created SPARK-26447: -- Summary: Allow OrcColumnarBatchReader to return fewer partition columns Key: SPARK-26447 URL: https://issues.apache.org/jira/browse/SPARK-26447 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang Currently OrcColumnarBatchReader returns all the partition column values in the batch read. In data source V2, we can improve it by returning the required partition column values only.
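The improvement proposed in SPARK-26447 is essentially column pruning at the batch level: materialize only the partition columns the query asks for. A hypothetical Python sketch of the idea (the `prune_partition_columns` name and the dict-of-lists batch layout are illustrative, not Spark's actual reader API):

```python
from typing import Dict, List, Sequence

# A columnar batch modeled as: column name -> column values.
Batch = Dict[str, List]

def prune_partition_columns(batch: Batch, required: Sequence[str]) -> Batch:
    """Keep only the columns the query requested, preserving batch column order."""
    wanted = set(required)
    return {name: values for name, values in batch.items() if name in wanted}

# A batch that materialized every partition column (dt, country) plus data columns.
batch = {
    "id": [1, 2],
    "value": [10.0, 20.0],
    "dt": ["2018-12-26", "2018-12-26"],
    "country": ["US", "US"],
}

# The query only needs id and dt, so country is never copied into the result.
pruned = prune_partition_columns(batch, ["id", "dt"])
assert list(pruned) == ["id", "dt"]
```

In the real reader the saving is avoiding repeated materialization of constant partition values per batch, not just dropping dict entries, but the pruning contract is the same.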
[jira] [Assigned] (SPARK-26446) improve doc on ExecutorAllocationManager
[ https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26446: Assignee: Apache Spark > improve doc on ExecutorAllocationManager > > > Key: SPARK-26446 > URL: https://issues.apache.org/jira/browse/SPARK-26446 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Qingxin Wu >Assignee: Apache Spark >Priority: Minor
[jira] [Assigned] (SPARK-26446) improve doc on ExecutorAllocationManager
[ https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26446: Assignee: (was: Apache Spark) > improve doc on ExecutorAllocationManager > > > Key: SPARK-26446 > URL: https://issues.apache.org/jira/browse/SPARK-26446 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Qingxin Wu >Priority: Minor
[jira] [Updated] (SPARK-26446) improve doc on ExecutorAllocationManager
[ https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qingxin Wu updated SPARK-26446: --- Component/s: (was: Scheduler) Spark Core > improve doc on ExecutorAllocationManager > > > Key: SPARK-26446 > URL: https://issues.apache.org/jira/browse/SPARK-26446 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Qingxin Wu >Priority: Minor
[jira] [Created] (SPARK-26446) improve doc on ExecutorAllocationManager
Qingxin Wu created SPARK-26446: -- Summary: improve doc on ExecutorAllocationManager Key: SPARK-26446 URL: https://issues.apache.org/jira/browse/SPARK-26446 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.4.0 Reporter: Qingxin Wu
[jira] [Created] (SPARK-26445) Use ConfigEntry for hardcoded configs for driver/executor categories.
Takuya Ueshin created SPARK-26445: - Summary: Use ConfigEntry for hardcoded configs for driver/executor categories. Key: SPARK-26445 URL: https://issues.apache.org/jira/browse/SPARK-26445 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin Make hardcoded "spark.driver" and "spark.executor" configs use {{ConfigEntry}}.
[jira] [Commented] (SPARK-26445) Use ConfigEntry for hardcoded configs for driver/executor categories.
[ https://issues.apache.org/jira/browse/SPARK-26445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728974#comment-16728974 ] Takuya Ueshin commented on SPARK-26445: --- I'm working on this. > Use ConfigEntry for hardcoded configs for driver/executor categories. > - > > Key: SPARK-26445 > URL: https://issues.apache.org/jira/browse/SPARK-26445 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > Make hardcoded "spark.driver" and "spark.executor" configs use > {{ConfigEntry}}.
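The idea behind SPARK-26445 is to replace raw string keys scattered through the code with typed, documented config entries. A minimal Python sketch of the pattern (the `ConfigEntry` shape here is an illustration; Spark's real implementation lives in `org.apache.spark.internal.config` and is richer, with builders, units, and validation):

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass(frozen=True)
class ConfigEntry:
    """A typed, documented config key instead of a hardcoded string literal."""
    key: str
    default: Any
    doc: str = ""

    def read_from(self, conf: Dict[str, str]) -> Any:
        raw = conf.get(self.key)
        if raw is None:
            return self.default
        # Coerce to the default's type so callers get e.g. int, not str.
        return type(self.default)(raw) if self.default is not None else raw

# Entries defined once, referenced everywhere by name, never by raw string.
DRIVER_MEMORY = ConfigEntry("spark.driver.memory", "1g", "Heap size for the driver process.")
EXECUTOR_CORES = ConfigEntry("spark.executor.cores", 1, "Cores per executor.")

conf = {"spark.driver.memory": "4g", "spark.executor.cores": "8"}
assert DRIVER_MEMORY.read_from(conf) == "4g"
assert EXECUTOR_CORES.read_from(conf) == 8  # coerced to int
assert ConfigEntry("spark.executor.instances", 2).read_from(conf) == 2  # falls back to default
```

Centralizing keys this way makes defaults and documentation discoverable in one place and turns typos in config names into compile-time (or import-time) errors rather than silently ignored settings.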
[jira] [Assigned] (SPARK-26444) Stage color doesn't change with its status
[ https://issues.apache.org/jira/browse/SPARK-26444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26444: Assignee: Apache Spark > Stage color doesn't change with its status > --- > > Key: SPARK-26444 > URL: https://issues.apache.org/jira/browse/SPARK-26444 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Apache Spark >Priority: Major > Attachments: active.png, complete.png, failed.png > > > On job page, in event timeline section, stage color doesn't change according > to its status. See attachments for some screenshots.
[jira] [Assigned] (SPARK-26444) Stage color doesn't change with its status
[ https://issues.apache.org/jira/browse/SPARK-26444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26444: Assignee: (was: Apache Spark) > Stage color doesn't change with its status > --- > > Key: SPARK-26444 > URL: https://issues.apache.org/jira/browse/SPARK-26444 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > Attachments: active.png, complete.png, failed.png > > > On job page, in event timeline section, stage color doesn't change according > to its status. See attachments for some screenshots.
[jira] [Updated] (SPARK-26444) Stage color doesn't change with its status
[ https://issues.apache.org/jira/browse/SPARK-26444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-26444: - Attachment: failed.png complete.png active.png > Stage color doesn't change with its status > --- > > Key: SPARK-26444 > URL: https://issues.apache.org/jira/browse/SPARK-26444 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > Attachments: active.png, complete.png, failed.png > > > On job page, in event timeline section, stage color doesn't change according > to its status. See attachments for some screenshots.
[jira] [Updated] (SPARK-26444) Stage color doesn't change with its status
[ https://issues.apache.org/jira/browse/SPARK-26444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenxiao Mao updated SPARK-26444: - Description: On job page, in event timeline section, stage color doesn't change according to its status. See attachments for some screenshots. was: On job page, in event timeline section, stage color doesn't change according to its status. Below are some screenshots. active: !image-2018-12-26-16-14-38-958.png! complete: !image-2018-12-26-16-15-55-957.png! failed: !image-2018-12-26-16-16-11-697.png! > Stage color doesn't change with its status > --- > > Key: SPARK-26444 > URL: https://issues.apache.org/jira/browse/SPARK-26444 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Priority: Major > > On job page, in event timeline section, stage color doesn't change according > to its status. See attachments for some screenshots.
[jira] [Created] (SPARK-26444) Stage color doesn't change with its status
Chenxiao Mao created SPARK-26444: Summary: Stage color doesn't change with its status Key: SPARK-26444 URL: https://issues.apache.org/jira/browse/SPARK-26444 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.4.0 Reporter: Chenxiao Mao On job page, in event timeline section, stage color doesn't change according to its status. Below are some screenshots. active: !image-2018-12-26-16-14-38-958.png! complete: !image-2018-12-26-16-15-55-957.png! failed: !image-2018-12-26-16-16-11-697.png!