[jira] [Commented] (SPARK-15176) Job Scheduling Within Application Suffers from Priority Inversion
[ https://issues.apache.org/jira/browse/SPARK-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293009#comment-15293009 ]

Nick White commented on SPARK-15176:
------------------------------------

[~kayousterhout] [~irashid] We use Spark to serve interactive queries submitted by end-users. The data the queries run on is refreshed periodically, and there's a high IO cost to reading it (as it lives in S3). We're using the linked PR to support two pools: one serves user queries (and so always needs hardware resources available for responsiveness), and the other loads new data into memory as cached RDDs and performs some basic indexing. When the new data is fully cached it's swapped with the set of RDDs the "query" pool runs against, so users see no degradation of performance as their queries never hit uncached data. Under the existing scheduler implementation, we've seen tasks from the caching & indexing pool use up all the hardware resources; when a user query arrives, the query's tasks have to wait for indexing tasks to finish before they can start executing (at which point the fair scheduler ensures both the query and the indexing job make progress).

> Job Scheduling Within Application Suffers from Priority Inversion
> -----------------------------------------------------------------
>
>                 Key: SPARK-15176
>                 URL: https://issues.apache.org/jira/browse/SPARK-15176
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.6.1
>            Reporter: Nick White
>
> Say I have two pools, and N cores in my cluster:
> * I submit a job to one, which has M >> N tasks
> * N of the M tasks are scheduled
> * I submit a job to the second pool - but none of its tasks get scheduled
> until a task from the other pool finishes!
> This can lead to unbounded denial-of-service for the second pool - regardless
> of `minShare` or `weight` settings. Ideally Spark would support a pre-emption
> mechanism, or an upper bound on a pool's resource usage.
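The starvation described above can be sketched with a toy model of the scheduler (plain Python; the `Pool` class, the pool names and the `schedule` helper are hypothetical illustrations, not Spark's actual API). The key property is that running tasks are never preempted: once the first pool's M >> N tasks occupy all N cores, the second pool's tasks stay pending until a slot frees up, no matter how the pools are weighted.

```python
from collections import deque

N_CORES = 4

class Pool:
    """A scheduling pool with a queue of pending tasks (hypothetical sketch)."""
    def __init__(self, name):
        self.name = name
        self.pending = deque()

def schedule(pools, running, free_cores):
    """Hand out free cores round-robin across pools - but never preempt
    a running task, which is the behaviour the ticket describes."""
    while free_cores > 0 and any(p.pending for p in pools):
        for p in pools:
            if free_cores == 0:
                break
            if p.pending:
                running.append((p.name, p.pending.popleft()))
                free_cores -= 1
    return free_cores

# The "indexing" pool submits M >> N tasks first and grabs every core.
indexing = Pool("indexing")
indexing.pending.extend(range(100))
queries = Pool("queries")

running = []
free = schedule([indexing, queries], running, N_CORES)
assert all(name == "indexing" for name, _ in running)

# A user query now arrives - but with no free cores and no preemption,
# none of its tasks can start until an indexing task finishes.
queries.pending.extend(range(2))
free = schedule([indexing, queries], running, free)
waiting = len(queries.pending)
print(waiting)  # both query tasks are still pending
```

A preemption mechanism or a per-pool cap on concurrently running tasks, as the ticket suggests, would break this inversion by keeping (or freeing) slots for the latency-sensitive pool.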
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15176) Job Scheduling Within Application Suffers from Priority Inversion
Nick White created SPARK-15176:
----------------------------------

             Summary: Job Scheduling Within Application Suffers from Priority Inversion
                 Key: SPARK-15176
                 URL: https://issues.apache.org/jira/browse/SPARK-15176
             Project: Spark
          Issue Type: Bug
          Components: Scheduler
    Affects Versions: 1.6.1
            Reporter: Nick White

Say I have two pools, and N cores in my cluster:
* I submit a job to one, which has M >> N tasks
* N of the M tasks are scheduled
* I submit a job to the second pool - but none of its tasks get scheduled until a task from the other pool finishes!

This can lead to unbounded denial-of-service for the second pool - regardless of `minShare` or `weight` settings. Ideally Spark would support a pre-emption mechanism, or an upper bound on a pool's resource usage.
[jira] [Commented] (SPARK-14859) [PYSPARK] Make Lambda Serializer Configurable
[ https://issues.apache.org/jira/browse/SPARK-14859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254665#comment-15254665 ]

Nick White commented on SPARK-14859:
------------------------------------

I've got a PR for this here: https://github.com/apache/spark/pull/12620

> [PYSPARK] Make Lambda Serializer Configurable
> ---------------------------------------------
>
>                 Key: SPARK-14859
>                 URL: https://issues.apache.org/jira/browse/SPARK-14859
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Nick White
>
> Currently lambdas (e.g. used in RDD.map) are serialized by a hardcoded
> reference to the CloudPickleSerializer. The serializer should be
> configurable, as these lambdas may contain complex objects that need custom
> serialization.
[jira] [Created] (SPARK-14859) [PYSPARK] Make Lambda Serializer Configurable
Nick White created SPARK-14859:
----------------------------------

             Summary: [PYSPARK] Make Lambda Serializer Configurable
                 Key: SPARK-14859
                 URL: https://issues.apache.org/jira/browse/SPARK-14859
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.0.0
            Reporter: Nick White

Currently lambdas (e.g. used in RDD.map) are serialized by a hardcoded reference to the CloudPickleSerializer. The serializer should be configurable, as these lambdas may contain complex objects that need custom serialization.
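The change the ticket asks for can be sketched as a pluggable serializer passed in at context construction instead of a hard-coded reference. The sketch below uses only the standard library: `Serializer`, `PickleSerializer` and `Context` are hypothetical stand-ins for the PySpark classes, and plain `pickle` stands in for CloudPickleSerializer (so a module-level function is used rather than a true lambda, which plain pickle cannot serialize).

```python
import pickle

class Serializer:
    """Minimal serializer interface (hypothetical), mirroring the idea of
    pyspark.serializers: dumps/loads an object to bytes and back."""
    def dumps(self, obj):
        raise NotImplementedError
    def loads(self, data):
        raise NotImplementedError

class PickleSerializer(Serializer):
    # Stand-in for CloudPickleSerializer; any user-supplied Serializer
    # with the same interface could be dropped in here instead.
    def dumps(self, obj):
        return pickle.dumps(obj)
    def loads(self, data):
        return pickle.loads(data)

class Context:
    """Hypothetical context that takes the function serializer as a
    constructor parameter - the configurability the ticket proposes."""
    def __init__(self, func_serializer=None):
        self.func_serializer = func_serializer or PickleSerializer()

    def prepare_function(self, f):
        # Wherever the RDD machinery previously called the hard-coded
        # serializer directly, it now goes through the configured one.
        return self.func_serializer.dumps(f)

def double(x):  # module-level so plain pickle can serialize it by reference
    return 2 * x

ctx = Context()  # default serializer; a custom one could be passed instead
payload = ctx.prepare_function(double)
restored = ctx.func_serializer.loads(payload)
print(restored(21))  # -> 42
```

This keeps the default behaviour unchanged while letting callers substitute a serializer that understands whatever complex objects their closures capture.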