[jira] [Commented] (SPARK-23970) pyspark - simple filter/select doesn't use all tasks when coalesce is set
[ https://issues.apache.org/jira/browse/SPARK-23970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438188#comment-16438188 ]

Matthew Anthony commented on SPARK-23970:
------------------------------------------

Understood, but this is horribly inefficient. In the specific use case, the job ran for 18 hours without completing; I'm able to complete it in a fraction of that time without the coalesce. This may not be a bug, but it certainly is not an efficient execution pattern for most Spark queries.

On Fri, Apr 13, 2018, 6:27 PM Liang-Chi Hsieh (JIRA) wrote:

> pyspark - simple filter/select doesn't use all tasks when coalesce is set
> --------------------------------------------------------------------------
>
>                 Key: SPARK-23970
>                 URL: https://issues.apache.org/jira/browse/SPARK-23970
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.2.0, 2.2.1
>            Reporter: Matthew Anthony
>            Priority: Major
>
> Running in (py)spark 2.2.
> Marking this as PySpark, but I have not confirmed whether this is Spark-wide; I've observed it in pyspark, which is my preferred API.
> {code:java}
> df = spark.sql(
>     """
>     select ...
>     from ...
>     where ...
>     """
> )
> df.coalesce(32).write.parquet(...){code}
> The above code will only attempt to use 32 tasks to read and process all of the original input data. Compare this to
> {code:java}
> df = spark.sql(
>     """
>     select ...
>     from ...
>     where ...
>     """
> ).cache()
> df.count()
> df.coalesce(32).write.parquet(...){code}
> which will use the full complement of tasks available to the cluster to do the initial filter, with a subsequent shuffle to coalesce and write. The latter execution path is far more efficient, particularly at large volumes where filtering removes most records, and should be the default. Note that in the real setting in which I am running this, I'm operating a 20-node cluster with 16 cores and 56 GB of RAM per machine, and processing well over a TB of raw data. The scale of the task I am testing generates approximately 300,000 read tasks in the latter version of the code when not constrained by the former's execution plan.
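An explicit shuffle gets the same benefit without materializing the intermediate result via {{cache()}} + {{count()}}: {{repartition(32)}} inserts an exchange, so the scan/filter stage keeps its full parallelism and only the post-shuffle write runs with 32 tasks. A minimal sketch of that alternative (the table name, filter, and output path are placeholders, not taken from the report):

{code:java}
# Placeholder query; "events" and the predicate stand in for the real ones.
df = spark.sql(
    """
    select *
    from events
    where event_date = '2018-04-01'
    """
)

# repartition(32) adds a shuffle boundary: the read/filter stage keeps its
# natural task count, and only the stage after the shuffle uses 32 tasks.
df.repartition(32).write.parquet("/path/to/output")
{code}

The cost is a full shuffle of the filtered rows, which is usually far cheaper than collapsing the entire scan to 32 tasks when the filter is highly selective.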
[jira] [Updated] (SPARK-23970) pyspark - simple filter/select doesn't use all tasks when coalesce is set
[ https://issues.apache.org/jira/browse/SPARK-23970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Anthony updated SPARK-23970:
------------------------------------

Description: appended the note on test scale ("The scale of the task I am testing generates approximately 300,000 read tasks in the latter version of the code when not constrained by the former's execution plan."). The full, current description is quoted in the comment above.
[jira] [Updated] (SPARK-23970) pyspark - simple filter/select doesn't use all tasks when coalesce is set
[ https://issues.apache.org/jira/browse/SPARK-23970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Anthony updated SPARK-23970:
------------------------------------

Description: prepended "Running in (py)spark 2.2." The rest of the description is unchanged from the original report below.
[jira] [Created] (SPARK-23970) pyspark - simple filter/select doesn't use all tasks when coalesce is set
Matthew Anthony created SPARK-23970:
------------------------------------

             Summary: pyspark - simple filter/select doesn't use all tasks when coalesce is set
                 Key: SPARK-23970
                 URL: https://issues.apache.org/jira/browse/SPARK-23970
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.2.1, 2.2.0
            Reporter: Matthew Anthony

Marking this as PySpark, but I have not confirmed whether this is Spark-wide; I've observed it in pyspark, which is my preferred API. The original description matches the text quoted in the comment above, apart from the two later edits: a {{coalesce(32)}} after a simple filter/select causes the entire read to run in only 32 tasks, while forcing the filter to materialize first ({{cache()}} + {{count()}}) lets it use the cluster's full task capacity before the coalesced write.
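A quick way to confirm the behavior described in this issue is to compare the physical plans: {{coalesce}} adds no Exchange node, so the reduced partition count applies to the entire stage including the scan, whereas {{repartition}} introduces one. A sketch, assuming {{df}} is the filtered DataFrame from the report:

{code:java}
# No Exchange in this plan: the 32-partition constraint propagates upstream,
# so the scan and filter themselves run in only 32 tasks.
df.coalesce(32).explain()

# An Exchange appears here: upstream stages keep their own partitioning, and
# only the stage after the shuffle runs with 32 tasks.
df.repartition(32).explain()

# Partition count of the source DataFrame, for comparison (roughly one
# partition per input split).
print(df.rdd.getNumPartitions())
{code}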