[ https://issues.apache.org/jira/browse/SPARK-23970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438188#comment-16438188 ]
Matthew Anthony commented on SPARK-23970: ----------------------------------------- Understood but this isnhorribly inefficient. In the specific use case this job ran for 18 hours without completion. I'm able to complete without the coalesce in a fraction of that time. This May not be a bug but it certainly is not an efficient execution pattern for most spark queries. On Fri, Apr 13, 2018, 6:27 PM Liang-Chi Hsieh (JIRA) <j...@apache.org> > pyspark - simple filter/select doesn't use all tasks when coalesce is set > ------------------------------------------------------------------------- > > Key: SPARK-23970 > URL: https://issues.apache.org/jira/browse/SPARK-23970 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.2.0, 2.2.1 > Reporter: Matthew Anthony > Priority: Major > > Running in (py)spark 2.2. > Marking this as PySpark, but have not confirmed whether this is Spark-wide; > I've observed it in pyspark which is my preferred API. > {code:java} > df = spark.sql( > """ > select <redacted> > from <inputtbl> > where <conditions > """ > ) > df.coalesce(32).write.parquet(...){code} > The above code will only attempt to use 32 tasks to read and process all of > the original input data. This compares to > {code:java} > df = spark.sql( > """ > select <redacted> > from <inputtbl> > where <conditions > """ > ).cache() > df.count() > df.coalesce(32).write.parquet(...){code} > where this will use the full complement of tasks available to the cluster to > do the initial filter, with a subsequent shuffle to coalesce and write. The > latter execution path is way more efficient, particularly at large volumes > where filtering will remove most records and should be the default. Note that > in the real setting in which I am running this, I'm operating a 20 node > cluster with 16 cores and 56gb RAM per machine, and processing well over a TB > of raw data in <inputtbl>. The scale of the task I am testing on generates > approximately 300,000 read tasks in the latter version of the code when not > constrained by the former's execution plan. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org