[jira] [Commented] (SPARK-23970) pyspark - simple filter/select doesn't use all tasks when coalesce is set

2018-04-13 Thread Matthew Anthony (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438188#comment-16438188
 ] 

Matthew Anthony commented on SPARK-23970:
-

Understood, but this is horribly inefficient. In the specific use case, this
job ran for 18 hours without completing. I'm able to complete it without the
coalesce in a fraction of that time. This may not be a bug, but it certainly
is not an efficient execution pattern for most Spark queries.
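
A minimal sketch of the usual workaround, assuming the goal is simply 32
output files (the table, columns, and paths below are placeholders, not taken
from the original report): repartition(32) inserts a shuffle boundary, so the
scan and filter keep their full parallelism and only the write runs with 32
tasks.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table and column names, for illustration only.
df = spark.sql("""
    select id, value
    from events
    where value > 0
""")

# Unlike coalesce(32), repartition(32) adds a shuffle, so the scan/filter
# stage above still uses one task per input split; only the post-shuffle
# write stage is limited to 32 tasks (and 32 output files).
df.repartition(32).write.parquet("/path/to/output"){code}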

On Fri, Apr 13, 2018, 6:27 PM Liang-Chi Hsieh (JIRA) wrote:



> pyspark - simple filter/select doesn't use all tasks when coalesce is set
> -
>
> Key: SPARK-23970
> URL: https://issues.apache.org/jira/browse/SPARK-23970
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Matthew Anthony
>Priority: Major
>
> Running in (py)spark 2.2. 
> Marking this as PySpark, but have not confirmed whether this is Spark-wide; 
> I've observed it in pyspark, which is my preferred API.
> {code:java}
> df = spark.sql(
> """
> select 
> from 
> where  """
> )
> df.coalesce(32).write.parquet(...){code}
> The above code will only attempt to use 32 tasks to read and process all of 
> the original input data. This compares to 
> {code:java}
> df = spark.sql(
> """
> select 
> from 
> where  """
> ).cache()
> df.count()
> df.coalesce(32).write.parquet(...){code}
> where this will use the full complement of tasks available to the cluster to 
> do the initial filter, with a subsequent shuffle to coalesce and write. The 
> latter execution path is far more efficient, particularly at large volumes 
> where filtering will remove most records, and should be the default. Note that 
> in the real setting in which I am running this, I'm operating a 20-node 
> cluster with 16 cores and 56 GB RAM per machine, and processing well over a TB 
> of raw data in . The scale of the task I am testing on generates 
> approximately 300,000 read tasks in the latter version of the code when not 
> constrained by the former's execution plan.
>  
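
The behavior described above follows from coalesce being a narrow
transformation: without a shuffle, the requested 32 partitions propagate back
up the plan, so the scan and filter themselves run as only 32 tasks. A quick
way to observe this, sketched here with a placeholder input path, is to
compare partition counts and query plans:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder input path, for illustration only.
df = spark.read.parquet("/path/to/input")

print(df.rdd.getNumPartitions())                  # one partition per input split
print(df.coalesce(32).rdd.getNumPartitions())     # 32 (assuming >32 splits)
print(df.repartition(32).rdd.getNumPartitions())  # 32, but behind a shuffle

# coalesce shows no Exchange in the plan; repartition shows an Exchange
# (shuffle), which is what preserves parallelism in the upstream stage.
df.coalesce(32).explain()
df.repartition(32).explain(){code}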






[jira] [Updated] (SPARK-23970) pyspark - simple filter/select doesn't use all tasks when coalesce is set

2018-04-12 Thread Matthew Anthony (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Anthony updated SPARK-23970:

Description: 
Running in (py)spark 2.2. 

Marking this as PySpark, but have not confirmed whether this is Spark-wide; 
I've observed it in pyspark, which is my preferred API.
{code:java}
df = spark.sql(
"""
select 
from 
where  """
)
df.coalesce(32).write.parquet(...){code}
The above code will only attempt to use 32 tasks to read and process all of 
the original input data. This compares to 
{code:java}
df = spark.sql(
"""
select 
from 
where  """
).cache()
df.count()
df.coalesce(32).write.parquet(...){code}
where this will use the full complement of tasks available to the cluster to 
do the initial filter, with a subsequent shuffle to coalesce and write. The 
latter execution path is far more efficient, particularly at large volumes 
where filtering will remove most records, and should be the default. Note that 
in the real setting in which I am running this, I'm operating a 20-node 
cluster with 16 cores and 56 GB RAM per machine, and processing well over a TB 
of raw data in . The scale of the task I am testing on generates approximately 
300,000 read tasks in the latter version of the code when not constrained by 
the former's execution plan.

 

  was:
Running in (py)spark 2.2. 

Marking this as PySpark, but have not confirmed whether this is Spark-wide; 
I've observed it in pyspark, which is my preferred API.
{code:java}
df = spark.sql(
"""
select 
from 
where  """
)
df.coalesce(32).write.parquet(...){code}
The above code will only attempt to use 32 tasks to read and process all of 
the original input data. This compares to 
{code:java}
df = spark.sql(
"""
select 
from 
where  """
).cache()
df.count()
df.coalesce(32).write.parquet(...){code}
where this will use the full complement of tasks available to the cluster to 
do the initial filter, with a subsequent shuffle to coalesce and write. The 
latter execution path is far more efficient, particularly at large volumes 
where filtering will remove most records, and should be the default. Note that 
in the real setting in which I am running this, I'm operating a 20-node 
cluster with 16 cores and 56 GB RAM per machine, and processing well over a TB 
of raw data in .

 


> pyspark - simple filter/select doesn't use all tasks when coalesce is set
> -
>
> Key: SPARK-23970
> URL: https://issues.apache.org/jira/browse/SPARK-23970
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Matthew Anthony
>Priority: Major
>
> Running in (py)spark 2.2. 
> Marking this as PySpark, but have not confirmed whether this is Spark-wide; 
> I've observed it in pyspark, which is my preferred API.
> {code:java}
> df = spark.sql(
> """
> select 
> from 
> where  """
> )
> df.coalesce(32).write.parquet(...){code}
> The above code will only attempt to use 32 tasks to read and process all of 
> the original input data. This compares to 
> {code:java}
> df = spark.sql(
> """
> select 
> from 
> where  """
> ).cache()
> df.count()
> df.coalesce(32).write.parquet(...){code}
> where this will use the full complement of tasks available to the cluster to 
> do the initial filter, with a subsequent shuffle to coalesce and write. The 
> latter execution path is far more efficient, particularly at large volumes 
> where filtering will remove most records, and should be the default. Note that 
> in the real setting in which I am running this, I'm operating a 20-node 
> cluster with 16 cores and 56 GB RAM per machine, and processing well over a TB 
> of raw data in . The scale of the task I am testing on generates 
> approximately 300,000 read tasks in the latter version of the code when not 
> constrained by the former's execution plan.
>  






[jira] [Updated] (SPARK-23970) pyspark - simple filter/select doesn't use all tasks when coalesce is set

2018-04-12 Thread Matthew Anthony (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Anthony updated SPARK-23970:

Description: 
Running in (py)spark 2.2. 

Marking this as PySpark, but have not confirmed whether this is Spark-wide; 
I've observed it in pyspark, which is my preferred API.
{code:java}
df = spark.sql(
"""
select 
from 
where  """
)
df.coalesce(32).write.parquet(...){code}
The above code will only attempt to use 32 tasks to read and process all of 
the original input data. This compares to 
{code:java}
df = spark.sql(
"""
select 
from 
where  """
).cache()
df.count()
df.coalesce(32).write.parquet(...){code}
where this will use the full complement of tasks available to the cluster to 
do the initial filter, with a subsequent shuffle to coalesce and write. The 
latter execution path is far more efficient, particularly at large volumes 
where filtering will remove most records, and should be the default. Note that 
in the real setting in which I am running this, I'm operating a 20-node 
cluster with 16 cores and 56 GB RAM per machine, and processing well over a TB 
of raw data in .

 

  was:
 

Marking this as PySpark, but have not confirmed whether this is Spark-wide; 
I've observed it in pyspark, which is my preferred API.
{code:java}
df = spark.sql(
"""
select 
from 
where  """
)
df.coalesce(32).write.parquet(...){code}
The above code will only attempt to use 32 tasks to read and process all of 
the original input data. This compares to 
{code:java}
df = spark.sql(
"""
select 
from 
where  """
).cache()
df.count()
df.coalesce(32).write.parquet(...){code}
where this will use the full complement of tasks available to the cluster to 
do the initial filter, with a subsequent shuffle to coalesce and write. The 
latter execution path is far more efficient, particularly at large volumes 
where filtering will remove most records, and should be the default. Note that 
in the real setting in which I am running this, I'm operating a 20-node 
cluster with 16 cores and 56 GB RAM per machine, and processing well over a TB 
of raw data in .

 


> pyspark - simple filter/select doesn't use all tasks when coalesce is set
> -
>
> Key: SPARK-23970
> URL: https://issues.apache.org/jira/browse/SPARK-23970
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Matthew Anthony
>Priority: Major
>
> Running in (py)spark 2.2. 
> Marking this as PySpark, but have not confirmed whether this is Spark-wide; 
> I've observed it in pyspark, which is my preferred API.
> {code:java}
> df = spark.sql(
> """
> select 
> from 
> where  """
> )
> df.coalesce(32).write.parquet(...){code}
> The above code will only attempt to use 32 tasks to read and process all of 
> the original input data. This compares to 
> {code:java}
> df = spark.sql(
> """
> select 
> from 
> where  """
> ).cache()
> df.count()
> df.coalesce(32).write.parquet(...){code}
> where this will use the full complement of tasks available to the cluster to 
> do the initial filter, with a subsequent shuffle to coalesce and write. The 
> latter execution path is far more efficient, particularly at large volumes 
> where filtering will remove most records, and should be the default. Note that 
> in the real setting in which I am running this, I'm operating a 20-node 
> cluster with 16 cores and 56 GB RAM per machine, and processing well over a TB 
> of raw data in .
>  






[jira] [Created] (SPARK-23970) pyspark - simple filter/select doesn't use all tasks when coalesce is set

2018-04-12 Thread Matthew Anthony (JIRA)
Matthew Anthony created SPARK-23970:
---

 Summary: pyspark - simple filter/select doesn't use all tasks when 
coalesce is set
 Key: SPARK-23970
 URL: https://issues.apache.org/jira/browse/SPARK-23970
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.1, 2.2.0
Reporter: Matthew Anthony


 

Marking this as PySpark, but have not confirmed whether this is Spark-wide; 
I've observed it in pyspark, which is my preferred API.
{code:java}
df = spark.sql(
"""
select 
from 
where  """
)
df.coalesce(32).write.parquet(...){code}
The above code will only attempt to use 32 tasks to read and process all of 
the original input data. This compares to 
{code:java}
df = spark.sql(
"""
select 
from 
where  """
).cache()
df.count()
df.coalesce(32).write.parquet(...){code}
where this will use the full complement of tasks available to the cluster to 
do the initial filter, with a subsequent shuffle to coalesce and write. The 
latter execution path is far more efficient, particularly at large volumes 
where filtering will remove most records, and should be the default. Note that 
in the real setting in which I am running this, I'm operating a 20-node 
cluster with 16 cores and 56 GB RAM per machine, and processing well over a TB 
of raw data in .

 


