[ 
https://issues.apache.org/jira/browse/SPARK-22340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leif Walsh updated SPARK-22340:
-------------------------------
    Description: 
With PySpark, {{sc.setJobGroup}}'s documentation says

{quote}
Assigns a group ID to all the jobs started by this thread until the group ID is 
set to a different value or cleared.
{quote}

However, the group ID appears to be associated with Java threads, not Python 
threads.  As such, a Python thread that calls this and then submits multiple 
jobs doesn't necessarily get its jobs associated with any particular Spark job 
group.  For example:

{code}
import concurrent.futures

def run_jobs():
    # Set the job group on this Python thread, then run two actions
    # that we expect to belong to that group.
    sc.setJobGroup('hello', 'hello jobs')
    x = sc.range(100).sum()
    y = sc.range(1000).sum()
    return x, y

with concurrent.futures.ThreadPoolExecutor() as executor:
    future = executor.submit(run_jobs)
    # Cancelling the group from the main thread should kill the jobs
    # started by run_jobs, but only if they were actually assigned to it.
    sc.cancelJobGroup('hello')
    future.result()
{code}

In this example, depending on how the action calls on the Python side are 
allocated to Java threads, the jobs for {{x}} and {{y}} won't necessarily be 
assigned the job group {{hello}}.
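
You can observe the mismatch directly by reading the Java-side thread-local 
property back. A rough diagnostic (this peeks at internals: {{sc._jsc}} is 
PySpark's private handle to the JavaSparkContext, and {{spark.jobGroup.id}} is 
the property key Spark uses internally for job groups):

{code}
import threading

def show_job_group():
    # setJobGroup only sets a thread-local property on whichever Java
    # thread happens to service this py4j call.
    sc.setJobGroup('hello', 'hello jobs')
    print(sc._jsc.getLocalProperty('spark.jobGroup.id'))

t = threading.Thread(target=show_job_group)
t.start()
t.join()

# Reading the property back from the main Python thread may print None,
# or even 'hello' if py4j reused the same Java thread for both calls.
print(sc._jsc.getLocalProperty('spark.jobGroup.id'))
{code}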

First, if this is indeed the behavior, we should clarify the docs.

Second, it would be really helpful if we could make job group assignment 
reliable for a Python thread, though I'm not sure of the best way to do this.  
As it stands, job groups are pretty much useless from the PySpark side if we 
can't rely on this behavior.

My only idea so far is to mimic thread-local storage (TLS) behavior on the 
Python side and then patch every point where job submission may take place to 
pass the group along (a rough sketch follows below), but this feels pretty 
brittle. In my experience with py4j, controlling threading there is a 
challenge.
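
For illustration, a minimal sketch of that idea ({{set_job_group}} and 
{{run_action}} are hypothetical helper names, not part of the PySpark API):

{code}
import threading

# Hypothetical sketch, not PySpark API: keep the desired group in Python
# thread-local storage and re-assert it on the Java side immediately
# before every action, so the group follows the Python thread rather than
# whichever Java thread py4j happens to use.
_job_group = threading.local()

def set_job_group(sc, group_id, description):
    _job_group.value = (group_id, description)

def run_action(sc, action):
    # Re-apply the group right before submitting the job. This narrows
    # the race window but is not airtight: py4j may still hand the
    # setJobGroup call and the job submission to different Java threads.
    group = getattr(_job_group, 'value', None)
    if group is not None:
        sc.setJobGroup(*group)
    return action()

# Usage:
# set_job_group(sc, 'hello', 'hello jobs')
# x = run_action(sc, lambda: sc.range(100).sum())
{code}

Patching every PySpark entry point that can trigger a job to route through 
something like {{run_action}} is the brittle part.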

> pyspark setJobGroup doesn't match java threads
> ----------------------------------------------
>
>                 Key: SPARK-22340
>                 URL: https://issues.apache.org/jira/browse/SPARK-22340
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.2
>            Reporter: Leif Walsh
>


