[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR

Yuriy (Jira) Tue, 23 Feb 2021 12:41:04 -0800


     [ https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Yuriy updated SPARK-34510:
--------------------------
    Description: 
I'm running PySpark 3.0.0 on EMR with the project structure below. process.py 
controls the flow of the application and calls code inside the 
file_processor package.
{code:java}
process.py
file_processor
  config        
    spark.py
  repository        
    s3_repo.py
  structure        
    table_creator.py

{code}
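For anyone who wants to recreate this layout locally without the attachment, here is a minimal sketch that builds an empty skeleton of the tree above in a temporary directory. The file and directory names come from the report. The `__init__.py` files are an assumption (the report does not say whether the directories are regular packages), and `LAYOUT`, `build`, and `list_files` are hypothetical helper names used only for illustration.

```python
import tempfile
from pathlib import Path

# Tree taken from the report: None marks a file, a dict marks a directory.
LAYOUT = {
    "process.py": None,
    "file_processor": {
        "config": {"spark.py": None},
        "repository": {"s3_repo.py": None},
        "structure": {"table_creator.py": None},
    },
}


def build(root: Path, tree: dict, is_package: bool = False) -> None:
    """Create empty files and directories under root that mirror tree."""
    root.mkdir(parents=True, exist_ok=True)
    if is_package:
        # Assumption: every directory below the project root is a regular package.
        (root / "__init__.py").touch()
    for name, child in tree.items():
        if child is None:
            (root / name).touch()
        else:
            build(root / name, child, is_package=True)


def list_files(root: Path) -> list:
    """Return the relative POSIX paths of all files under root, sorted."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build(Path(tmp), LAYOUT)
        for rel_path in list_files(Path(tmp)):
            print(rel_path)
```

The skeleton contains only empty files, so the module bodies shown in this report still have to be pasted in before anything can be run.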
The job hangs when the .foreachPartition call lives in s3_repo.py and is 
invoked from process.py. When the same .foreachPartition code is moved out of 
s3_repo.py and into process.py, it runs fine.

 

process.py
{code:python}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
    table_creator.create_table()
    s3_repo.save_to_s3()

if __name__ == '__main__':
    process()
{code}
spark.py
{code:python}
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
s3_repo.py

 

 

  was: I provided a full description of the issue on Stack Overflow at the 
following link: https://stackoverflow.com/questions/66300313


> .foreachPartition command hangs when ran inside Python package but works when 
> ran from Python file outside the package on EMR
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-34510
>                 URL: https://issues.apache.org/jira/browse/SPARK-34510
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2, PySpark
>    Affects Versions: 3.0.0
>            Reporter: Yuriy
>            Priority: Minor
>         Attachments: Code.zip



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
