[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944564#comment-16944564 ]

Furcy Pin commented on SPARK-13587:
-----------------------------------

Hello,

I don't know where to ask this, but we have been using this feature on 
HDInsight 2.6.5 and we sometimes hit a concurrency issue with pip.
 On rare occasions, several executors set up the virtualenv 
simultaneously, which ends in a kind of deadlock.

When the pip install command used by the executor is run manually, it 
hangs, and when cancelled it throws this error:
{code:java}
File "/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_XXX/container_XXX/virtualenv_application_XXX/lib/python3.5/site-packages/pip/_vendor/lockfile/linklockfile.py", line 31, in acquire
    os.link(self.unique_name, self.lock_file)
FileExistsError: [Errno 17] File exists: '/home/yarn/XXXXXXXX-XXXXXXXX' -> '/home/yarn/selfcheck.json.lock'{code}
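For what it's worth, here is a minimal sketch (my own, not pip's actual code) of the link-based locking scheme behind the os.link call in the traceback above. It shows why a holder that dies mid-install leaves every later acquire failing:
{code:python}
import os

def try_acquire(unique_name, lock_file):
    """Return True if we took the lock, False if someone else holds it."""
    open(unique_name, "w").close()          # per-process unique file
    try:
        os.link(unique_name, lock_file)     # atomic: fails if lock_file exists
        return True
    except FileExistsError:                 # the error from the traceback
        return False
    finally:
        os.unlink(unique_name)              # the lock is lock_file's existence
{code}
If the process that succeeded crashes before releasing, lock_file is never unlinked and every subsequent try_acquire returns False forever, which matches the hang we see.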
This happens with "spark.pyspark.virtualenv.type=native". 
We haven't tried with conda yet.
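For context, this is roughly how we enable the feature. A hedged sketch only: the property names are the ones proposed in this JIRA, and the paths are placeholders for our actual ones:
{code:python}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.pyspark.virtualenv.enabled", "true")
        .set("spark.pyspark.virtualenv.type", "native")  # the mode where we see the hang
        .set("spark.pyspark.virtualenv.requirements", "/path/to/requirements.txt")
        .set("spark.pyspark.virtualenv.bin.path", "/usr/bin/virtualenv"))
sc = SparkContext(conf=conf)
{code}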

It is pretty bad because when it happens:
 - some executors of the Spark job get stuck, and so the whole Spark job gets stuck
 - even if the job is restarted, the lock file stays there and makes the 
whole YARN host unusable (see the cleanup sketch below)
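If it helps anyone, a sketch of the manual cleanup that unblocks the host, assuming the stale locks live under /home/yarn as in the traceback above (only safe to run when no pip process is still installing):
{code:python}
import os

LOCK_DIR = "/home/yarn"
for name in os.listdir(LOCK_DIR):
    if name.endswith(".lock"):                    # e.g. selfcheck.json.lock
        os.remove(os.path.join(LOCK_DIR, name))
        print("removed stale lock:", name)
{code}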

Any suggestion or workaround would be appreciated.
 One idea would be to remove the "--cache-dir /home/yarn" option that is 
currently passed to the pip install command, but it does not seem to be 
configurable right now.
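To illustrate the idea: a container-local cache dir would avoid the shared lock entirely. The actual command is built inside Spark and is not configurable today, so the pip path and requirements file below are placeholders:
{code:python}
import os
import subprocess

cache_dir = os.path.join(os.getcwd(), "pip-cache")  # container-local, not shared
subprocess.check_call([
    "./virtualenv_app/bin/pip", "install",
    "--cache-dir", cache_dir,                       # instead of /home/yarn
    "-r", "requirements.txt",
])
{code}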

> Support virtualenv in PySpark
> -----------------------------
>
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>            Reporter: Jeff Zhang
>            Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy to switch between environments)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to the distributed environment.


