UDF issues with Spark
Using the pyspark CLI on Spark 2.1.1, I'm getting out-of-memory errors when running a UDF over a recordset count of 10, with the UDF mapping every row to the same value (arbitrary, for testing purposes). This is on Amazon EMR release label 5.6.0 with the following hardware specs: m4.4xlarge, 32 vCPU, 64 GiB memory, EBS-only storage, 100 GiB EBS storage. Help?
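For reference, a minimal sketch of the kind of job described above. The actual UDF isn't shown in the post, so the constant mapping and the app name below are hypothetical stand-ins:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-oom-repro").getOrCreate()

# A 10-row DataFrame, matching the recordset count of 10 described above.
df = spark.range(10)

# UDF that ignores its input and returns the same arbitrary test value.
same_value = udf(lambda _: "same-value", StringType())

df.withColumn("mapped", same_value(df["id"])).show()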
Best way of shipping self-contained pyspark jobs with 3rd-party dependencies
Hi PySparkers,

What is currently the best way of shipping self-contained PySpark jobs with third-party dependencies? There are some open JIRA issues [1], [2], corresponding PRs [3], [4], and articles [5], [6], [7] about setting up the Python environment with conda and virtualenv respectively. I believe [7] is a misleading article, because it relies on unsupported Spark options such as spark.pyspark.virtualenv.enabled and spark.pyspark.virtualenv.requirements.

So I'm wondering what the community does in cases when it's necessary to:
- prevent Python package/module version conflicts between different jobs
- avoid updating all the nodes of the cluster when a job introduces new dependencies
- track which dependencies are introduced on a per-job basis

(A sketch of one common workaround follows the references.)

[1] https://issues.apache.org/jira/browse/SPARK-13587
[2] https://issues.apache.org/jira/browse/SPARK-16367
[3] https://github.com/apache/spark/pull/13599
[4] https://github.com/apache/spark/pull/14180
[5] https://www.anaconda.com/blog/developer-blog/conda-spark
[6] http://henning.kropponline.de/2016/09/17/running-pyspark-with-virtualenv
[7] https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html
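A minimal sketch of one such workaround, for pure-Python dependencies only: bundle the packages into a zip ahead of time (e.g. pip install -t deps -r requirements.txt, then zip the deps directory) and ship the archive with the job. The file name deps.zip and the app name are placeholders:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("self-contained-job")
sc = SparkContext(conf=conf)

# Distribute the archive to every executor and add it to their PYTHONPATH.
sc.addPyFile("deps.zip")

# Imports of the bundled packages must happen after addPyFile(), so that
# executors resolve them from the shipped archive rather than from the
# cluster nodes' system Python.

This keeps dependencies job-scoped and trackable, but it does not cover packages with native extensions; those are exactly the cases the conda/virtualenv proposals in [1]-[4] aim to handle.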
Re: Programmatically get status of job (WAITING/RUNNING)
Qiao, Richard wrote:
> Comparing #1 and #3, my understanding of "submitted" is "the jar is
> submitted to executors". With this concept, you may define your own
> status.

In SparkLauncher, SUBMITTED means that the driver was able to acquire cores from the Spark cluster and the Launcher is waiting for the driver to connect back. Once it connects back, the driver's state changes to CONNECTED. As Marcelo mentioned, the Launcher can only tell me about the driver's state; it cannot be used to infer the state of the application (executors). For the state of the executors we can use a SparkListener, so with the combination of Launcher + Listener I have a solution. As you mentioned, the state changes to RUNNING even if only one executor is allocated to the application, so in my application I change the status of my job to RUNNING only if I receive RUNNING from the Launcher and an onExecutorAdded event from the SparkListener.
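A rough Java sketch of this Launcher + Listener combination (the jar path, main class, and class names are illustrative placeholders; SparkAppHandle.Listener and SparkListener.onExecutorAdded are the actual hooks involved):

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerExecutorAdded;

// Launcher side: runs outside the cluster and observes driver states
// (SUBMITTED, CONNECTED, RUNNING, ...).
public class JobStatusTracker {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
            .setAppResource("job.jar")          // placeholder application jar
            .setMainClass("com.example.Main")   // placeholder main class
            .startApplication(new SparkAppHandle.Listener() {
                @Override
                public void stateChanged(SparkAppHandle h) {
                    // RUNNING here only means the driver is up; executor
                    // allocation is tracked by the listener below.
                    System.out.println("Driver state: " + h.getState());
                }
                @Override
                public void infoChanged(SparkAppHandle h) { }
            });
    }
}

// Listener side: registered inside the driver, e.g. via
// sparkContext.addSparkListener(new ExecutorAddedListener()).
class ExecutorAddedListener extends SparkListener {
    @Override
    public void onExecutorAdded(SparkListenerExecutorAdded added) {
        // Consider the job RUNNING only once at least one executor has
        // registered, matching the logic described above.
        System.out.println("Executor added: " + added.executorId());
    }
}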
Re: Programmatically get status of job (WAITING/RUNNING)
Qiao, Richard wrote:
> For your question of example, the answer is yes.

Perfect. I am assuming that this holds for Spark standalone, YARN, and Mesos alike.