Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
Hi Marcelo, I see what you mean. Tried it but still got the same error message. Error from the Python worker: > Traceback (most recent call last): > File "/usr/local/Python-3.6.4/lib/python3.6/runpy.py", line 183, in > _run_module_as_main > mod_name, mod_spec, code =

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
Thanks Marcelo, but I don't want to install 2.3.2 on the worker nodes. I just want Spark to use the path of the files uploaded to YARN instead of the SPARK_HOME. On Fri, Oct 5, 2018 at 1:25 AM Marcelo Vanzin wrote: > Try "spark.executorEnv.SPARK_HOME=$PWD" (in quotes so it does not get >

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
Yes, that's right. On Fri, Oct 5, 2018 at 3:35 AM Gourav Sengupta wrote: > Hi Marcelo, > it will be great if you illustrate what you mean, I will be interested to > know. > > Hi Jianshi, > so just to be sure you want to work on SPARK 2.3 while having SPARK 2.1 > installed in your cluster? > >

Where is the DAG stored before catalyst gets it?

2018-10-04 Thread Jean Georges Perrin
Hi, I am assuming it is still in the master and when Catalyst is finished it sends the tasks to the workers. Correct? tia, jg

Re: PySpark structured streaming job throws socket exception

2018-10-04 Thread mmuru
Thanks Ryan. Attached the whole stack trace. Let me know if you need more information. pyspark-driver-pod-exception.txt

Re: PySpark structured streaming job throws socket exception

2018-10-04 Thread Shixiong(Ryan) Zhu
As far as I know, the error log in updateAccumulators will not fail a Spark task. Did you see other error messages? Best Regards, Ryan On Thu, Oct 4, 2018 at 2:14 PM mmuru wrote: > Hi, > > Running Pyspark structured streaming job on K8S with 2 executor pods. The > driver pod failed with the

PySpark structured streaming job throws socket exception

2018-10-04 Thread mmuru
Hi, Running a PySpark structured streaming job on K8S with 2 executor pods. The driver pod failed with the following exception. It fails consistently after 3 to 6 hours of running. Any idea how to fix this exception? I really appreciate your help. 2018-10-04 18:48:27 ERROR DAGScheduler:91 -

Re: Pyspark Partitioning

2018-10-04 Thread Vitaliy Pisarev
GroupBy is an operator you would use if you wanted to *aggregate* the values that are grouped by the specified key. In your case you want to retain access to the values. You need to do df.partitionBy and then you can map the partitions. Of course you need to be careful of potential skews in the
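A minimal PySpark sketch of this idea (not from the thread; the column names come from the original question, and the per-partition processing is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-example").getOrCreate()

    # Toy dataframe with the columns from the question.
    data_df = spark.createDataFrame(
        [(1, "obj1", "Traj1"), (2, "obj2", "Traj2"), (1, "obj3", "Traj3")],
        ["Group_Id", "Object_Id", "Trajectory"],
    )

    # Repartition by the key so rows with the same Group_Id land together,
    # then process each partition with a plain Python function.
    def per_partition(rows):
        for row in rows:  # rows is an iterator over one partition
            yield (row.Group_Id, row.Trajectory)

    result = (
        data_df.repartition("Group_Id")
               .rdd.mapPartitions(per_partition)
               .collect()
    )

    # Alternatively, a pair RDD supports partitionBy directly
    # (the number of partitions, 3, is arbitrary here).
    pair_rdd = data_df.rdd.map(lambda r: (r["Group_Id"], r)).partitionBy(3)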

Pyspark Partitioning

2018-10-04 Thread dimitris plakas
Hello everyone, here is an issue that I am facing in partitioning a dataframe. I have a dataframe called data_df. It looks like:
Group_Id | Object_Id | Trajectory
1        | obj1      | Traj1
2        | obj2      | Traj2
1        | obj3      | Traj3
3        |

Spark 2.3.1 leaves _temporary dir back on s3 even after write to s3 is done.

2018-10-04 Thread sushil.chaudhary
Folks, we recently upgraded to 2.3.1 and we started seeing that the Spark jobs leave a _temporary directory in S3 even though the write to S3 has already finished. It does not clean up the temporary directory. Hadoop version is 2.8. Is there a way to control it?
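The thread does not record a confirmed fix, but one setting commonly checked in this situation is the file output committer algorithm version. A hedged sketch (an assumption, not a verified solution for this report):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        # With algorithm version 2, task output is moved to its final
        # destination at task commit rather than at job commit, which changes
        # what remains under _temporary if a job is interrupted. It does not
        # by itself guarantee the directory is removed on S3.
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .getOrCreate()
    )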

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Gourav Sengupta
Hi Marcelo, it will be great if you illustrate what you mean; I will be interested to know. Hi Jianshi, so just to be sure, you want to work on SPARK 2.3 while having SPARK 2.1 installed in your cluster? Regards, Gourav Sengupta On Thu, Oct 4, 2018 at 6:26 PM Marcelo Vanzin wrote: > Try

subscribe

2018-10-04 Thread Sushil Chaudhary
-- Regards, Sushil

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Marcelo Vanzin
Try "spark.executorEnv.SPARK_HOME=$PWD" (in quotes so it does not get expanded by the shell). But it's really weird to be setting SPARK_HOME in the environment of your node managers. YARN shouldn't need to know about that. On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang wrote: > >

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d78f5b524128/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala#L31 The code shows that Spark will try to find the path if SPARK_HOME is specified. And on my worker node, SPARK_HOME is specified in .bashrc, for the

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Marcelo Vanzin
Normally the version of Spark installed on the cluster does not matter, since Spark is uploaded from your gateway machine to YARN by default. You probably have some configuration (in spark-defaults.conf) that tells YARN to use a cached copy. Get rid of that configuration, and you can use whatever
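A small sketch (an assumption, not taken from the thread) of how to check whether such a cached copy is configured: spark.yarn.archive and spark.yarn.jars are the standard properties that point YARN at a pre-staged Spark distribution.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # If neither property is set, spark-submit packages and uploads the local
    # Spark libraries to YARN instead of using a cached copy on the cluster.
    for key in ("spark.yarn.archive", "spark.yarn.jars"):
        print(key, "=", spark.conf.get(key, "<not set>"))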

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-04 Thread zakhavan
Thank you. It helps. Zeinab

Re: java.lang.IllegalArgumentException: requirement failed: BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes

2018-10-04 Thread hager
Please, I have the same problem. Have you found any solution?

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Apostolos N. Papadopoulos
Maybe this can help. https://stackoverflow.com/questions/32959723/set-python-path-for-spark-worker On 04/10/2018 12:19 μμ, Jianshi Huang wrote: Hi, I have a problem using multiple versions of Pyspark on YARN, the driver and worker nodes are all preinstalled with Spark 2.2.1, for

Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
Hi, I have a problem using multiple versions of PySpark on YARN. The driver and worker nodes are all preinstalled with Spark 2.2.1 for production tasks, and I want to use 2.3.2 for my personal EDA. I've tried both the 'pyFiles=' option and sparkContext.addPyFiles(), however on the worker node, the
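A minimal sketch of the approach being attempted here (the file paths and the py4j archive name are illustrative, not taken from the thread):

    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setMaster("yarn")
        .setAppName("pyspark-2.3.2-eda")
        # Ship a newer PySpark and py4j to the executors as .zip archives;
        # the paths below are hypothetical.
        .set("spark.submit.pyFiles",
             "/home/me/spark-2.3.2/python/lib/pyspark.zip,"
             "/home/me/spark-2.3.2/python/lib/py4j-0.10.7-src.zip")
    )
    sc = SparkContext(conf=conf)

    # Per-job Python dependencies can also be added after the context exists.
    sc.addPyFile("/home/me/my_module.py")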

Re: Use SparkContext in Web Application

2018-10-04 Thread Girish Vasmatkar
Thank you Vincent and Jörn for your inputs, much appreciated. Our web app already has a scheduler mechanism and other jobs are already running in the system. Would you still prefer to decouple model training into a separate scheduling tool outside of our web-app JVM? We are using test data for now

Re: Use SparkContext in Web Application

2018-10-04 Thread vincent gromakowski
Decoupling the web app from the Spark backend is recommended. Training the model can be launched in the background via a scheduling tool. Inferring with the model via Spark in interactive mode is not a good option, as it will do it for unitary data and Spark is better at working on large datasets. The original

Re: Use SparkContext in Web Application

2018-10-04 Thread Jörn Franke
Depending on your model size you can store it as PFA or PMML and run the prediction in Java. For larger models you will need a custom solution, potentially using a Spark Thrift Server / Spark Job Server / Livy and a cache to store predictions that have been already calculated (e.g. based on previous