Attempting to avoid a shuffle on join

2019-07-03 Thread Mkal
Please keep in mind I'm fairly new to Spark. I have some Spark code where I load two text files as datasets and, after some map and filter operations to bring the columns into a specific shape, I join the datasets. The join takes place on a common column (of type string). Is there any way to avoid
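
For reference, the usual way to avoid the shuffle in this situation is a broadcast join, which applies when one side is small enough to replicate to every executor. A minimal sketch in Scala; the file names, the comma-delimited layout, and the column names are assumptions for illustration, not details from the thread:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("BroadcastJoinSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical inputs: each line is "key,rest"; extract the join key.
    val big = spark.read.textFile("big.txt")
      .map(line => (line.split(",")(0), line))
      .toDF("key", "bigLine")
    val small = spark.read.textFile("small.txt")
      .map(line => (line.split(",")(0), line))
      .toDF("key", "smallLine")

    // broadcast() hints Spark to ship the small side to every executor,
    // so the big side is joined in place and its rows are never shuffled.
    val joined = big.join(broadcast(small), "key")

Note that Spark already broadcasts a side automatically when its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default); the explicit hint is for cases where the estimate is wrong or the threshold is too low.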

Re: Spark on yarn - application hangs

2019-05-10 Thread Mkal
How can I check what exactly is stagnant? Do you mean on the DAG visualization in the Spark UI? Sorry, I'm new to Spark.

Spark on yarn - application hangs

2019-05-10 Thread Mkal
I've built a Spark job in which an external program is called through the use of pipe(). The job runs correctly on the cluster when the input is a small sample dataset, but when the input is a real, large dataset it stays in the RUNNING state forever. I've tried different ways to tune executor memory, executor
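
One point worth keeping in mind when tuning a pipe()-heavy job: the external program runs outside the executor JVM, so its memory is covered by the overhead setting, not the executor heap. A hedged sketch of the relevant settings, with purely illustrative values (these can equally be passed as spark-submit flags; setting them in code only takes effect if done before the SparkContext starts):

    import org.apache.spark.sql.SparkSession

    // Executor sizing for a pipe()-heavy job. The piped external process
    // lives outside the JVM heap; spark.executor.memoryOverhead is what
    // reserves container room for it on YARN.
    val spark = SparkSession.builder()
      .appName("PipeJob") // hypothetical app name
      .config("spark.executor.memory", "4g")         // JVM heap per executor
      .config("spark.executor.cores", "2")
      .config("spark.executor.memoryOverhead", "2g") // off-heap room for the subprocess
      .getOrCreate()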

C++ script on Spark Cluster throws exit status 132

2019-03-05 Thread Mkal
I'm trying to run a C++ program on a Spark cluster by using the rdd.pipe() operation, but the executors throw: java.lang.IllegalStateException: Subprocess exited with status 132. The Spark jar runs totally fine on standalone, and the C++ program runs just fine on its own as well. I've tried with
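
Background for readers decoding the status (not from the thread itself): by POSIX convention, an exit status above 128 means the subprocess was killed by a signal, encoded as 128 + the signal number. Status 132 is therefore SIGILL (illegal instruction), which commonly points at a C++ binary compiled for CPU features (e.g. with -march=native) that the worker nodes' CPUs lack. A trivial Scala illustration of the arithmetic:

    // POSIX: exit status > 128 encodes death by signal, as 128 + signal number.
    val status = 132
    val signal = status - 128
    println(s"status $status => signal $signal") // 4 == SIGILL (illegal instruction)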

Rdd pipe Subprocess exit code

2019-01-18 Thread Mkal
When using rdd.pipe(script), I get the following error: "java.lang.IllegalStateException: Subprocess exited with status 132. Command ran: "./script -h"". I'm getting this while trying to run my external script with a simple "-h" argument, to test that it's running smoothly through my Spark code.
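
As a side note, pipe() also has an overload that takes the command as a token list, which avoids shell word-splitting surprises when passing arguments such as "-h". A minimal sketch, assuming ./script is already present and executable in each executor's working directory (e.g. shipped with spark-submit --files):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("PipeArgsSketch").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq("one line", "another line"))

    // The Seq overload passes each token to the subprocess as-is,
    // so "-h" arrives as a single clean argument.
    val out = rdd.pipe(Seq("./script", "-h"))
    out.collect().foreach(println)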

Re: Question about RDD pipe

2019-01-18 Thread Mkal
Thanks a lot for the answer! It solved my problem.

Question about RDD pipe

2019-01-17 Thread Mkal
Hi, I'm trying to run an external script on Spark using rdd.pipe(), and although it runs successfully on standalone, it throws an error on cluster. The error comes from the executors, and it's: "Cannot run program "path/to/program": error=2, No such file or directory". Does the external script need
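
For anyone landing here with the same error: the script has to exist on every executor node, not just on the machine the job was submitted from. One hedged sketch ships it with the job via SparkContext.addFile (the local path and script name are hypothetical, and the file must carry the executable bit):

    import org.apache.spark.SparkFiles
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("PipeShipSketch").getOrCreate()
    val sc = spark.sparkContext

    // Ship the script with the job; each executor downloads its own copy.
    sc.addFile("/tmp/myscript.sh")

    val rdd = sc.parallelize(Seq("x", "y"))
    // SparkFiles.get resolves where the shipped copy was placed.
    val out = rdd.pipe(SparkFiles.get("myscript.sh"))
    out.collect().foreach(println)

One caveat: pipe() builds its command string on the driver, so depending on the cluster manager the resolved path may not match the executors' download location; in that case it can be more robust to ship the script with spark-submit --files and invoke it by its relative name, e.g. rdd.pipe("./myscript.sh").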