Re: Port to open for submitting Spark on Yarn application

2017-09-03 Thread Satoshi Yamada
Jerry, Thanks for your comment.
On Mon, Sep 4, 2017 at 10:43 AM, Saisai Shao wrote:
> I think spark.yarn.am.port is not used any more, so you don't need to
> consider this.
>
> If you're running Spark on YARN, I think some YARN RM port to submit
> applications should

Re: java heap space

2017-09-03 Thread 周康
Maybe you can repartition?
2017-09-04 9:25 GMT+08:00 KhajaAsmath Mohammed :
> Hi,
>
> I am getting java.lang.OutOfMemoryError: Java heap space error whenever I
> ran the spark sql job.
>
> I came to conclusion issue is because of reading number of files from
> spark.
>

Re: Port to open for submitting Spark on Yarn application

2017-09-03 Thread Saisai Shao
I think spark.yarn.am.port is not used any more, so you don't need to consider this. If you're running Spark on YARN, I think the YARN RM port used to submit applications should also be reachable through the firewall, as well as the HDFS port used to upload resources. Also, on the Spark side, executors will be
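A minimal sketch of how the reachability described above could be checked from the client node with a plain TCP connect. The helper name `port_is_reachable` is hypothetical, and the host names and port numbers in the usage comment are placeholders, not values from this thread; the actual ports depend on the cluster's YARN and HDFS configuration.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical usage (placeholder hosts and ports):
#   port_is_reachable("rm.example.internal", 8032)
#   port_is_reachable("namenode.example.internal", 8020)
```

A successful connect only shows that the firewall lets the client reach that port; it does not cover connections that cluster nodes open back to the client in client mode.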

java heap space

2017-09-03 Thread KhajaAsmath Mohammed
Hi, I am getting a java.lang.OutOfMemoryError: Java heap space error whenever I run the Spark SQL job. I have concluded that the issue is caused by the number of files Spark is reading. I am reading 37 partitions, and each partition has around 2000 files, each larger than 128 MB. 37*2000 files
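The figures in the message above can be worked through as a quick back-of-the-envelope calculation. This is only a sketch: it treats 128 MB as a lower bound per file (the message says "more than 128 MB") and uses binary units, which is an assumption.

```python
# Figures taken from the message above.
PARTITIONS = 37
FILES_PER_PARTITION = 2000
MIN_FILE_SIZE_MIB = 128  # lower bound: each file is "more than 128 MB"

# Total number of files the job has to read.
total_files = PARTITIONS * FILES_PER_PARTITION

# Lower bound on the total input size, in GiB and TiB.
min_total_gib = total_files * MIN_FILE_SIZE_MIB / 1024
min_total_tib = min_total_gib / 1024

print(f"{total_files} files, at least {min_total_gib:.0f} GiB ({min_total_tib:.2f} TiB)")
```

So the job reads roughly 74,000 files and at least about 9 TiB of input, which gives a sense of scale when choosing a partition count.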

Port to open for submitting Spark on Yarn application

2017-09-03 Thread Satoshi Yamada
Hi, We run Spark on YARN in client mode. Our Hadoop cluster is behind a firewall and the client node is outside it, so I think I have to open the ports that the Application Master uses. I think the port is specified by "spark.yarn.am.port", as the documentation says.

Apache Spark: Parallelization of Multiple Machine Learning Algorithm

2017-09-03 Thread Timsina, Prem
Is there a way to parallelize multiple ML algorithms in Spark? My use case is something like this: A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random Forest, etc.) in parallel. 1) Validate each algorithm using 10-fold cross-validation. B) Feed the output of step A) in second
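One possible shape for step A, sketched in plain Python rather than Spark: Spark can run jobs concurrently when they are submitted from different threads of the same application, so a driver-side thread pool is one way to launch several cross-validations at once. Everything named here (`k_fold_indices`, `cross_validate`, `run_in_parallel`, and the `fit_and_score` callables) is a hypothetical stand-in; in a real job each callable would train and score a Spark model on the given row indices.

```python
from concurrent.futures import ThreadPoolExecutor


def k_fold_indices(n_rows, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    # Row i goes to fold i % k, so fold sizes differ by at most one.
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


def cross_validate(fit_and_score, n_rows, k=10):
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_rows, k)]
    return sum(scores) / len(scores)


def run_in_parallel(algorithms, n_rows, k=10):
    """Cross-validate every algorithm in its own thread; return name -> mean score."""
    with ThreadPoolExecutor(max_workers=len(algorithms)) as pool:
        futures = {
            name: pool.submit(cross_validate, fn, n_rows, k)
            for name, fn in algorithms.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

The returned dictionary of per-algorithm scores is the kind of output that a later step could consume.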

Apache Spark: Parallelization of Multiple Machine Learning Algorithm

2017-09-03 Thread prtimsina
Is there a way to parallelize multiple ML algorithms in Spark? My use case is something like this: A) Run multiple machine learning algorithms (Naive Bayes, ANN, Random Forest, etc.) in parallel. 1) Validate each algorithm using 10-fold cross-validation. B) Feed the output of step A) in second

Re: Problem with CSV line break data in PySpark 2.1.0

2017-09-03 Thread Riccardo Ferrari
Hi Aakash, What I see in the picture seems correct. Spark (pyspark) is reading your F2 cell as a multi-line text. Where are the nulls you're referring to? You might find the pyspark.sql.functions.regexp_replace
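To illustrate the behavior described above outside of Spark, here is a plain-Python sketch (not PySpark) with made-up sample data. A quoted CSV field containing a line break is parsed as one multi-line value, and a regular-expression substitution can then collapse the break into a space. `pyspark.sql.functions.regexp_replace` applies the same kind of pattern-based substitution to a DataFrame column.

```python
import csv
import io
import re

# Hypothetical sample: the F2 value of the first data row spans two lines.
raw = 'F1,F2,F3\n1,"first line\nsecond line",x\n2,single line,y\n'

# The csv module keeps a quoted multi-line field together as one cell.
rows = list(csv.reader(io.StringIO(raw)))

# Collapse any run of line-break characters inside a cell into one space.
cleaned = [[re.sub(r"[\r\n]+", " ", cell) for cell in row] for row in rows]

for row in cleaned:
    print(row)
```

This only works when the field with the line break is quoted in the file; an unquoted break is indistinguishable from a record boundary.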

[SS] How to know what events were late in a streaming batch?

2017-09-03 Thread Jacek Laskowski
Hi, I've asked this question on SO [1], but hope to get more attention by posting here. I'd like to know how many events were late in a streaming batch in Structured Streaming. Is there a way to know the number or (better) exactly which events were late? Thanks for any help you may offer! [1]

Problem with CSV line break data in PySpark 2.1.0

2017-09-03 Thread Aakash Basu
Hi, I have a dataset where a few rows of column F, as shown below, contain line breaks in the CSV file. [image: Inline image 1] When Spark reads it, it comes through as shown below, as a completely new line. [image: Inline image 2] I want my PySpark 2.1.0 to read it by forcefully avoiding the line