I'm using Spark 1.6.1 with Scala 2.11.7 on Ubuntu 14.04, with the following memory settings for my project: JAVA_OPTS="-Xmx8G -Xms2G".

My data is organized in 20 JSON-like files of about 8-15 MB each, containing categorical and numerical values. I parse this data through the DataFrame API, then scale one numerical feature and create dummy variables for the categorical features. From the 14 initial keys of each JSON-like file I end up with about 200-240 features in the final LabeledPoint. The final data is sparse, and every file contains about 20,000-30,000 observations. I try to run two algorithms on the data, LinearRegressionWithSGD and LassoWithSGD, since the data is sparse and regularization might be required.

The following questions emerged while running my tests:

1. Most important: for data larger than 11 MB, LinearRegressionWithSGD fails with the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 58 in stage 346.0 failed 1 times, most recent failure: Lost task 58.0 in stage 346.0 (TID 18140, localhost): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 179307 ms

I first hit this problem with an 11 MB file (with a 5 MB file the algorithm works well). After trying a lot of debugging options (testing different values for driver.memory and executor.memory, making sure the cache is cleared properly, proper use of coalesce()), I found that setting the step size of gradient descent to 1 works around the failure (while for the 5 MB file a step size of 0.4 does not fail and gives better results). So I tried to increase the step size for the 12 MB file (setting it to 1.5 and 2), but it didn't work. If I take only 10 MB of the file instead of the whole file, the algorithm doesn't fail. This is very frustrating, since I need to build the model on the whole file, which still seems far from Big Data scale.
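For clarity, here is roughly how the dummy-variable expansion I described works (a simplified, plain-Scala sketch; the object and method names are placeholders, my real code goes through the DataFrame API):

```scala
// Hypothetical sketch of the dummy-variable (one-hot) expansion described above:
// each categorical column becomes one 0/1 feature per distinct category value.
object DummyEncoding {
  // Map each distinct category value of a column to a feature index.
  def buildIndex(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  // Encode one value as a 0/1 vector over the column's categories.
  def oneHot(value: String, index: Map[String, Int]): Array[Double] = {
    val vec = Array.fill(index.size)(0.0)
    index.get(value).foreach(i => vec(i) = 1.0)
    vec
  }
}
```

With 14 columns, several of them categorical with a few dozen levels each, this is how the 14 initial keys grow to 200-240 features.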
If I cannot run linear regression on 12 MB, can I run it on larger sets at all? I noticed that the StandardScaler in the preprocessing step and the counts in the linear regression step perform a collect(), which may cause the failure. So I question whether linear regression can scale: as far as I understand, collect() runs on the driver, so the point of distributed computation is lost. The following parameters are set:

    val algorithme = new LinearRegressionWithSGD() // LassoWithSGD()
    algorithme.setIntercept(true)
    algorithme.optimizer
      .setNumIterations(100)
      .setStepSize(1)
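For completeness, here is a simplified sketch of my training pipeline around those parameters (assuming the Spark 1.6 MLlib RDD API; `data` stands for an RDD[LabeledPoint] built from the parsed files, and the variable names are placeholders):

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

def train(data: RDD[LabeledPoint]) = {
  // Scale features without centering (withMean = false) to preserve sparsity.
  val scaler = new StandardScaler(withMean = false, withStd = true)
    .fit(data.map(_.features))
  val scaled = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
  scaled.cache()

  val algorithme = new LinearRegressionWithSGD() // LassoWithSGD()
  algorithme.setIntercept(true)
  algorithme.optimizer
    .setNumIterations(100)
    .setStepSize(1.0)
  algorithme.run(scaled)
}
```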
2. My second question concerns the creation of the SparkContext. Since I didn't know the nature of the bug mentioned above, I tried to create a SparkContext for every file rather than one for the whole set of files. I still don't know the best option: does it make sense to create a SparkContext for every file of 10-20 MB, or can one SparkContext handle all 20 files of that size (i.e., iterate over the 20 files inside a single SparkContext)?

3. How to use coalesce() is not very clear to me. For now I apply coalesce(sqlContext.sparkContext.defaultParallelism) only to one DataFrame (the biggest one), but should I call this method on the smaller DataFrames as well? Can this method improve performance in my case (i.e., an initial DataFrame of 14 columns and 30,000 rows)?

4. On using multiple collect() calls: while parsing the JSON file, applying collect() seems unavoidable for obtaining the LabeledPoint in the end. I'd like to know in which cases I should avoid it and in which cases there is no danger in using it.

Thank you in advance for your help. Let me know if you need a sample of the data to reproduce the bug, in case my description is not sufficient.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/LinearRegressionWithSGD-fails-on-12Mb-data-tp26942.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.