I'm using Spark 1.6.1 with Scala 2.11.7 on Ubuntu 14.04, with the
following memory settings for my project: JAVA_OPTS="-Xmx8G -Xms2G". My
data is organized in 20 json-like files; every file is about 8-15 MB and
contains categorical and numerical values. I parse this data through the
DataFrame API, then scale one numerical feature and create dummy variables
for the categorical features. From the initial 14 keys of my json-like
files I end up with about 200-240 features in the final LabeledPoint. The
final data is sparse and every file contains about 20000-30000
observations. I run two types of algorithms on this data,
LinearRegressionWithSGD or LassoWithSGD, since the data is sparse and
regularization might be required. A simplified sketch of the preprocessing
step is shown right below, followed by the questions that came up while
running my tests:
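(The column names "color", "price" and "label", and the file path below
are placeholders, not my real schema.)

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // one of the 20 json-like files (placeholder path)
    val df = sqlContext.read.json("data/file-01.json")

    // index the distinct values of one categorical column, to build the
    // dummy variables
    val categories = df.select("color").distinct().collect()
      .map(_.getString(0)).zipWithIndex.toMap

    // one sparse vector per row: the numerical feature plus a 1.0 at the
    // position of the row's category; with all categorical columns this
    // grows to the ~200-240 features of the final LabeledPoint
    val points = df.map { row =>
      val catIdx = 1 + categories(row.getAs[String]("color"))
      val features = Vectors.sparse(1 + categories.size,
        Seq((0, row.getAs[Double]("price")), (catIdx, 1.0)))
      LabeledPoint(row.getAs[Double]("label"), features)
    }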
1. Most IMPORTANT: for data larger than 11 MB, LinearRegressionWithSGD
fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 58
in stage 346.0 failed 1 times, most recent failure: Lost task 58.0 in stage
346.0 (TID 18140, localhost): ExecutorLostFailure (executor driver exited
caused by one of the running tasks) Reason: Executor heartbeat timed out
after 179307 ms
I hit the same problem with an 11 MB file (for a 5 MB file the algorithm
works fine), and after trying a lot of debug options (different values for
driver.memory & executor.memory, making sure the cache is cleared
properly, proper use of coalesce()) I found that setting the gradient
descent StepSize to 1 works around the failure (while for the 5 MB file a
StepSize of 0.4 does not fail and gives better results).
So I tried increasing the StepSize further for the 12 MB file (to 1.5 and
2), but it didn't help. If I take only 10 MB of the file instead of the
whole file, the algorithm doesn't fail. This is very frustrating, since I
need to build the model on the whole file, which still seems far from "Big
Data" sizes. If I cannot run linear regression on 12 MB, could I run it on
larger sets at all? I noticed that the StandardScaler in the preprocessing
step and the counts in the linear regression step both perform a
collect(), which may be the cause of the failure. So the scalability of
linear regression itself is in question because, as far as I understand,
collect() runs on the driver and the point of distributed computation is
lost.
The following parameters are set:
    val algorithme = new LinearRegressionWithSGD() // or new LassoWithSGD()
    algorithme.setIntercept(true)
    algorithme.optimizer
      .setNumIterations(100)
      .setStepSize(1.0)
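For completeness, the training call around this looks roughly like the
following (simplified; "trainingData" is a placeholder for the
RDD[LabeledPoint] produced by the parsing step, and the StandardScaler is
the one I mentioned above):

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.regression.LabeledPoint

    // scale the features; withMean = false keeps the vectors sparse
    val scaler = new StandardScaler(withMean = false, withStd = true)
      .fit(trainingData.map(_.features))
    val scaled = trainingData
      .map(p => LabeledPoint(p.label, scaler.transform(p.features)))
      .cache()

    val model = algorithme.run(scaled)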

2. The second question concerns the creation of the SparkContext. Since I
didn't know the nature of the bug mentioned above, I tried creating a
SparkContext for every file instead of one for the whole set of files. So
far I don't know which is the better option: does it make sense to create
a SparkContext for every file of 10-20 MB, or can a single SparkContext
handle all 20 files of that size (i.e. iterate over the 20 files inside
one SparkContext)?
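To be clear, by "iterating over the 20 files inside one SparkContext" I
mean something like the following (app name and file paths are
placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // a single SparkContext / SQLContext for the whole run...
    val conf = new SparkConf().setAppName("regression").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // ...and a plain loop over the 20 files, building one model per file
    for (i <- 1 to 20) {
      val df = sqlContext.read.json(s"data/file-$i.json")
      // parse, scale, build the LabeledPoints and train, as sketched above
    }

The alternative would be to stop and recreate the SparkContext for every
single file.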

3. The guidance on using coalesce() is not very clear to me. For now I
apply coalesce(sqlContext.sparkContext.defaultParallelism) only to one
dataframe (the biggest one), but should I call this method on the smaller
dataframes as well? Can this method improve performance in my case (i.e.
an initial dataframe of 14 columns and 30000 rows)?
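For reference, the exact call I use is the following ("bigDf" stands for
that biggest dataframe):

    // reduce the partition count of the largest dataframe to the default
    // parallelism (coalesce only decreases the number of partitions)
    val coalesced = bigDf.coalesce(sqlContext.sparkContext.defaultParallelism)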

4. The question of using multiple collect() calls. While parsing the json
files, applying collect() seems inevitable in order to obtain the
LabeledPoints in the end, and I'd like to know in which cases I should
avoid it and in which cases there is no danger in using collect().
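To make this concrete, these are the two kinds of collect() I have in
mind (simplified, reusing the placeholder names from the sketch at the
top):

    // (a) a small collect(): only the distinct values of one categorical
    //     column come back to the driver, to build the dummy-variable index
    val categories = df.select("color").distinct().collect()
      .map(_.getString(0)).zipWithIndex.toMap

    // (b) a big collect(): the whole parsed dataframe is materialized on
    //     the driver before the LabeledPoints are built
    val allRows = df.collect()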

Thank you in advance for your help, and let me know if my description is
not sufficient and you need a sample of the data to reproduce the bug.


