Re: MLlib: Linear regression: Loss was due to java.lang.ArrayIndexOutOfBoundsException
Is it possible that the feature dimension changed after filtering? This can happen if you use the LIBSVM format but don't specify the number of features. -Xiangrui

On Tue, Dec 9, 2014 at 4:54 AM, Sameer Tilak wrote:
> Hi All,
>
> I was able to run LinearRegressionWithSGD on a larger dataset (~2 GB,
> sparse). I have now filtered the data and am running regression on a
> subset of it (~200 MB). I see the error below, which is strange since it
> was running fine with the superset. Is this a formatting issue (which I
> doubt) or some other issue in data preparation? I confirmed that there
> are no empty lines in my dataset. Any help with this will be highly
> appreciated.
>
> 14/12/08 20:32:03 WARN TaskSetManager: Lost TID 5 (task 3.0:1)
> 14/12/08 20:32:03 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
> java.lang.ArrayIndexOutOfBoundsException: 150323
>         at breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$129.apply(SparseVectorOps.scala:231)
>         at breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$129.apply(SparseVectorOps.scala:216)
>         at breeze.linalg.operators.BinaryRegistry$class.apply(BinaryOp.scala:60)
>         at breeze.linalg.VectorOps$$anon$178.apply(Vector.scala:391)
>         at breeze.linalg.NumericOps$class.dot(NumericOps.scala:83)
>         at breeze.linalg.DenseVector.dot(DenseVector.scala:47)
>         at org.apache.spark.mllib.optimization.LeastSquaresGradient.compute(Gradient.scala:125)
>         at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:180)
>         at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:179)
>         at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>         at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>         at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>         at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>         at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>         at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
>         at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
>         at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
>         at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
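A minimal sketch of the suggested fix, assuming the data sits in LIBSVM files loaded through PySpark's MLUtils (the Scala loadLibSVMFile has the same numFeatures overload); the path is illustrative, and the feature count must be the dimension of the original dataset, at least 150324 given the failing index above:

from pyspark.mllib.util import MLUtils

# Pin the feature dimension so the filtered subset lives in the same
# feature space as the full dataset; otherwise the loader infers the
# dimension from the largest index present, which can shrink after
# filtering and lead to out-of-range indices at training time.
num_features = 150324  # illustrative; use the original data's dimension
subset = MLUtils.loadLibSVMFile(sc, "path/to/filtered.libsvm",
                                numFeatures=num_features)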
Re: MLlib Linear regression
The proper step size depends in part on the Lipschitz constant of the objective. You should let the machine try different combinations of parameters and select the best. We are working with people from the AMPLab to make hyperparameter tuning easier in MLlib 1.2. For the theory, Nesterov's book "Introductory Lectures on Convex Optimization" is a good reference. The current implementation of LinearRegression does not use line search; we should definitely add that option in the future.

Best,
Xiangrui

On Wed, Oct 8, 2014 at 7:21 AM, Sameer Tilak wrote:
> Hi Xiangrui,
> Changing the default step size to 0.01 made a huge difference. The
> results make sense when I use A + B + C + D. MSE is ~0.07 and the
> outcome matches the domain knowledge.
>
> I was wondering: is there any documentation on the parameters and on
> when/how to vary them?
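To make "let the machine try different combinations" concrete before the built-in tuning lands, a rough PySpark grid-search sketch; training and validation are assumed held-out RDDs of LabeledPoint, and it assumes a Spark version whose Python train() exposes regParam (1.1+):

from pyspark.mllib.regression import LinearRegressionWithSGD

def mse(model, data):
    # mean squared error of the model on an RDD of LabeledPoint
    return data.map(lambda p: (p.label - model.predict(p.features)) ** 2).mean()

best = None
for step in [1.0, 0.1, 0.01, 0.001]:
    for reg in [0.0, 0.01, 0.1]:
        m = LinearRegressionWithSGD.train(training, iterations=200,
                                          step=step, regParam=reg)
        err = mse(m, validation)
        if best is None or err < best[0]:
            best = (err, step, reg)

print best  # (validation MSE, step, regParam) of the best combination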
RE: MLlib Linear regression
Hi Xiangrui,

Changing the default step size to 0.01 made a huge difference. The results make sense when I use A + B + C + D. MSE is ~0.07 and the outcome matches the domain knowledge.

I was wondering: is there any documentation on the parameters and on when/how to vary them?

> Date: Tue, 7 Oct 2014 15:11:39 -0700
> Subject: Re: MLlib Linear regression
> From: men...@gmail.com
> To: ssti...@live.com
> CC: user@spark.apache.org
>
> Did you test different regularization parameters and step sizes? In the
> combination that works, I don't see "A + D". Did you test that
> combination? Is there any linear dependency between A's columns and D's
> columns? -Xiangrui
Re: MLlib Linear regression
Did you test different regularization parameters and step sizes? In the combination that works, I don't see "A + D". Did you test that combination? Is there any linear dependency between A's columns and D's columns? -Xiangrui

On Tue, Oct 7, 2014 at 1:56 PM, Sameer Tilak wrote:
> BTW, one detail:
>
> When the number of iterations is 100, all weights are zero or below and
> the indices are only from set A.
>
> When the number of iterations is 150, I see 30+ non-zero weights (when
> sorted by weight) and the indices are distributed across all sets.
> However, MSE is high (5.xxx) and the result does not match the domain
> knowledge.
>
> When the number of iterations is 400, I see 30+ non-zero weights (when
> sorted by weight) and the indices are distributed across all sets.
> However, MSE is high (6.xxx) and the result does not match the domain
> knowledge.
>
> Any help will be highly appreciated.
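One quick way to probe the linear-dependency question is a correlation matrix over a handful of suspect columns via Statistics.corr (Spark 1.1+). With ~22K features the full matrix is impractical, so this sketch assumes you pick a few indices from blocks A and D; the indices below assume the blocks are concatenated in A, B, C, D order, so D would start at 15000 + 170 + 900 = 16070:

import numpy as np
from pyspark.mllib.stat import Statistics

# features: assumed RDD of feature vectors, e.g. data.map(lambda p: p.features)
cols = [0, 1, 2, 16070, 16071, 16072]  # illustrative columns from A and D
sub = features.map(lambda v: np.array(v.toArray())[cols])
print Statistics.corr(sub, method="pearson")
# off-diagonal entries near +/-1 indicate near-collinear columns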
RE: MLlib Linear regression
BTW, one detail:

When the number of iterations is 100, all weights are zero or below and the indices are only from set A.

When the number of iterations is 150, I see 30+ non-zero weights (when sorted by weight) and the indices are distributed across all sets. However, MSE is high (5.xxx) and the result does not match the domain knowledge.

When the number of iterations is 400, I see 30+ non-zero weights (when sorted by weight) and the indices are distributed across all sets. However, MSE is high (6.xxx) and the result does not match the domain knowledge.

Any help will be highly appreciated.

From: ssti...@live.com
To: user@spark.apache.org
Subject: MLlib Linear regression
Date: Tue, 7 Oct 2014 13:41:03 -0700

Hi All,
I have the following classes of features:

class A: 15000 features
class B: 170 features
class C: 900 features
class D: 6000 features

I use linear regression (over sparse data). I get excellent results with low RMSE (~0.06) for the following combinations of classes:
1. A + B + C
2. B + C + D
3. A + B
4. A + C
5. B + D
6. C + D
7. D

Unfortunately, when I use A + B + C + D (all the features) I get results that don't make any sense -- all weights are zero or below and the indices are only from set A. I also get a high MSE. I changed the number of iterations from 100 to 150, 250, and even 400; I still get an MSE of 5 to 6. Are there any other parameters I can play with? Any insight into what could be wrong? Is it somehow unable to scale up to 22K features? (I highly doubt that.)
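One common culprit when concatenated feature blocks live on very different scales is that no single SGD step size fits all of them, which matches the symptom of only block A's indices surviving. The thread does not confirm this diagnosis, but standardizing the features before training is the usual thing to try; a sketch assuming pyspark.mllib.feature.StandardScaler is available (Spark 1.2+) and data is the RDD of LabeledPoint:

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint

features = data.map(lambda p: p.features)
# withMean=False keeps sparse vectors sparse; withStd=True rescales each
# column to unit standard deviation
scaler = StandardScaler(withMean=False, withStd=True).fit(features)
scaled = data.zip(scaler.transform(features)).map(
    lambda (p, f): LabeledPoint(p.label, f))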
Re: MLlib Linear Regression Mismatch
Thanks Burak. Step size 0.01 worked for (b) and step=0.0001 for (c)!
Cheers

On Wed, Oct 1, 2014 at 3:00 PM, Burak Yavuz wrote:
> Hi,
>
> It appears that the step size is so high that the model diverges with the
> added noise. Could you try setting the step size to 0.1 or 0.01?
>
> Best,
> Burak
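For reference, a sketch of case (b) from this thread rerun with the step size reported to work (the full snippets are in the original message below), assuming the same PySpark shell (sc) as the originals:

from numpy import array
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

data = [
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(9.0, [10.0]),
    LabeledPoint(22.0, [20.0]),
    LabeledPoint(32.0, [30.0])
]
# With the default step=1.0, SGD diverges on this data (weights blow up
# to ~1e203 as reported); step=0.01 converges.
lrm = LinearRegressionWithSGD.train(sc.parallelize(data), step=0.01,
                                    initialWeights=array([1.0]))
print lrm.weights, lrm.intercept, lrm.predict([40])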
Re: MLlib Linear Regression Mismatch
Hi,

It appears that the step size is so high that the model diverges with the added noise. Could you try setting the step size to 0.1 or 0.01?

Best,
Burak

----- Original Message -----
From: "Krishna Sankar"
To: user@spark.apache.org
Sent: Wednesday, October 1, 2014 12:43:20 PM
Subject: MLlib Linear Regression Mismatch

Guys,
Obviously I am doing something wrong. Maybe 4 points are too small a dataset. Can you help me figure out why the following doesn't work?

a) This works:

data = [
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(10.0, [10.0]),
    LabeledPoint(20.0, [20.0]),
    LabeledPoint(30.0, [30.0])
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0]))
print lrm
print lrm.weights
print lrm.intercept
lrm.predict([40])

Output:

[ 1.]
0.0
40.0

b) After perturbing y a little bit, the model gives wrong results:

data = [
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(9.0, [10.0]),
    LabeledPoint(22.0, [20.0]),
    LabeledPoint(32.0, [30.0])
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0]))  # should be ~1.09x - 0.60
print lrm
print lrm.weights
print lrm.intercept
lrm.predict([40])

Output:

[ -8.20487463e+203]
0.0
-3.2819498532740317e+205

c) Same story here - wrong results, actually nan:

data = [
    LabeledPoint(18.9, [3910.0]),
    LabeledPoint(17.0, [3860.0]),
    LabeledPoint(20.0, [4200.0]),
    LabeledPoint(16.6, [3660.0])
]
lrm = LinearRegressionWithSGD.train(sc.parallelize(data), initialWeights=array([1.0]))  # should be ~0.006582x - 7.595170
print lrm
print lrm.weights
print lrm.intercept
lrm.predict([4000])

Output:

<pyspark.mllib.regression.LinearRegressionModel object at 0x109666b90>
[ nan]
0.0
nan

Cheers & Thanks