I submitted a PR for standardizing the text format for vectors and labeled data: https://github.com/apache/spark/pull/685
Once it gets merged, saveAsTextFile and loading should be consistent. I
didn't choose LibSVM as the default format for two reasons:

1) It doesn't contain feature dimension info in the record, so we need to
scan the dataset to get that info.
2) It saves index:value tuples. Putting the indices together can help data
compression. The same holds for the values if there are many binary
features.

Best,
Xiangrui

On Wed, May 7, 2014 at 10:25 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> Hi,
>
> I see ALS is still using Array[Int] but for other mllib algorithm we moved
> to Vector[Double] so that it can support either dense and sparse formats...
>
> ALS can stay in Array[Int] due to the Netflix format for input datasets
> which is well defined but it helps if we move ALS to Vector[Double] as
> well...that way all algorithms will be consistent...
>
> The second issue is that toString on SparseVector does not write libsvm
> format but something not very generic...can we change the
> SparseVector.toString to write as libsvm output ? I am dumping a sample of
> dataset to see how mllib glm compares with the glmnet-R package for QoR...
>
> Thanks.
> Deb
>
> On Mon, May 5, 2014 at 4:05 PM, David Hall <d...@cs.berkeley.edu> wrote:
>>
>>> On Mon, May 5, 2014 at 3:40 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>
>>> > David,
>>> >
>>> > Could we use Int, Long, Float as the data feature spaces, and Double for
>>> > optimizer?
>>> >
>>>
>>> Yes. Breeze doesn't allow operations on mixed types, so you'd need to
>>> convert the double vectors to Floats if you wanted, e.g. dot product with
>>> the weights vector.
>>>
>>> You might also be interested in FeatureVector, which is just a wrapper
>>> around Array[Int] that emulates an indicator vector. It supports dot
>>> products, axpy, etc.
>>>
>>> -- David
>>>
>>> >
>>> > Sincerely,
>>> >
>>> > DB Tsai
>>> > -------------------------------------------------------
>>> > My Blog: https://www.dbtsai.com
>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >
>>> >
>>> > On Mon, May 5, 2014 at 3:06 PM, David Hall <d...@cs.berkeley.edu> wrote:
>>> >
>>> > > Lbfgs and other optimizers would not work immediately, as they require
>>> > > vector spaces over double. Otherwise it should work.
>>> > > On May 5, 2014 3:03 PM, "DB Tsai" <dbt...@stanford.edu> wrote:
>>> > >
>>> > > > Breeze could take any type (Int, Long, Double, and Float) in the
>>> > > > matrix template.
>>> > > >
>>> > > > Sincerely,
>>> > > >
>>> > > > DB Tsai
>>> > > >
>>> > > >
>>> > > > On Mon, May 5, 2014 at 2:56 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>> > > >
>>> > > > > Is this a breeze issue or breeze can take templates on float /
>>> > > > > double ?
>>> > > > >
>>> > > > > If breeze can take templates then it is a minor fix for
>>> > > > > Vectors.scala right ?
>>> > > > >
>>> > > > > Thanks.
>>> > > > > Deb
>>> > > > >
>>> > > > >
>>> > > > > On Mon, May 5, 2014 at 2:45 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>> > > > >
>>> > > > > > +1 Would be nice that we can use different type in Vector.
>>> > > > > >
>>> > > > > > Sincerely,
>>> > > > > >
>>> > > > > > DB Tsai
>>> > > > > >
>>> > > > > >
>>> > > > > > On Mon, May 5, 2014 at 2:41 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>> > > > > >
>>> > > > > > > Hi,
>>> > > > > > >
>>> > > > > > > Why mllib vector is using double as default ?
>>> > > > > > >
>>> > > > > > > /**
>>> > > > > > >  * Represents a numeric vector, whose index type is Int and value
>>> > > > > > >  * type is Double.
>>> > > > > > >  */
>>> > > > > > > trait Vector extends Serializable {
>>> > > > > > >
>>> > > > > > >   /**
>>> > > > > > >    * Size of the vector.
>>> > > > > > >    */
>>> > > > > > >   def size: Int
>>> > > > > > >
>>> > > > > > >   /**
>>> > > > > > >    * Converts the instance to a double array.
>>> > > > > > >    */
>>> > > > > > >   def toArray: Array[Double]
>>> > > > > > >
>>> > > > > > > Don't we need a template on float/double ? This will give us
>>> > > > > > > memory savings...
>>> > > > > > >
>>> > > > > > > Thanks.
>>> > > > > > >
>>> > > > > > > Deb
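
For reference, the two reasons given at the top of this message for not using LibSVM as the default can be sketched in plain Scala. This is only an illustration under stated assumptions: the object TextFormats and its helpers toLibSVM, toGrouped and inferSize are hypothetical names, not MLlib API, and the grouped "(label,(size,[indices],[values]))" layout is an assumed shape for a format that stores the size and keeps indices and values together. It may not match what the PR finally emits.

```scala
// Hypothetical helpers for illustration only; none of this is MLlib API.
object TextFormats {

  // LibSVM-style record: "label index:value index:value ..." with 1-based
  // indices. The vector size is not stored anywhere in the record.
  def toLibSVM(label: Double, indices: Array[Int], values: Array[Double]): String = {
    val pairs = indices.zip(values).map { case (i, v) => s"${i + 1}:$v" }
    (label.toString +: pairs).mkString(" ")
  }

  // Assumed grouped layout: the size is explicit, and all indices are written
  // together, followed by all values.
  def toGrouped(label: Double, size: Int, indices: Array[Int], values: Array[Double]): String = {
    val idx = indices.mkString("[", ",", "]")
    val vs = values.mkString("[", ",", "]")
    s"($label,($size,$idx,$vs))"
  }

  // With LibSVM records, the dimension can only be estimated by scanning every
  // record for the largest index that occurs.
  def inferSize(records: Seq[String]): Int =
    records
      .flatMap(_.split(' ').drop(1))        // drop the label, keep "index:value"
      .map(_.takeWhile(_ != ':').toInt)     // parse the 1-based index
      .foldLeft(0)((a, b) => math.max(a, b))
}
```

For a record with label 1.0, size 4, indices (0, 2) and values (1.0, 3.0), toLibSVM gives "1.0 1:1.0 3:3.0" and toGrouped gives "(1.0,(4,[0,2],[1.0,3.0]))". Note that inferSize on the LibSVM records can only return a lower bound (3 here), since trailing all-zero features never appear in any record.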