I submitted a PR for standardizing the text format for vectors and
labeled data: https://github.com/apache/spark/pull/685

Once it gets merged, saveAsTextFile and loading should be consistent.
I didn't choose LibSVM as the default format for two reasons:

1) It doesn't contain feature dimension info in the record. We need to
scan the dataset to get that info.
2) It saves index:value tuples. Grouping the indices together can help
data compression. The same holds for the values if there are many binary
features.
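
To illustrate the two points above, here is a minimal sketch that writes the
same labeled sparse vector in both styles. The names (VectorTextFormats,
Sparse, toLibSVM, toGrouped) are hypothetical and the grouped layout is only
an assumption for illustration; it is not taken from the PR.

```scala
// Hypothetical helpers for illustration only; not the code in the PR.
object VectorTextFormats {
  // A sparse vector: dimension, 0-based active indices, and matching values.
  final case class Sparse(size: Int, indices: Array[Int], values: Array[Double])

  // LibSVM style: label followed by 1-based "index:value" pairs.
  // The vector dimension is not recorded anywhere in the line.
  def toLibSVM(label: Double, v: Sparse): String = {
    val pairs = v.indices.zip(v.values).map { case (i, x) => s"${i + 1}:$x" }
    (label.toString +: pairs).mkString(" ")
  }

  // Grouped style: dimension first, then all indices, then all values.
  def toGrouped(label: Double, v: Sparse): String = {
    val idx = v.indices.mkString("[", ",", "]")
    val vals = v.values.mkString("[", ",", "]")
    s"($label,(${v.size},$idx,$vals))"
  }
}

object VectorTextFormatsDemo extends App {
  val v = VectorTextFormats.Sparse(4, Array(0, 2), Array(1.0, 1.0))
  println(VectorTextFormats.toLibSVM(1.0, v))  // 1.0 1:1.0 3:1.0
  println(VectorTextFormats.toGrouped(1.0, v)) // (1.0,(4,[0,2],[1.0,1.0]))
}
```

From the LibSVM line alone a reader cannot tell whether the vector has 3 or
4 dimensions, while the grouped line states the size and keeps the repeated
1.0 values next to each other.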

Best,
Xiangrui

On Wed, May 7, 2014 at 10:25 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> Hi,
>
> I see ALS is still using Array[Int] but for other mllib algorithms we moved
> to Vector[Double] so that they can support both dense and sparse formats...
>
> ALS can stay in Array[Int] due to the Netflix format for input datasets
> which is well defined but it helps if we move ALS to Vector[Double] as
> well...that way all algorithms will be consistent...
>
> The second issue is that toString on SparseVector does not write libsvm
> format but something not very generic...can we change the
> SparseVector.toString to write as libsvm output ? I am dumping a sample of
> dataset to see how mllib glm compares with the glmnet-R package for QoR...
>
> Thanks.
> Deb
>
> On Mon, May 5, 2014 at 4:05 PM, David Hall <d...@cs.berkeley.edu> wrote:
>>
>>> On Mon, May 5, 2014 at 3:40 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>
>>> > David,
>>> >
>>> > Could we use Int, Long, Float as the data feature spaces, and Double for
>>> > optimizer?
>>> >
>>>
>>> Yes. Breeze doesn't allow operations on mixed types, so you'd need to
>>> convert the double vectors to Floats if you wanted, e.g. dot product with
>>> the weights vector.
>>>
>>> You might also be interested in FeatureVector, which is just a wrapper
>>> around Array[Int] that emulates an indicator vector. It supports dot
>>> products, axpy, etc.
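
The indicator-vector idea described above can be sketched in plain Scala.
IndicatorVector and its methods are hypothetical stand-ins written for this
note; they are not Breeze's FeatureVector API.

```scala
// Hypothetical stand-in, not Breeze's FeatureVector: it stores only the
// active indices, and each active index has an implicit value of 1.0.
final case class IndicatorVector(activeIndices: Array[Int]) {
  // Dot product with dense weights: sum the weights at the active indices.
  def dot(weights: Array[Double]): Double =
    activeIndices.foldLeft(0.0)((acc, i) => acc + weights(i))

  // axpy (y += a * x): add a to y at every active index, in place.
  def axpyInto(a: Double, y: Array[Double]): Unit =
    activeIndices.foreach(i => y(i) += a)
}

object IndicatorVectorDemo extends App {
  val x = IndicatorVector(Array(0, 2))
  println(x.dot(Array(0.5, 1.0, 2.0))) // 2.5

  val y = Array(0.0, 0.0, 0.0)
  x.axpyInto(2.0, y)
  println(y.mkString(",")) // 2.0,0.0,2.0
}
```

Because the values are implicit, the feature data stays as Array[Int] while
the weights and accumulators remain Double, so no per-element conversion is
needed for these two operations.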
>>>
>>> -- David
>>>
>>>
>>> >
>>> >
>>> > Sincerely,
>>> >
>>> > DB Tsai
>>> > -------------------------------------------------------
>>> > My Blog: https://www.dbtsai.com
>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >
>>> >
>>> > On Mon, May 5, 2014 at 3:06 PM, David Hall <d...@cs.berkeley.edu> wrote:
>>> >
>>> > > Lbfgs and other optimizers would not work immediately, as they require
>>> > > vector spaces over double. Otherwise it should work.
>>> > > On May 5, 2014 3:03 PM, "DB Tsai" <dbt...@stanford.edu> wrote:
>>> > >
>>> > > > Breeze could take any type (Int, Long, Double, and Float) in the
>>> > > > matrix template.
>>> > > >
>>> > > >
>>> > > > Sincerely,
>>> > > >
>>> > > > DB Tsai
>>> > > >
>>> > > >
>>> > > > On Mon, May 5, 2014 at 2:56 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>> > > >
>>> > > > > Is this a Breeze issue, or can Breeze take templates on float /
>>> > > > > double?
>>> > > > >
>>> > > > > If Breeze can take templates, then it is a minor fix for
>>> > > > > Vectors.scala, right?
>>> > > > >
>>> > > > > Thanks.
>>> > > > > Deb
>>> > > > >
>>> > > > >
>>> > > > > On Mon, May 5, 2014 at 2:45 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>> > > > >
>>> > > > > > +1. It would be nice if we could use different types in Vector.
>>> > > > > >
>>> > > > > >
>>> > > > > > Sincerely,
>>> > > > > >
>>> > > > > > DB Tsai
>>> > > > > >
>>> > > > > >
>>> > > > > > On Mon, May 5, 2014 at 2:41 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>> > > > > >
>>> > > > > > > Hi,
>>> > > > > > >
>>> > > > > > > Why is the mllib Vector using Double as the default?
>>> > > > > > >
>>> > > > > > > /**
>>> > > > > > >  * Represents a numeric vector, whose index type is Int and value
>>> > > > > > >  * type is Double.
>>> > > > > > >  */
>>> > > > > > > trait Vector extends Serializable {
>>> > > > > > >
>>> > > > > > >   /**
>>> > > > > > >    * Size of the vector.
>>> > > > > > >    */
>>> > > > > > >   def size: Int
>>> > > > > > >
>>> > > > > > >   /**
>>> > > > > > >    * Converts the instance to a double array.
>>> > > > > > >    */
>>> > > > > > >   def toArray: Array[Double]
>>> > > > > > >
>>> > > > > > > Don't we need a template on float/double ? This will give us
>>> > > > > > > memory savings...
>>> > > > > > >
>>> > > > > > > Thanks.
>>> > > > > > >
>>> > > > > > > Deb
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
