Re: mllib vector templates
Hi,

I see ALS is still using Array[Int], but the other mllib algorithms have moved to Vector[Double] so that they can support either dense or sparse formats. ALS can stay on Array[Int] since the Netflix format for input datasets is well defined, but it would help if we moved ALS to Vector[Double] as well... that way all the algorithms will be consistent.

The second issue is that toString on SparseVector does not write libsvm format but something not very generic. Can we change SparseVector.toString to write libsvm output? I am dumping a sample dataset to see how mllib glm compares with the glmnet-R package for QoR...

Thanks.
Deb

On Mon, May 5, 2014 at 4:05 PM, David Hall wrote:
> On Mon, May 5, 2014 at 3:40 PM, DB Tsai wrote:
> > David,
> >
> > Could we use Int, Long, Float as the data feature spaces, and Double
> > for the optimizer?
>
> Yes. Breeze doesn't allow operations on mixed types, so you'd need to
> convert the double vectors to Floats if you wanted, e.g., a dot product
> with the weights vector.
>
> You might also be interested in FeatureVector, which is just a wrapper
> around Array[Int] that emulates an indicator vector. It supports dot
> products, axpy, etc.
>
> -- David
>
> > Sincerely,
> >
> > DB Tsai
> > ---
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> > On Mon, May 5, 2014 at 3:06 PM, David Hall wrote:
> > > LBFGS and the other optimizers would not work immediately, as they
> > > require vector spaces over Double. Otherwise it should work.
> > >
> > > On May 5, 2014 3:03 PM, "DB Tsai" wrote:
> > > > Breeze can take any type (Int, Long, Double, and Float) in the
> > > > matrix template.
> > > >
> > > > On Mon, May 5, 2014 at 2:56 PM, Debasish Das
> > > > <debasish.da...@gmail.com> wrote:
> > > > > Is this a Breeze issue, or can Breeze take templates on
> > > > > Float/Double? If Breeze can take templates, then it is a minor
> > > > > fix for Vectors.scala, right?
> > > > >
> > > > > Thanks.
> > > > > Deb
> > > > >
> > > > > On Mon, May 5, 2014 at 2:45 PM, DB Tsai wrote:
> > > > > > +1. It would be nice if we could use different types in Vector.
> > > > > >
> > > > > > On Mon, May 5, 2014 at 2:41 PM, Debasish Das
> > > > > > <debasish.da...@gmail.com> wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > Why is the mllib vector using Double as the default?
> > > > > > >
> > > > > > > /**
> > > > > > >  * Represents a numeric vector, whose index type is Int and
> > > > > > >  * value type is Double.
> > > > > > >  */
> > > > > > > trait Vector extends Serializable {
> > > > > > >
> > > > > > >   /**
> > > > > > >    * Size of the vector.
> > > > > > >    */
> > > > > > >   def size: Int
> > > > > > >
> > > > > > >   /**
> > > > > > >    * Converts the instance to a double array.
> > > > > > >    */
> > > > > > >   def toArray: Array[Double]
> > > > > > >
> > > > > > > Don't we need a template on Float/Double? That would give us
> > > > > > > memory savings...
> > > > > > >
> > > > > > > Thanks.
> > > > > > > Deb
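The templating the thread is asking about could look roughly like the sketch below. This is only an illustration of the idea, not mllib's actual API: the `NumericVector` trait and `DenseFloatVector` class are hypothetical names, and mllib's real `Vector` is Double-only as quoted in the thread.

```scala
// Hypothetical sketch: a vector trait parameterized on its value type, so
// Float-backed storage could roughly halve memory versus Double.
// @specialized avoids boxing for the primitive types listed.
trait NumericVector[@specialized(Float, Double) V] extends Serializable {
  /** Size of the vector. */
  def size: Int
  /** Converts the instance to an array of the value type. */
  def toArray: Array[V]
}

/** A minimal dense implementation backed by Float storage. */
class DenseFloatVector(values: Array[Float]) extends NumericVector[Float] {
  def size: Int = values.length
  def toArray: Array[Float] = values.clone()
}
```

As David notes above, the optimizers would still need Double vector spaces, so a Float-backed vector would have to be converted before being handed to LBFGS and friends.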
Re: Updating docs for running on Mesos
Andrew,

Updating these docs would be great! I think this would be a welcome change. In terms of packaging, it would be good to also mention the binaries produced by the upstream project, in addition to Mesosphere's.

- Patrick

On Thu, May 8, 2014 at 12:51 AM, Andrew Ash wrote:
> The docs for how to run Spark on Mesos have changed very little since
> 0.6.0, but setting it up is much easier now than then. Does it make sense
> to revamp with the below changes?
>
> You no longer need to build Mesos yourself, as pre-built versions are
> available from Mesosphere: http://mesosphere.io/downloads/
>
> And the instructions guide you towards compiling your own distribution of
> Spark, when you can use the prebuilt versions of Spark as well.
>
> I'd like to split that portion of the documentation into two sections, a
> build-from-scratch section and a use-prebuilt section. The new outline
> would look something like this:
>
> *Running Spark on Mesos*
>
> Installing Mesos
>   - using prebuilt (recommended)
>     - pointer to Mesosphere's packages
>   - from scratch
>     - (similar to current)
>
> Connecting Spark to Mesos
>   - loading the distribution into an accessible location
>   - Spark settings
>
> Mesos Run Modes
>   - (same as current)
>
> Running Alongside Hadoop
>   - (trim this down)
>
> Does that work for people?
>
> Thanks!
> Andrew
>
> PS Basically all the same:
>
> http://spark.apache.org/docs/0.6.0/running-on-mesos.html
> http://spark.apache.org/docs/0.6.2/running-on-mesos.html
> http://spark.apache.org/docs/0.7.3/running-on-mesos.html
> http://spark.apache.org/docs/0.8.1/running-on-mesos.html
> http://spark.apache.org/docs/0.9.1/running-on-mesos.html
> https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html
Re: Updating docs for running on Mesos
Thanks for suggesting this and volunteering to do it.

On May 11, 2014 3:32 AM, "Andrew Ash" wrote:
> The docs for how to run Spark on Mesos have changed very little since
> 0.6.0, but setting it up is much easier now than then. Does it make sense
> to revamp with the below changes?
>
> You no longer need to build Mesos yourself, as pre-built versions are
> available from Mesosphere: http://mesosphere.io/downloads/
>
> And the instructions guide you towards compiling your own distribution of
> Spark, when you can use the prebuilt versions of Spark as well.
>
> I'd like to split that portion of the documentation into two sections, a
> build-from-scratch section and a use-prebuilt section. The new outline
> would look something like this:
>
> *Running Spark on Mesos*
>
> Installing Mesos
>   - using prebuilt (recommended)
>     - pointer to Mesosphere's packages
>   - from scratch
>     - (similar to current)
>
> Connecting Spark to Mesos
>   - loading the distribution into an accessible location
>   - Spark settings
>
> Mesos Run Modes
>   - (same as current)
>
> Running Alongside Hadoop
>   - (trim this down)

What trimming do you have in mind here?

> Does that work for people?
>
> Thanks!
> Andrew
>
> PS Basically all the same:
>
> http://spark.apache.org/docs/0.6.0/running-on-mesos.html
> http://spark.apache.org/docs/0.6.2/running-on-mesos.html
> http://spark.apache.org/docs/0.7.3/running-on-mesos.html
> http://spark.apache.org/docs/0.8.1/running-on-mesos.html
> http://spark.apache.org/docs/0.9.1/running-on-mesos.html
> https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html
Re: Spark on Scala 2.11
I believe Matei has said before that he would like to cross-build for 2.10 and 2.11, given that the difference is not as big as between 2.9 and 2.10, but I don't know when this would happen...

On Sat, May 10, 2014 at 11:02 PM, Gary Malouf wrote:
> Considering the team just bumped to 2.10 in 0.9, I would be surprised if
> this is a near-term priority.
>
> On Thu, May 8, 2014 at 9:33 PM, Anand Avati wrote:
> > Is there an ongoing effort (or intent) to support Spark on Scala 2.11?
> > Approximate timeline?
> >
> > Thanks
Re: Spark on Scala 2.11
We do want to support it eventually, possibly as early as Spark 1.1 (which we'd cross-build on Scala 2.10 and 2.11). If someone wants to look at it before then, feel free to do so! Scala 2.11 is very close to 2.10, so I think things will mostly work, except possibly for the REPL (which has required porting over code from the Scala REPL in each version).

Matei

On May 8, 2014, at 6:33 PM, Anand Avati wrote:
> Is there an ongoing effort (or intent) to support Spark on Scala 2.11?
> Approximate timeline?
>
> Thanks
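For anyone experimenting before an official cross-build lands, the sbt side of cross-building is roughly the fragment below. This is a sketch only: the version strings are illustrative, and Spark's actual build setup may differ.

```scala
// build.sbt (sketch): declare both Scala versions; sbt's "+" prefix
// (e.g. "+ package") then runs the task against each declared version.
scalaVersion := "2.10.4"
crossScalaVersions := Seq("2.10.4", "2.11.0")
```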
LabeledPoint dump LibSVM if SparseVector
Hi,

I need to change the toString on LabeledPoint to libsvm format so that I can dump an RDD[LabeledPoint] in a format that can be read by sparse glmnet-R and other packages, to benchmark mllib classification accuracy...

Basically I have to change the toString of LabeledPoint and the toString of SparseVector. Should I add it as a PR, or is it already being added?

I added these toLibSvm functions to my internal util class for now:

  def toLibSvm(labelPoint: LabeledPoint): String = {
    labelPoint.label.toString + " " +
      toLibSvm(labelPoint.features.asInstanceOf[SparseVector])
  }

  def toLibSvm(features: SparseVector): String = {
    val indices = features.indices
    val values = features.values
    indices.zip(values).mkString(" ")
      .replace(',', ':').replace("(", "").replace(")", "")
  }

Thanks.
Deb
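Since the goal is interop with glmnet-R and other libsvm readers, one detail worth flagging: the libsvm format expects 1-based feature indices in ascending order, while SparseVector indices are 0-based. A self-contained sketch of the conversion is below; it works on plain arrays so it runs without mllib, and `LibSvmUtil` and its signatures are my own illustrative names, not part of mllib.

```scala
// Sketch: libsvm-style formatting for a sparse vector given as parallel
// (indices, values) arrays, shifting 0-based indices to libsvm's 1-based form.
object LibSvmUtil {
  // Formats the features only, e.g. indices=[2,0], values=[0.5,2.0]
  // becomes "1:2.0 3:0.5" (sorted, 1-based).
  def toLibSvm(indices: Array[Int], values: Array[Double]): String =
    indices.zip(values)
      .sortBy(_._1)                          // libsvm requires ascending indices
      .map { case (i, v) => s"${i + 1}:$v" } // shift to 1-based
      .mkString(" ")

  // Prepends the label, matching the "label idx:val idx:val ..." line format.
  def toLibSvm(label: Double, indices: Array[Int], values: Array[Double]): String =
    label.toString + " " + toLibSvm(indices, values)
}
```

For example, `LibSvmUtil.toLibSvm(1.0, Array(2, 0), Array(0.5, 2.0))` yields `"1.0 1:2.0 3:0.5"`.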