Re: mllib vector templates

2014-05-11 Thread Debasish Das
Hi,

I see ALS is still using Array[Int], but for the other mllib algorithms we
moved to Vector[Double] so that they can support both dense and sparse formats...

ALS can stay with Array[Int] since the Netflix format for input datasets is
well defined, but it would help if we moved ALS to Vector[Double] as
well...that way all algorithms would be consistent...

The second issue is that toString on SparseVector does not write libsvm
format but something less portable...can we change SparseVector.toString to
emit libsvm output? I am dumping a sample of a dataset to see how mllib glm
compares with the glmnet-R package for QoR...
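
For example, for a sparse vector of size 4 with 0.5 at index 1 and 2.0 at
index 3, the current toString prints something like (4,[1,3],[0.5,2.0]),
whereas the libsvm line for a labeled example would be (libsvm indices are
conventionally 1-based):

    1.0 2:0.5 4:2.0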

Thanks.
Deb

On Mon, May 5, 2014 at 4:05 PM, David Hall  wrote:
>
>> On Mon, May 5, 2014 at 3:40 PM, DB Tsai  wrote:
>>
>> > David,
>> >
>> > Could we use Int, Long, Float as the data feature spaces, and Double for
>> > optimizer?
>> >
>>
>> Yes. Breeze doesn't allow operations on mixed types, so you'd need to
>> convert the double vectors to Floats if you wanted to, e.g., take a dot
>> product with the weights vector.
>>
>> You might also be interested in FeatureVector, which is just a wrapper
>> around Array[Int] that emulates an indicator vector. It supports dot
>> products, axpy, etc.
>>
>> -- David
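
(A minimal Breeze sketch of the conversion and FeatureVector usage described
above, assuming Breeze's 2014-era API; exact imports and implicit operators
may vary by version:)

import breeze.linalg.{DenseVector, axpy, convert}
import breeze.features.FeatureVector

val w = DenseVector(0.1, 0.2, 0.3, 0.4)  // Double weights
val wf = convert(w, Float)               // explicit Double -> Float; no mixed-type ops in Breeze
val fv = new FeatureVector(Array(0, 2))  // wraps Array[Int], emulating an indicator vector
val s = fv dot w                         // dot product: w(0) + w(2)
axpy(2.0, fv, w)                         // w(i) += 2.0 at the active indices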
>>
>>
>> >
>> >
>> > Sincerely,
>> >
>> > DB Tsai
>> > ---
>> > My Blog: https://www.dbtsai.com
>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >
>> >
>> > On Mon, May 5, 2014 at 3:06 PM, David Hall  wrote:
>> >
>> > > LBFGS and other optimizers would not work immediately, as they require
>> > > vector spaces over Double. Otherwise it should work.
>> > > On May 5, 2014 3:03 PM, "DB Tsai"  wrote:
>> > >
>> > > > Breeze could take any type (Int, Long, Double, and Float) in the
>> > > > matrix template.
>> > > >
>> > > >
>> > > > Sincerely,
>> > > >
>> > > > DB Tsai
>> > > > ---
>> > > > My Blog: https://www.dbtsai.com
>> > > > LinkedIn: https://www.linkedin.com/in/dbtsai
>> > > >
>> > > >
>> > > > On Mon, May 5, 2014 at 2:56 PM, Debasish Das
>> > > > <debasish.da...@gmail.com> wrote:
>> > > >
>> > > > > Is this a Breeze issue, or can Breeze take templates on
>> > > > > float/double?
>> > > > >
>> > > > > If Breeze can take templates, then it is a minor fix for
>> > > > > Vectors.scala, right?
>> > > > >
>> > > > > Thanks.
>> > > > > Deb
>> > > > >
>> > > > >
>> > > > On Mon, May 5, 2014 at 2:45 PM, DB Tsai  wrote:
>> > > > >
>> > > > > > +1  It would be nice if we could use different types in Vector.
>> > > > > >
>> > > > > >
>> > > > > > Sincerely,
>> > > > > >
>> > > > > > DB Tsai
>> > > > > > ---
>> > > > > > My Blog: https://www.dbtsai.com
>> > > > > > LinkedIn: https://www.linkedin.com/in/dbtsai
>> > > > > >
>> > > > > >
>> > > > > > On Mon, May 5, 2014 at 2:41 PM, Debasish Das
>> > > > > > <debasish.da...@gmail.com> wrote:
>> > > > > >
>> > > > > > > Hi,
>> > > > > > >
>> > > > > > > Why is the mllib Vector using Double as the default?
>> > > > > > >
>> > > > > > > /**
>> > > > > > >  * Represents a numeric vector, whose index type is Int and
>> > > > > > >  * value type is Double.
>> > > > > > >  */
>> > > > > > > trait Vector extends Serializable {
>> > > > > > >
>> > > > > > >   /**
>> > > > > > >    * Size of the vector.
>> > > > > > >    */
>> > > > > > >   def size: Int
>> > > > > > >
>> > > > > > >   /**
>> > > > > > >    * Converts the instance to a double array.
>> > > > > > >    */
>> > > > > > >   def toArray: Array[Double]
>> > > > > > >
>> > > > > > > Don't we need a template on Float/Double? That would give us
>> > > > > > > memory savings...
>> > > > > > >
>> > > > > > > Thanks.
>> > > > > > >
>> > > > > > > Deb
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
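
(Regarding the templating question above: a hypothetical sketch, not the
actual mllib API, of what a Vector templated on its value type could look
like, using @specialized to avoid boxing:)

import scala.{specialized => spec}

// Sketch: value-type-parameterized variant of the mllib Vector trait.
trait Vector[@spec(Float, Double) V] extends Serializable {

  /** Size of the vector. */
  def size: Int

  /** Converts the instance to an array of the value type. */
  def toArray: Array[V]
}

// A dense implementation backed by Array[V]; Float halves the memory of Double.
class DenseVector[@spec(Float, Double) V](val values: Array[V]) extends Vector[V] {
  def size: Int = values.length
  def toArray: Array[V] = values
}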


Re: Updating docs for running on Mesos

2014-05-11 Thread Patrick Wendell
Andrew,

Updating these docs would be great! I think this would be a welcome change.

In terms of packaging, it would be good to also mention the binaries
produced by the upstream project, in addition to Mesosphere's.

- Patrick

On Thu, May 8, 2014 at 12:51 AM, Andrew Ash  wrote:
> The docs for how to run Spark on Mesos have changed very little since
> 0.6.0, but setting it up is much easier now than then.  Does it make sense
> to revamp with the below changes?
>
>
> You no longer need to build Mesos yourself, as pre-built versions are
> available from Mesosphere: http://mesosphere.io/downloads/
>
> Also, the instructions guide you towards compiling your own distribution of
> Spark, when you could use the prebuilt versions of Spark instead.
>
>
> I'd like to split that portion of the documentation into two sections, a
> build-from-scratch section and a use-prebuilt section.  The new outline
> would look something like this:
>
>
> *Running Spark on Mesos*
>
> Installing Mesos
> - using prebuilt (recommended)
>  - pointer to mesosphere's packages
> - from scratch
>  - (similar to current)
>
>
> Connecting Spark to Mesos
> - loading distribution into an accessible location
> - Spark settings
>
> Mesos Run Modes
> - (same as current)
>
> Running Alongside Hadoop
> - (trim this down)
>
>
>
> Does that work for people?
>
>
> Thanks!
> Andrew
>
>
> PS Basically all the same:
>
> http://spark.apache.org/docs/0.6.0/running-on-mesos.html
> http://spark.apache.org/docs/0.6.2/running-on-mesos.html
> http://spark.apache.org/docs/0.7.3/running-on-mesos.html
> http://spark.apache.org/docs/0.8.1/running-on-mesos.html
> http://spark.apache.org/docs/0.9.1/running-on-mesos.html
> https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html
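
(For the "Connecting Spark to Mesos" section above, a hypothetical sketch of
the Spark-side settings it would cover, using 0.9/1.0-era property names; the
Mesos host and HDFS path are placeholders:)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("mesos://host:5050")  // Mesos master URL
  .setAppName("MesosExample")
  // point executors at a prebuilt Spark distribution in an accessible location
  .set("spark.executor.uri", "hdfs://namenode:8020/spark/spark-1.0.0.tgz")
val sc = new SparkContext(conf)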


Re: Updating docs for running on Mesos

2014-05-11 Thread Andy Konwinski
Thanks for suggesting this and volunteering to do it.

On May 11, 2014 3:32 AM, "Andrew Ash"  wrote:
>
> The docs for how to run Spark on Mesos have changed very little since
> 0.6.0, but setting it up is much easier now than then.  Does it make sense
> to revamp with the below changes?
>
>
> You no longer need to build Mesos yourself, as pre-built versions are
> available from Mesosphere: http://mesosphere.io/downloads/
>
> Also, the instructions guide you towards compiling your own distribution of
> Spark, when you could use the prebuilt versions of Spark instead.
>
>
> I'd like to split that portion of the documentation into two sections, a
> build-from-scratch section and a use-prebuilt section.  The new outline
> would look something like this:
>
>
> *Running Spark on Mesos*
>
> Installing Mesos
> - using prebuilt (recommended)
>  - pointer to mesosphere's packages
> - from scratch
>  - (similar to current)
>
>
> Connecting Spark to Mesos
> - loading distribution into an accessible location
> - Spark settings
>
> Mesos Run Modes
> - (same as current)
>
> Running Alongside Hadoop
> - (trim this down)

What trimming do you have in mind here?

>
>
>
> Does that work for people?
>
>
> Thanks!
> Andrew
>
>
> PS Basically all the same:
>
> http://spark.apache.org/docs/0.6.0/running-on-mesos.html
> http://spark.apache.org/docs/0.6.2/running-on-mesos.html
> http://spark.apache.org/docs/0.7.3/running-on-mesos.html
> http://spark.apache.org/docs/0.8.1/running-on-mesos.html
> http://spark.apache.org/docs/0.9.1/running-on-mesos.html
> https://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/running-on-mesos.html


Re: Spark on Scala 2.11

2014-05-11 Thread Koert Kuipers
I believe Matei has said before that he would like to cross-build for 2.10
and 2.11, given that the difference is not as big as between 2.9 and 2.10,
but I don't know when this would happen...


On Sat, May 10, 2014 at 11:02 PM, Gary Malouf  wrote:

> Considering the team just bumped to 2.10 in 0.9, I would be surprised if
> this is a near-term priority.
>
>
> On Thu, May 8, 2014 at 9:33 PM, Anand Avati  wrote:
>
> > Is there an ongoing effort (or intent) to support Spark on Scala 2.11?
> > Approximate timeline?
> >
> > Thanks
> >
>


Re: Spark on Scala 2.11

2014-05-11 Thread Matei Zaharia
We do want to support it eventually, possibly as early as Spark 1.1 (which we’d
cross-build on Scala 2.10 and 2.11). If someone wants to look at it before then,
feel free to do so! Scala 2.11 is very close to 2.10, so I think things will
mostly work, except possibly the REPL (which has required porting over code
from the Scala REPL in each version).

Matei
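
(A minimal sbt sketch of the cross-build Matei describes; hypothetical, not
Spark's actual build definition. With this in build.sbt, `sbt +compile` and
`sbt +publish` run against each listed Scala version:)

scalaVersion := "2.10.4"

crossScalaVersions := Seq("2.10.4", "2.11.0")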

On May 8, 2014, at 6:33 PM, Anand Avati  wrote:

> Is there an ongoing effort (or intent) to support Spark on Scala 2.11?
> Approximate timeline?
> 
> Thanks



LabeledPoint dump LibSVM if SparseVector

2014-05-11 Thread Debasish Das
Hi,

I need to change the toString on LabeledPoint to libsvm format so that I
can dump an RDD[LabeledPoint] in a format that can be read by sparse
glmnet-R and other packages, to benchmark mllib classification accuracy...

Basically I have to change the toString of LabeledPoint and the toString of
SparseVector.

Should I add this as a PR, or is it already being added?

I added these toLibSvm functions to my internal util class for now...

import org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.regression.LabeledPoint

def toLibSvm(labelPoint: LabeledPoint): String = {
  // label first, then space-separated index:value pairs
  labelPoint.label.toString + " " +
    toLibSvm(labelPoint.features.asInstanceOf[SparseVector])
}

def toLibSvm(features: SparseVector): String = {
  val indices = features.indices
  val values = features.values
  // renders each (index,value) pair as index:value, joined by spaces
  // (note: mllib indices are 0-based; libsvm readers conventionally expect 1-based)
  indices.zip(values).mkString(" ")
    .replace(',', ':').replace("(", "").replace(")", "")
}
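
(Hypothetical usage, assuming points: RDD[LabeledPoint] with sparse features;
the output path is a placeholder:)

points.map(p => toLibSvm(p)).saveAsTextFile("/tmp/train.libsvm")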
Thanks.
Deb