org.apache.spark.util.Vector is deprecated, what next?

2014-04-10 Thread techaddict
org.apache.spark.util.Vector is deprecated, so what should be used instead?
Say I want to create a vector of zeros: there is def zeros(length: Int) in
util.Vector -- how do I do the same with the new mllib.linalg.Vector?
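
For example, something along these lines (a sketch against the new
mllib.linalg API; Vectors.zeros may not exist in every build, in which case
the dense construction below is equivalent):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// A length-10 vector of zeros via the dense factory -- always available:
val z: Vector = Vectors.dense(Array.fill(10)(0.0))

// Some builds also provide a direct factory method:
// val z2: Vector = Vectors.zeros(10)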





Re: minor optimizations to get my feet wet

2014-04-10 Thread Reynold Xin
Thanks for contributing!

I think that unless the feature is gigantic, you can often send a pull
request directly for discussion. One rule of thumb in the Spark code base
is that we typically prefer readability over conciseness, and thus we tend
to avoid using too much Scala magic or operator overloading.

In this specific case, do you know if using -_ instead of reverse improves
performance? I personally find it slightly awkward to use an underscore
right after negation ...
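
For concreteness, the variants in question (made-up pair list; an explicit
Ordering is a third option):

val counts = Seq(("a", 3), ("b", 7), ("c", 5))

// Negation inside sortBy: a single sort, but "-_" reads oddly.
val desc1 = counts.sortBy(-_._2)

// Ascending sort plus reverse: clearer, at the cost of an extra pass.
val desc2 = counts.sortBy(_._2).reverse

// An explicit reversed Ordering avoids both concerns.
val desc3 = counts.sortBy(_._2)(Ordering[Int].reverse)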


The tail change looks good to me.

For foldLeft, I agree with you that the old way is more readable (although
less idiomatic Scala).




On Thu, Apr 10, 2014 at 1:48 PM, Ignacio Zendejas 
ignacio.zendejas...@gmail.com wrote:

 Hi, all -

 First off, I want to say that I love Spark and am very excited about
 MLBase. I'd love to contribute now that I have some time, but before I do
 that I'd like to familiarize myself with the process.

 In looking for a few projects and settling on one which I'll discuss in
 another thread, I found some very minor optimizations I could contribute,
 again, as part of this first step.

 Before I initiate a PR, I've gone ahead and run the style checks, tests,
 etc., per the instructions, but I'd still like to have someone quickly
 glance over it and ensure that these are JIRA-worthy.

 Commit:

 https://github.com/izendejas/spark/commit/81065aed9987c1b08cd5784b7a6153e26f3f7402

 To summarize:

 * I got rid of some SeqLike.reverse calls when sorting in descending order
 * replaced slice(1, length) calls with the much safer (avoids IOOBEs) and
 more readable .tail calls
 * used a foldLeft to avoid using mutable variables in the NaiveBayes code

 This last one is meant to gauge what's valued more: idiomatic Scala or
 readability. I'm personally a fan of foldLefts where applicable, but do
 think they're a bit less readable.
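
 To illustrate the trade-off with a toy example (hypothetical values, not
 the actual NaiveBayes code):

 val values = Seq(1.0, 2.0, 3.0)

 // Mutable-variable version: easy to scan.
 var sum = 0.0
 for (v <- values) sum += v

 // foldLeft version: no mutation, but denser.
 val sum2 = values.foldLeft(0.0)(_ + _)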

 Thanks,
 Ignacio



feature selection and sparse vector support

2014-04-10 Thread Ignacio Zendejas
Hi, again -

As part of the next step, I'd like to make a more substantive contribution
and propose some initial work on feature selection, primarily as it relates
to text classification.

Specifically, I'd like to contribute very straightforward code to perform
information gain feature evaluation. Below is a good primer showing that
information gain is a very good option in many cases. If successful, BNS
(introduced in the paper) would be another approach worth looking into, as
it actually improves the F-score with a smaller feature space.

http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf
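
Concretely, the quantity I have in mind, sketched for a single binary term
feature from raw counts (names are illustrative, not from my commit):

// IG(t) = H(C) - P(t) H(C|t) - P(!t) H(C|!t)
def entropy(counts: Seq[Long]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * math.log(p) / math.log(2)
  }.sum
}

// withTerm(i) / withoutTerm(i): docs of class i containing / lacking the term.
def infoGain(withTerm: Seq[Long], withoutTerm: Seq[Long]): Double = {
  val n = (withTerm.sum + withoutTerm.sum).toDouble
  val classTotals = withTerm.zip(withoutTerm).map { case (a, b) => a + b }
  entropy(classTotals) -
    (withTerm.sum / n) * entropy(withTerm) -
    (withoutTerm.sum / n) * entropy(withoutTerm)
}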

And here's my first cut:
https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8

I don't like that I do two passes to compute the class priors and joint
distributions, so I'll look into using combineByKey as in the NaiveBayes
implementation. Also, this is still untested code, but it gets my ideas
out there, and I think it'd be best to define a FeatureEval trait or whatnot
that helps with ranking and selecting.
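
Roughly what I mean by collapsing it into a single pass (hypothetical
types, loosely modeled on the NaiveBayes aggregation; not my actual code):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// One combineByKey: per label, a document count plus a summed feature vector.
def aggregateByLabel(data: RDD[(Double, Array[Double])])
    : RDD[(Double, (Long, Array[Double]))] = {
  data.combineByKey[(Long, Array[Double])](
    v => (1L, v.clone()),
    (acc, v) => {
      var i = 0
      while (i < v.length) { acc._2(i) += v(i); i += 1 }
      (acc._1 + 1, acc._2)
    },
    (a, b) => {
      var i = 0
      while (i < b._2.length) { a._2(i) += b._2(i); i += 1 }
      (a._1 + b._1, a._2)
    }
  )
}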

I also realize the above methods are probably more suitable for MLI than
MLlib, but there doesn't seem to be much activity on the former.

Second, is there a plan to support sparse vector representations for
NaiveBayes? This will probably be more efficient in, for example, text
classification tasks with lots of features (consider the case where n-grams
with n > 1 are used).

And on a related note, MLUtils.loadLabeledData doesn't support loading
sparse data. Any plans here to do so? There also doesn't seem to be a
defined file format for MLlib. Has there been any consideration of
supporting multiple standard formats rather than defining one: e.g., CSV,
TSV, Weka's ARFF, etc.?

Thanks for your time,
Ignacio


Re: minor optimizations to get my feet wet

2014-04-10 Thread Henry Saputra
Hi Ignacio,

Thank you for your contribution.

Just a friendly reminder: in case you have not contributed to Apache
Software Foundation projects before, please submit the ASF ICLA form [1]
or, if you are sponsored by your company, also ask the company to send a
CCLA [2] to clear the intellectual property for your contributions.

You can ignore the preferred Apache id section for now.


Thank you,

Henry Saputra

[1] https://www.apache.org/licenses/icla.txt
[2] http://www.apache.org/licenses/cla-corporate.txt




Re: minor optimizations to get my feet wet

2014-04-10 Thread Ignacio Zendejas
I don't think there's a noticeable performance hit from the use of reverse in
those cases. It was a quick set of changes, and it helped me understand what
you look for. I didn't intend to nitpick, so I'll leave it as is. I could have
used a scala.Ordering implicitly/explicitly as well, but that seems like
overkill, and I don't want to start a discussion about what's best--unless
one of the admins deems it important.

I'll only keep the use of take and tail over slice, and switch over to
math.min where indicated.

All this after I follow Henry's timely advice--thanks, Henry.

Cheers.






Re: minor optimizations to get my feet wet

2014-04-10 Thread Henry Saputra
You are welcome, thanks again for contributing =)

- Henry




RFC: varargs in Logging.scala?

2014-04-10 Thread Marcelo Vanzin
Hey there,

While going through the code to get the hang of things, I've noticed
several different styles of logging. They all have some downsides
(readability being one of them in certain cases), but all of them
suffer from the fact that the log message needs to be built even
though it might not be used.

I spent some time trying to add varargs support to Logging.scala (also
to learn more about Scala itself), and came up with this:
https://github.com/vanzin/spark/commit/a15c284d4aac3d645b13c0ef157787ba014840e4

The change may look large, but the only interesting changes are in
Logging.scala, I promise.

What do you guys think of this approach?

It should, at worst, be just as fast (or slow) as before for the
majority of cases (i.e., any case where variables were used in the log
message). Personally, I think it reads better.
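
Boiled down, the shape of it is something like this (a simplified sketch
using java.util-style format strings, not the actual patch):

trait VarargsLogging {
  private lazy val log = org.slf4j.LoggerFactory.getLogger(getClass)

  // The format string and args are cheap to pass around; the message is
  // only assembled when the level is actually enabled.
  def logDebug(format: String, args: Any*): Unit = {
    if (log.isDebugEnabled) {
      log.debug(format.format(args: _*))
    }
  }
}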

It might be possible to have something similar using string
interpolation, but I'm not familiar enough with Scala yet to try my
hand at that. Also, I believe that would still require some kind of
formatting when you want to do calculations (e.g. turn a variable
holding milliseconds into seconds in the log message).

If people like it, I'll submit a proper pull request. I've run a few
things using this code, and also the tests (which caught a few type
mismatches in the format strings), and everything looks ok so far.

-- 
Marcelo


Re: RFC: varargs in Logging.scala?

2014-04-10 Thread Michael Armbrust
Hi Marcelo,

Thanks for bringing this up here, as this has been a topic of debate
recently.  Some thoughts below.

... all of them suffer from the fact that the log message needs to be built
 even though it might not be used.


This is not true of the current implementation (and this is actually why
Spark has a logging trait instead of just using a logger directly).

If you look at the original function signatures:

protected def logDebug(msg: => String) ...

The => implies that we are passing msg by name instead of by value.
Under the covers, Scala creates a closure that can be used to calculate
the log message, only if it's actually required. This does result in a
significant performance improvement, but still requires allocating an
object for the closure. The bytecode is really something like this:

val logMessage = new Function0() { def call() = "Log message " +
someExpensiveComputation() }
log.debug(logMessage)


In Catalyst and Spark SQL we are using the scala-logging package, which
uses macros to automatically rewrite all of your log statements.

You write: logger.debug(s"Log message $someExpensiveComputation")

You get:

if (logger.debugEnabled) {
  val logMsg = "Log message " + someExpensiveComputation()
  logger.debug(logMsg)
}

IMHO, this is the cleanest option (and is supported by Typesafe).  Based on
a micro-benchmark, it is also the fastest:

std logging:   19885.48ms
spark logging:   914.408ms
scala logging:   729.779ms

Once the dust settles from the 1.0 release, I'd be in favor of
standardizing on scala-logging.

Michael


Re: RFC: varargs in Logging.scala?

2014-04-10 Thread Michael Armbrust
BTW...

You can do calculations in string interpolation:

s"Time: ${timeMillis / 1000}s"

Or use format strings:

f"Float with two decimal places: $floatValue%.2f"

More info:
http://docs.scala-lang.org/overviews/core/string-interpolation.html





Building Spark AMI

2014-04-10 Thread Jim Ancona
Are there scripts to build the AMI used by the spark-ec2 script?

Alternatively, is there a place to download the AMI? I'm interested in
using it to deploy into an internal OpenStack cloud.

Thanks,

Jim


Re: feature selection and sparse vector support

2014-04-10 Thread Xiangrui Meng
Hi Ignacio,

Please create a JIRA and send a PR for the information gain
computation, so it is easy to track the progress.

The sparse vector support for NaiveBayes is already implemented in
branch-1.0 and master. You only need to provide an RDD of sparse
vectors (created from Vectors.sparse).

MLUtils.loadLibSVMData reads sparse features in LIBSVM format.
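
For example (LabeledPoint is the usual wrapper for the supervised case):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A size-5 sparse vector with nonzeros at indices 1 and 3.
val sv = Vectors.sparse(5, Array(1, 3), Array(0.5, 2.0))
val point = LabeledPoint(1.0, sv)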

Best,
Xiangrui
