org.apache.spark.util.Vector is deprecated -- what next?
org.apache.spark.util.Vector is deprecated, so what should be used instead? Say I want to create a vector of zeros: util.Vector has def zeros(length: Int). How do I do the same with the new mllib.linalg.Vector?
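As a minimal sketch of one way to replace util.Vector.zeros: build a zero-filled Array[Double] and, with Spark on the classpath, wrap it with mllib.linalg.Vectors.dense. Plain Scala is shown here so the snippet stands alone; the Vectors.dense call in the comment assumes Spark's mllib is available.

```scala
// Sketch: util.Vector.zeros(length) replaced by a zero-filled array.
// Plain Scala only, so it runs without a Spark dependency.
object ZerosExample {
  def zeros(length: Int): Array[Double] = Array.fill(length)(0.0)

  // With Spark on the classpath you would wrap the array:
  //   import org.apache.spark.mllib.linalg.Vectors
  //   val v = Vectors.dense(zeros(3))

  def main(args: Array[String]): Unit =
    println(zeros(3).mkString(","))  // 0.0,0.0,0.0
}
```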
Re: minor optimizations to get my feet wet
Thanks for contributing! I think often, unless the feature is gigantic, you can send a pull request directly for discussion. One rule of thumb in the Spark code base is that we typically prefer readability over conciseness, and thus we tend to avoid using too much Scala magic or operator overloading. In this specific case, do you know if using - instead of reverse improves performance? I personally find it slightly awkward to use an underscore right after negation... The tail change looks good to me. For foldLeft, I agree with you that the old way is more readable (although less idiomatic Scala).

On Thu, Apr 10, 2014 at 1:48 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote:

Hi, all - First off, I want to say that I love Spark and am very excited about MLBase. I'd love to contribute now that I have some time, but before I do that I'd like to familiarize myself with the process. In looking for a few projects and settling on one, which I'll discuss in another thread, I found some very minor optimizations I could contribute, again, as part of this first step. Before I initiate a PR, I've gone ahead and tested style, ran tests, etc. per the instructions, but I'd still like to have someone quickly glance over it and ensure that these are JIRA worthy.

Commit: https://github.com/izendejas/spark/commit/81065aed9987c1b08cd5784b7a6153e26f3f7402

To summarize:
* I got rid of some SeqLike.reverse calls when sorting in descending order
* replaced slice(1, length) calls with the much safer (avoids IOOBEs) and more readable .tail calls
* used a foldLeft to avoid using mutable variables in NaiveBayes code

This last one is meant to understand what's valued more between idiomatic Scala development and readability. I'm personally a fan of foldLefts where applicable, but do think they're a bit less readable.

Thanks, Ignacio
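To make the reverse-vs-negation question concrete, a small self-contained sketch of the styles under discussion (the explicit Ordering variant is my addition for comparison, not from the commit):

```scala
// Sorting in descending order: .reverse vs negated key vs explicit
// Ordering. All three produce the same result; the negated-key form
// avoids building the intermediate reversed sequence.
object SortStyles {
  val counts = Seq("a" -> 3, "b" -> 7, "c" -> 5)

  val viaReverse  = counts.sortBy(_._2).reverse
  val viaNegation = counts.sortBy(-_._2)  // the "underscore after negation" form
  val viaOrdering = counts.sortBy(_._2)(Ordering[Int].reverse)
}
```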
feature selection and sparse vector support
Hi, again -

As part of the next step, I'd like to make a more substantive contribution and propose some initial work on feature selection, primarily as it relates to text classification. Specifically, I'd like to contribute very straightforward code to perform information gain feature evaluation. Below is a good primer that shows that information gain is a very good option in many cases. If successful, BNS (introduced in the paper) would be another approach worth looking into, as it actually improves the F-score with a smaller feature space.

http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf

And here's my first cut: https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8

I don't like that I do two passes to compute the class priors and joint distributions, so I'll look into using combineByKey as in the NaiveBayes implementation. Also, this is still untested code, but it gets my ideas out there, and I think it'd be best to define a FeatureEval trait or whatnot that helps with ranking and selecting. I also realize the above methods are probably more suitable for MLI than MLlib, but there doesn't seem to be much activity on the former.

Second, is there a plan to support sparse vector representations for NaiveBayes? This will probably be more efficient in, for example, text classification tasks with lots of features (consider the case where n-grams with n > 1 are used). And on a related note, MLUtils.loadLabeledData doesn't support loading sparse data. Any plans here to do so?

There also doesn't seem to be a defined file format for MLlib. Has there been any consideration to support multiple standard formats, rather than defining one: e.g., csv, tsv, Weka's arff, etc.?

Thanks for your time, Ignacio
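As a rough illustration of the information gain computation being proposed, here is a hypothetical plain-Scala sketch for a single binary feature. The names infoGain and entropy are illustrative only, not taken from the linked commit:

```scala
// Information gain for one binary feature: IG = H(label) - H(label | feature).
// pairs holds (featurePresent, label) observations.
object InfoGain {
  def entropy(counts: Iterable[Int]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)  // entropy in bits
    }.sum
  }

  def infoGain(pairs: Seq[(Boolean, Int)]): Double = {
    val n = pairs.size.toDouble
    // prior entropy of the class labels
    val priorH = entropy(pairs.groupBy(_._2).values.map(_.size))
    // conditional entropy, weighted by feature-value frequency
    val condH = pairs.groupBy(_._1).values.map { grp =>
      (grp.size / n) * entropy(grp.groupBy(_._2).values.map(_.size))
    }.sum
    priorH - condH
  }
}
```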
Re: minor optimizations to get my feet wet
Hi Ignacio,

Thank you for your contribution. Just a friendly reminder: in case you have not contributed to Apache Software Foundation projects before, please submit the ASF ICLA form [1], or, if you are sponsored by your company, also ask the company to send a CCLA [2] to clear the intellectual property for your contributions. You can ignore the preferred Apache id section for now.

Thank you, Henry Saputra

[1] https://www.apache.org/licenses/icla.txt
[2] http://www.apache.org/licenses/cla-corporate.txt

On Thu, Apr 10, 2014 at 1:48 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote: [snip]
Re: minor optimizations to get my feet wet
I don't think there's a noticeable performance hit from the use of reverse in those cases. It was a quick set of changes and it helped me understand what you look for. I didn't intend to nitpick, so I'll leave it as is. I could have used a scala.Ordering implicitly/explicitly also, but that seems overkill and I don't want to necessarily start a discussion about what's best--unless one of the admins deems this important.

I'll only keep the use of take and tail over using slice, and switch over to math.min where indicated. This after I follow Henry's timely advice--thanks, Henry.

cheers.

On Thu, Apr 10, 2014 at 2:10 PM, Reynold Xin r...@databricks.com wrote: [snip]
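A small sketch of the slice/tail/take idioms being discussed, so the trade-off is visible (the example values are mine, not from the commit):

```scala
// .tail states the intent more clearly than slice(1, length), and
// .take(n) already caps at the sequence length, which is where an
// explicit math.min would otherwise appear.
object TailVsSlice {
  val xs = Seq(1, 2, 3)

  val viaSlice = xs.slice(1, xs.length) // Seq(2, 3)
  val viaTail  = xs.tail                // Seq(2, 3), clearer intent
  val topFive  = xs.take(5)             // safe even though xs has only 3 elements
}
```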
Re: minor optimizations to get my feet wet
You are welcome, thanks again for contributing =)

- Henry

On Thu, Apr 10, 2014 at 3:17 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote: [snip]
RFC: varargs in Logging.scala?
Hey there,

While going through the code to get the hang of things, I've noticed several different styles of logging. They all have some downside (readability being one of them in certain cases), but all of them suffer from the fact that the log message needs to be built even though it might not be used.

I spent some time trying to add varargs support to Logging.scala (also to learn more about Scala itself), and came up with this: https://github.com/vanzin/spark/commit/a15c284d4aac3d645b13c0ef157787ba014840e4

The change may look large, but the only interesting changes are in Logging.scala, I promise. What do you guys think of this approach? It should, at worst, be just as fast (or slow) as before for the majority of cases (i.e., any case where variables were used in the log message). Personally, I think it reads better.

It might be possible to have something similar using string interpolation, but I'm not familiar enough with Scala yet to try my hand at that. Also, I believe that would still require some kind of formatting when you want to do calculations (e.g. turn a variable holding milliseconds into seconds in the log message).

If people like it, I'll submit a proper pull request. I've run a few things using this code, and also the tests (which caught a few type mismatches in the format strings), and everything looks OK so far.

-- Marcelo
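A self-contained sketch of the varargs idea (the method shape below is my assumption for illustration, not the actual code in the linked commit): the format string and arguments are cheap to pass, and the formatting work only happens when the level is enabled.

```scala
// Hypothetical varargs logger: formatting is deferred until we know
// the level is enabled. debugEnabled / lastMessage are test scaffolding,
// not part of any real logging API.
object VarargsLogging {
  var debugEnabled = false
  var lastMessage: String = ""

  def logDebug(fmt: String, args: Any*): Unit =
    if (debugEnabled) {
      lastMessage = fmt.format(args: _*) // only formatted on this branch
    }
}
```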
Re: RFC: varargs in Logging.scala?
Hi Marcelo,

Thanks for bringing this up here, as this has been a topic of debate recently. Some thoughts below.

> ... all of them suffer from the fact that the log message needs to be built even though it might not be used.

This is not true of the current implementation (and this is actually why Spark has a logging trait instead of just using a logger directly). If you look at the original function signatures:

    protected def logDebug(msg: => String) ...

The => implies that we are passing msg by name instead of by value. Under the covers, Scala is creating a closure that can be used to calculate the log message, only if it's actually required. This does result in a significant performance improvement, but still requires allocating an object for the closure. The bytecode is really something like this:

    val logMessage = new Function0[String]() {
      def apply() = "Log message " + someExpensiveComputation()
    }
    log.debug(logMessage)

In Catalyst and Spark SQL we are using the scala-logging package, which uses macros to automatically rewrite all of your log statements. You write:

    logger.debug(s"Log message $someExpensiveComputation")

You get:

    if (logger.debugEnabled) {
      val logMsg = "Log message " + someExpensiveComputation()
      logger.debug(logMsg)
    }

IMHO, this is the cleanest option (and is supported by Typesafe). Based on a micro-benchmark, it is also the fastest:

    std logging: 19885.48 ms
    spark logging: 914.408 ms
    scala logging: 729.779 ms

Once the dust settles from the 1.0 release, I'd be in favor of standardizing on scala-logging.

Michael
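A runnable demonstration of the by-name (msg: => String) technique described above; the counter makes visible that the message body only runs when the level is enabled (the object and names here are illustration scaffolding, not Spark's Logging trait):

```scala
// By-name parameter demo: the expression passed as msg is wrapped in a
// closure and only evaluated inside the enabled branch.
object ByNameLogging {
  var computations = 0

  def expensiveMessage(): String = {
    computations += 1 // counts how often the message was actually built
    "state: " + (1 to 3).mkString(",")
  }

  def logDebug(msg: => String, enabled: Boolean): Unit =
    if (enabled) println(msg) // msg's body runs only here
}
```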
Re: RFC: varargs in Logging.scala?
BTW... You can do calculations in string interpolation:

    s"Time: ${timeMillis / 1000}s"

Or use format strings:

    f"Float with two decimal places: $floatValue%.2f"

More info: http://docs.scala-lang.org/overviews/core/string-interpolation.html

On Thu, Apr 10, 2014 at 5:46 PM, Michael Armbrust mich...@databricks.com wrote: [snip]
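Both interpolators from the message above, in runnable form (example values are mine):

```scala
// s"" allows arbitrary expressions inside ${...};
// f"" adds printf-style format specifiers after the variable.
object InterpDemo {
  val timeMillis = 2500L
  val floatValue = 3.14159

  val sExample = s"Time: ${timeMillis / 1000}s"
  val fExample = f"Float with two decimal places: $floatValue%.2f"
}
```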
Building Spark AMI
Are there scripts to build the AMI used by the spark-ec2 script? Alternatively, is there a place to download the AMI? I'm interested in using it to deploy into an internal OpenStack cloud. Thanks, Jim
Re: feature selection and sparse vector support
Hi Ignacio,

Please create a JIRA and send a PR for the information gain computation, so it is easy to track the progress. The sparse vector support for NaiveBayes is already implemented in branch-1.0 and master. You only need to provide an RDD of sparse vectors (created from Vectors.sparse). MLUtils.loadLibSVMData reads sparse features in LIBSVM format.

Best, Xiangrui

On Thu, Apr 10, 2014 at 5:18 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote: [snip]
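For reference, a plain-Scala sketch of the LIBSVM text format mentioned above: each line is "<label> <index>:<value> <index>:<value> ..." (indices are conventionally 1-based in the file). Parsing is shown standalone here; with Spark you would instead call MLUtils.loadLibSVMData, or build vectors yourself with Vectors.sparse(size, indices, values).

```scala
// Minimal LIBSVM line parser: returns the label plus parallel index and
// value arrays, the same shape Vectors.sparse expects.
object LibSVMParse {
  def parseLine(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val (indices, values) = tokens.tail.map { t =>
      val Array(i, v) = t.split(":")
      (i.toInt, v.toDouble)
    }.unzip
    (label, indices, values)
  }
}
```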