Re: Building Spark AMI

2014-04-11 Thread Jim Ancona
Hi,

Right now my use case is setting up a small cluster for
prototyping/evaluation. My hope was that I could use the scripts that
come with Spark to get things up and running quickly. For a production
deployment we would probably roll our own using Puppet.

Jim

On Fri, Apr 11, 2014 at 7:58 PM, Mayur Rustagi  wrote:
> I am creating a fully configured & synced one, but you still need to send
> over the configuration. Do you plan to use Chef for that?
>  On Apr 10, 2014 6:58 PM, "Jim Ancona"  wrote:
>
>> Are there scripts to build the AMI used by the spark-ec2 script?
>>
>> Alternatively, is there a place to download the AMI? I'm interested in
>> using it to deploy into an internal OpenStack cloud.
>>
>> Thanks,
>>
>> Jim
>>


Re: Building Spark AMI

2014-04-11 Thread Mayur Rustagi
I am creating a fully configured & synced one, but you still need to send
over the configuration. Do you plan to use Chef for that?
 On Apr 10, 2014 6:58 PM, "Jim Ancona"  wrote:

> Are there scripts to build the AMI used by the spark-ec2 script?
>
> Alternatively, is there a place to download the AMI? I'm interested in
> using it to deploy into an internal OpenStack cloud.
>
> Thanks,
>
> Jim
>


It seems that Jenkins for PRs is not working

2014-04-11 Thread DB Tsai
I always get
=

Could not find Apache license headers in the following files:
 !? /root/workspace/SparkPullRequestBuilder/python/metastore/db.lck
 !? /root/workspace/SparkPullRequestBuilder/python/metastore/service.properties


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


Re: feature selection and sparse vector support

2014-04-11 Thread Ignacio Zendejas
Here's the JIRA:
https://issues.apache.org/jira/browse/SPARK-1473

Future discussions should take place in its comments section.

Thanks.




On Fri, Apr 11, 2014 at 11:26 AM, Ignacio Zendejas <ignacio.zendejas...@gmail.com> wrote:

> Thanks for the response, Xiangrui.
>
> And sounds good, Héctor. Look forward to working on this together.
>
> A common interface is definitely required.  I'll create a JIRA shortly and
> will explore design options myself to bring ideas to the table.
>
> cheers.
>
>
>
> On Fri, Apr 11, 2014 at 5:44 AM, Héctor Mouriño-Talín wrote:
>
>> Hi,
>>
>> Regarding the implementation of feature selection techniques, I'm
>> implementing some iterative algorithms based on a paper by Gavin Brown et
>> al. [1]. In this paper, the authors propose a common framework for many
>> Information Theory-based criteria, namely those that use relevancy
>> (mutual information between one feature and the label; Information Gain),
>> redundancy, and conditional redundancy. The latter two are interpreted
>> differently depending on the criterion, but all of them work with the
>> mutual information between the feature being analyzed and the already
>> selected ones, and with the same mutual information conditioned on the
>> label.
>>
>> I think we should have a common interface to plug different Feature
>> Selection techniques. I already have the algorithm implemented, but still
>> have to do tests on it. Right now I'm working on the design. Next week I
>> can share with you a proposal, so we can work together to bring Feature
>> Selection to Spark.
>>
>> [1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
>> likelihood maximisation: a unifying framework for information theoretic
>> feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.
>>
>> ---
>> Héctor
>>
>>
>> On Fri, Apr 11, 2014 at 5:20 AM, Xiangrui Meng  wrote:
>>
>> > Hi Ignacio,
>> >
>> > Please create a JIRA and send a PR for the information gain
>> > computation, so it is easy to track the progress.
>> >
>> > The sparse vector support for NaiveBayes is already implemented in
>> > branch-1.0 and master. You only need to provide an RDD of sparse
>> > vectors (created from Vectors.sparse).
>> >
>> > MLUtils.loadLibSVMData reads sparse features in LIBSVM format.
>> >
>> > Best,
>> > Xiangrui
>> >
>> > On Thu, Apr 10, 2014 at 5:18 PM, Ignacio Zendejas
>> >  wrote:
>> > > Hi, again -
>> > >
>> > > As part of the next step, I'd like to make a more substantive
>> > > contribution and propose some initial work on feature selection,
>> > > primarily as it relates to text classification.
>> > >
>> > > Specifically, I'd like to contribute very straightforward code to
>> > > perform information gain feature evaluation. Below's a good primer
>> > > that shows that Information Gain is a very good option in many cases.
>> > > If successful, BNS (introduced in the paper) would be another approach
>> > > worth looking into, as it actually improves the F-score with a smaller
>> > > feature space.
>> > >
>> > > http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf
>> > >
>> > > And here's my first cut:
>> > >
>> > > https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8
>> > >
>> > > I don't like that I do two passes to compute the class priors and
>> > > joint distributions, so I'll look into using combineByKey as in the
>> > > NaiveBayes implementation. Also, this is still untested code, but it
>> > > gets my ideas out there, and I think it'd be best to define a
>> > > FeatureEval trait or whatnot that helps with ranking and selecting.
>> > >
>> > > I also realize the above methods are probably more suitable for MLI
>> > > than MLlib, but there doesn't seem to be much activity on the former.
>> > >
>> > > Second, is there a plan to support sparse vector representations for
>> > > NaiveBayes? This will probably be more efficient in, for example, text
>> > > classification tasks with lots of features (consider the case where
>> > > n-grams with n > 1 are used).
>> > >
>> > > And on a related note, MLUtils.loadLabeledData doesn't support loading
>> > > sparse data. Any plans here to do so? There also doesn't seem to be a
>> > > defined file format for MLlib. Has there been any consideration of
>> > > supporting multiple standard formats rather than defining one, e.g.
>> > > CSV, TSV, Weka's ARFF, etc.?
>> > >
>> > > Thanks for your time,
>> > > Ignacio
>> >
>>
>
>
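
For reference, information gain for one discrete feature reduces to
H(Y) - H(Y | X). Below is a hypothetical sketch over an RDD of
(label, featureValue) pairs; it is not the code in the commit linked
earlier in the thread, just an illustration of the computation under
discussion:

    import org.apache.spark.rdd.RDD

    // Shannon entropy (in bits) of a distribution given as raw counts.
    def entropy(counts: Iterable[Long]): Double = {
      val total = counts.sum.toDouble
      counts.map { c =>
        val p = c / total
        if (p == 0.0) 0.0 else -p * (math.log(p) / math.log(2))
      }.sum
    }

    // IG(Y; X) = H(Y) - H(Y | X) for a single discrete feature X.
    // Counts are collected to the driver, which is fine for a sketch.
    def informationGain(data: RDD[(String, String)]): Double = {
      val n = data.count().toDouble
      val hY = entropy(data.map(_._1).countByValue().values)
      // H(Y | X): entropy of the labels within each feature value,
      // weighted by how often that feature value occurs.
      val hYGivenX = data.countByValue()
        .groupBy { case ((_, value), _) => value }
        .map { case (_, group) =>
          val counts = group.values
          (counts.sum / n) * entropy(counts)
        }.sum
      hY - hYGivenX
    }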


Re: Suggestion

2014-04-11 Thread Sandy Ryza
Hi Priya,

Here's a good place to start:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

-Sandy


On Fri, Apr 11, 2014 at 12:05 PM, priya arora wrote:

> Hi,
>
> May I know how one can contribute to this project
> (http://spark.apache.org/mllib/) or to any other project? I am very eager
> to contribute. Do let me know.
>
> Thanks & Regards,
> Priya Arora
>


Suggestion

2014-04-11 Thread priya arora
Hi,

May I know how one can contribute to this project
(http://spark.apache.org/mllib/) or to any other project? I am very eager
to contribute. Do let me know.

Thanks & Regards,
Priya Arora


Re: feature selection and sparse vector support

2014-04-11 Thread Ignacio Zendejas
Thanks for the response, Xiangrui.

And sounds good, Héctor. Look forward to working on this together.

A common interface is definitely required.  I'll create a JIRA shortly and
will explore design options myself to bring ideas to the table.

cheers.



On Fri, Apr 11, 2014 at 5:44 AM, Héctor Mouriño-Talín wrote:

> Hi,
>
> Regarding the implementation of feature selection techniques, I'm
> implementing some iterative algorithms based on a paper by Gavin Brown et
> al. [1]. In this paper, the authors propose a common framework for many
> Information Theory-based criteria, namely those that use relevancy
> (mutual information between one feature and the label; Information Gain),
> redundancy, and conditional redundancy. The latter two are interpreted
> differently depending on the criterion, but all of them work with the
> mutual information between the feature being analyzed and the already
> selected ones, and with the same mutual information conditioned on the
> label.
>
> I think we should have a common interface to plug different Feature
> Selection techniques. I already have the algorithm implemented, but still
> have to do tests on it. Right now I'm working on the design. Next week I
> can share with you a proposal, so we can work together to bring Feature
> Selection to Spark.
>
> [1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.
>
> ---
> Héctor
>
>
> On Fri, Apr 11, 2014 at 5:20 AM, Xiangrui Meng  wrote:
>
> > Hi Ignacio,
> >
> > Please create a JIRA and send a PR for the information gain
> > computation, so it is easy to track the progress.
> >
> > The sparse vector support for NaiveBayes is already implemented in
> > branch-1.0 and master. You only need to provide an RDD of sparse
> > vectors (created from Vectors.sparse).
> >
> > MLUtils.loadLibSVMData reads sparse features in LIBSVM format.
> >
> > Best,
> > Xiangrui
> >
> > On Thu, Apr 10, 2014 at 5:18 PM, Ignacio Zendejas
> >  wrote:
> > > Hi, again -
> > >
> > > As part of the next step, I'd like to make a more substantive
> > > contribution and propose some initial work on feature selection,
> > > primarily as it relates to text classification.
> > >
> > > Specifically, I'd like to contribute very straightforward code to
> > > perform information gain feature evaluation. Below's a good primer
> > > that shows that Information Gain is a very good option in many cases.
> > > If successful, BNS (introduced in the paper) would be another approach
> > > worth looking into, as it actually improves the F-score with a smaller
> > > feature space.
> > >
> > > http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf
> > >
> > > And here's my first cut:
> > >
> > > https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8
> > >
> > > I don't like that I do two passes to compute the class priors and
> > > joint distributions, so I'll look into using combineByKey as in the
> > > NaiveBayes implementation. Also, this is still untested code, but it
> > > gets my ideas out there, and I think it'd be best to define a
> > > FeatureEval trait or whatnot that helps with ranking and selecting.
> > >
> > > I also realize the above methods are probably more suitable for MLI
> > > than MLlib, but there doesn't seem to be much activity on the former.
> > >
> > > Second, is there a plan to support sparse vector representations for
> > > NaiveBayes? This will probably be more efficient in, for example, text
> > > classification tasks with lots of features (consider the case where
> > > n-grams with n > 1 are used).
> > >
> > > And on a related note, MLUtils.loadLabeledData doesn't support loading
> > > sparse data. Any plans here to do so? There also doesn't seem to be a
> > > defined file format for MLlib. Has there been any consideration of
> > > supporting multiple standard formats rather than defining one, e.g.
> > > CSV, TSV, Weka's ARFF, etc.?
> > >
> > > Thanks for your time,
> > > Ignacio
> >
>


Re: RFC: varargs in Logging.scala?

2014-04-11 Thread David Hall
Another usage that's nice is:

logDebug {
   val timeS = timeMillis/1000.0
   s"Time: $timeS"
}

which can be useful for more complicated expressions.
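
The block form works because logDebug takes its message by name. A
simplified sketch of the shape involved (not Spark's actual Logging
source; the isDebugEnabled stub stands in for the real logger check):

    trait Logging {
      protected def isDebugEnabled: Boolean = false

      // msg is a by-name parameter: the string is built only if
      // debug logging is actually enabled.
      protected def logDebug(msg: => String): Unit = {
        if (isDebugEnabled) println(s"DEBUG: $msg")
      }
    }

Both logDebug(s"Time: $timeS") and the block form above compile down to
the same thing: a closure that logDebug evaluates only when needed.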


On Thu, Apr 10, 2014 at 5:55 PM, Michael Armbrust wrote:

> BTW...
>
> You can do calculations in string interpolation:
> s"Time: ${timeMillis / 1000}s"
>
> Or use format strings.
> f"Float with two decimal places: $floatValue%.2f"
>
> More info:
> http://docs.scala-lang.org/overviews/core/string-interpolation.html
>
>
> On Thu, Apr 10, 2014 at 5:46 PM, Michael Armbrust wrote:
>
> > Hi Marcelo,
> >
> > Thanks for bringing this up here, as this has been a topic of debate
> > recently.  Some thoughts below.
> >
> >> ... all of them suffer from the fact that the log message needs to be
> >> built even though it might not be used.
> >
> > This is not true of the current implementation (and this is actually why
> > Spark has a logging trait instead of just using a logger directly.)
> >
> > If you look at the original function signatures:
> >
> > protected def logDebug(msg: => String) ...
> >
> >
> > The => implies that we are passing the msg by name instead of by value.
> > Under the covers, Scala is creating a closure that can be used to
> > calculate the log message, only if it's actually required. This does
> > result in a significant performance improvement, but still requires
> > allocating an object for the closure. The bytecode is really something
> > like this:
> >
> > val logMessage = new Function0[String] { def apply() = "Log message" +
> >   someExpensiveComputation() }
> > log.debug(logMessage)
> >
> >
> > In Catalyst and Spark SQL we are using the scala-logging package, which
> > uses macros to automatically rewrite all of your log statements.
> >
> > You write: logger.debug(s"Log message $someExpensiveComputation")
> >
> > You get:
> >
> > if(logger.debugEnabled) {
> >   val logMsg = "Log message" + someExpensiveComputation()
> >   logger.debug(logMsg)
> > }
> >
> > IMHO, this is the cleanest option (and is supported by Typesafe).  Based
> > on a micro-benchmark, it is also the fastest:
> >
> > std logging: 19885.48ms
> > spark logging: 914.408ms
> > scala logging: 729.779ms
> >
> > Once the dust settles from the 1.0 release, I'd be in favor of
> > standardizing on scala-logging.
> >
> > Michael
> >
>


Re: RFC: varargs in Logging.scala?

2014-04-11 Thread Marcelo Vanzin
On Thu, Apr 10, 2014 at 5:46 PM, Michael Armbrust
 wrote:
>> ... all of them suffer from the fact that the log message needs to be
>> built even though it might not be used.
>
> This is not true of the current implementation (and this is actually why
> Spark has a logging trait instead of just using a logger directly.)
>
> If you look at the original function signatures:
>
> protected def logDebug(msg: => String) ...
>
>
> The => implies that we are passing the msg by name instead of by value.
> Under the covers, Scala is creating a closure that can be used to calculate
> the log message, only if it's actually required.

Hah. Interesting. Guess it's my noob Scala hat showing off.

I saw the PR about using scala-logging before, but didn't pay too
close attention to it.

Thanks for the info guys!

-- 
Marcelo


Re: feature selection and sparse vector support

2014-04-11 Thread Héctor Mouriño-Talín
Hi,

Regarding the implementation of feature selection techniques, I'm
implementing some iterative algorithms based on a paper by Gavin Brown et
al. [1]. In this paper, the authors propose a common framework for many
Information Theory-based criteria, namely those that use relevancy
(mutual information between one feature and the label; Information Gain),
redundancy, and conditional redundancy. The latter two are interpreted
differently depending on the criterion, but all of them work with the
mutual information between the feature being analyzed and the already
selected ones, and with the same mutual information conditioned on the
label.
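
Concretely, the unifying criterion in [1] takes (roughly, from memory)
the form

    J(X_k) = I(X_k; Y) - beta * sum_{X_j in S} I(X_j; X_k)
                       + gamma * sum_{X_j in S} I(X_j; X_k | Y)

where S is the set of features already selected; particular choices of
beta and gamma recover criteria such as MIM, mRMR, and JMI.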

I think we should have a common interface to plug different Feature
Selection techniques. I already have the algorithm implemented, but still
have to do tests on it. Right now I'm working on the design. Next week I
can share with you a proposal, so we can work together to bring Feature
Selection to Spark.

[1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
likelihood maximisation: a unifying framework for information theoretic
feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.

---
Héctor


On Fri, Apr 11, 2014 at 5:20 AM, Xiangrui Meng  wrote:

> Hi Ignacio,
>
> Please create a JIRA and send a PR for the information gain
> computation, so it is easy to track the progress.
>
> The sparse vector support for NaiveBayes is already implemented in
> branch-1.0 and master. You only need to provide an RDD of sparse
> vectors (created from Vectors.sparse).
>
> MLUtils.loadLibSVMData reads sparse features in LIBSVM format.
>
> Best,
> Xiangrui
>
> On Thu, Apr 10, 2014 at 5:18 PM, Ignacio Zendejas
>  wrote:
> > Hi, again -
> >
> > As part of the next step, I'd like to make a more substantive
> > contribution and propose some initial work on feature selection,
> > primarily as it relates to text classification.
> >
> > Specifically, I'd like to contribute very straightforward code to
> > perform information gain feature evaluation. Below's a good primer that
> > shows that Information Gain is a very good option in many cases. If
> > successful, BNS (introduced in the paper) would be another approach
> > worth looking into, as it actually improves the F-score with a smaller
> > feature space.
> >
> > http://machinelearning.wustl.edu/mlpapers/paper_files/Forman03.pdf
> >
> > And here's my first cut:
> >
> > https://github.com/izendejas/spark/commit/e5a0620838841c99865ffa4fb0d2b449751236a8
> >
> > I don't like that I do two passes to compute the class priors and joint
> > distributions, so I'll look into using combineByKey as in the NaiveBayes
> > implementation. Also, this is still untested code, but it gets my ideas
> > out there, and I think it'd be best to define a FeatureEval trait or
> > whatnot that helps with ranking and selecting.
> >
> > I also realize the above methods are probably more suitable for MLI
> > than MLlib, but there doesn't seem to be much activity on the former.
> >
> > Second, is there a plan to support sparse vector representations for
> > NaiveBayes? This will probably be more efficient in, for example, text
> > classification tasks with lots of features (consider the case where
> > n-grams with n > 1 are used).
> >
> > And on a related note, MLUtils.loadLabeledData doesn't support loading
> > sparse data. Any plans here to do so? There also doesn't seem to be a
> > defined file format for MLlib. Has there been any consideration of
> > supporting multiple standard formats rather than defining one, e.g.
> > CSV, TSV, Weka's ARFF, etc.?
> >
> > Thanks for your time,
> > Ignacio
>
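

A minimal sketch of the sparse-vector path Xiangrui describes, assuming
the branch-1.0-era MLlib API (the data path, dimensions, and values here
are made up):

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils

    val sc = new SparkContext("local", "sparse-nb-example")

    // A labeled point with a sparse feature vector: 10 dimensions,
    // nonzeros at indices 1 and 4.
    val p = LabeledPoint(1.0, Vectors.sparse(10, Array(1, 4), Array(3.0, 7.0)))

    // Or load sparse features in LIBSVM format directly, as suggested.
    val training = MLUtils.loadLibSVMData(sc, "data/sample_libsvm_data.txt")

    val model = NaiveBayes.train(training, lambda = 1.0)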