Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-19 Thread Matei Zaharia
Yes, I agree that we should close down the existing Google group on Jan 1st. 
While it’s more convenient to use, it’s created confusion. I hope that we can 
get the ASF to support better search interfaces in the future too. I think we 
just have to drive this from within.

The new Google Group mirrors should be a nice way to make the content searchable 
from the web. We should also see what it takes to get the lists mirrored on Nabble 
(http://www.nabble.com). I’ve found a lot of information about other projects 
there, and other Apache projects do use it.

Matei

On Dec 19, 2013, at 10:49 PM, Andy Konwinski  wrote:

> I've set up two new unofficial google groups to mirror the Apache Spark user 
> and dev lists:
> 
> https://groups.google.com/forum/#!forum/apache-spark-dev-mirror
> https://groups.google.com/forum/#!forum/apache-spark-user-mirror
> 
> Basically these lists each subscribe to the corresponding Apache list.
> 
> They do not allow folks to subscribe directly to them. Getting emails from 
> the Google Group would offer no advantages that I can think of and we really 
> want to encourage folks to sign up for the official mailing list instead.
> 
> The lists do allow the public to send email to them, which I think might be 
> necessary since the "from:" field for all emails that get distributed via the 
> Apache mailing list is set to the author of the email.
> 
> I think this might be a great compromise. At least we can try this out and 
> see how it goes.
> 
> Matei, can you confirm that Jan 1 is the date we want to turn off the 
> existing spark-users google group?
> 
> We could consider using the existing spark-developers and spark-users google 
> groups instead of the two new ones I just created but I think that it is much 
> more obvious to have the lists include the word mirror in their names.
> 
> The dev list mirror seems to be working, because I see the last couple emails 
> from this thread in it already. I'll confirm and ensure that the user list 
> mirror is working too.
> 
> Thoughts?
> 
> Andy
> 
> P.S. Thanks to Patrick for suggesting this to me originally.
> 
> On Thu, Dec 19, 2013 at 8:46 PM, Aaron Davidson  wrote:
> I'd be fine with one-way mirrors here (Apache threads being reflected in 
> Google groups) -- I have no idea how one is supposed to navigate the Apache 
> list to look for historic threads.
> 
> 
> On Thu, Dec 19, 2013 at 7:58 PM, Mike Potts  wrote:
> Thanks very much for the prompt and comprehensive reply!  I appreciate the 
> overarching desire to integrate with apache: I'm very happy to hear that 
> there's a move to use the existing groups as mirrors: that will overcome all 
> of my objections: particularly if it's bidirectional! :)
> 
> 
> On Thursday, December 19, 2013 7:19:06 PM UTC-8, Andy Konwinski wrote:
> Hey Mike,
> 
> As you probably noticed when you CC'd spark-de...@googlegroups.com, that list 
> has already be reconfigured so that it no longer allows posting (and bounces 
> emails sent to it).
> 
> We will be doing the same thing to the spark...@googlegroups.com list too 
> (we'll announce a date for that soon).
> 
> That may sound very frustrating, and you are *not* alone feeling that way. 
> We've had a long conversation with our mentors about this, and I've felt very 
> similar to you, so I'd like to give you background.
> 
> As I'm coming to see it, part of becoming an Apache project is moving the 
> community *fully* over to Apache infrastructure, and more generally the 
> Apache way of organizing the community.
> 
> This applies in both the nuts-and-bolts sense of being on apache infra, but 
> possibly more importantly, it is also a guiding principle and way of thinking.
> 
> In various ways, moving to apache Infra can be a painful process, and IMO the 
> loss of all the great mailing list functionality that comes with using Google 
> Groups is perhaps the most painful step. But basically, the de facto mailing 
> lists need to be the Apache ones, and not Google Groups. The underlying 
> reason is that Apache needs to take full accountability for recording and 
> publishing the mailing lists, it has to be able to institutionally guarantee 
> this. This is because discussion on mailing lists is one of the core things 
> that defines an Apache community. So at a minimum this means Apache owning 
> the master copy of the bits. 
> 
> All that said, we are discussing the possibility of having a google group 
> that subscribes to each list that would provide an easier to use and prettier 
> archive for each list (so far we haven't gotten that to work).
> 
> I hope this was helpful. It has taken me a few years now, and a lot of 
> conversations with experienced (and patient!) Apache mentors, to internalize 
> some of the nuance about "the Apache way". That's why I wanted to share.
> 
> Andy
> 
> On Thu, Dec 19, 2013 at 6:28 PM, Mike Potts  wrote:
> I notice that there are still a lot of active topics in this group: and also 
> activity on the apache mailing list (which is

Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-19 Thread Andy Konwinski
I've set up two new unofficial google groups to mirror the Apache Spark
user and dev lists:

https://groups.google.com/forum/#!forum/apache-spark-dev-mirror
https://groups.google.com/forum/#!forum/apache-spark-user-mirror

Basically these lists each subscribe to the corresponding Apache list.

They do not allow folks to subscribe directly to them. Getting emails from
the Google Group would offer no advantages that I can think of and we
really want to encourage folks to sign up for the official mailing list
instead.

The lists do allow the public to send email to them, which I think might be
necessary since the "from:" field for all emails that get distributed via
the Apache mailing list is set to the author of the email.

I think this might be a great compromise. At least we can try this out and
see how it goes.

Matei, can you confirm that Jan 1 is the date we want to turn off the
existing spark-users Google group?

We could consider using the existing spark-developers and spark-users
Google groups instead of the two new ones I just created, but I think it is
much more obvious to have the lists include the word "mirror" in their
names.

The dev list mirror seems to be working, because I can already see the last
couple of emails from this thread in it. I'll confirm that the user list
mirror is working too.

Thoughts?

Andy

P.S. Thanks to Patrick for suggesting this to me originally.

On Thu, Dec 19, 2013 at 8:46 PM, Aaron Davidson  wrote:

> I'd be fine with one-way mirrors here (Apache threads being reflected in
> Google groups) -- I have no idea how one is supposed to navigate the Apache
> list to look for historic threads.
>
>
> On Thu, Dec 19, 2013 at 7:58 PM, Mike Potts  wrote:
>
>> Thanks very much for the prompt and comprehensive reply!  I appreciate
>> the overarching desire to integrate with apache: I'm very happy to hear
>> that there's a move to use the existing groups as mirrors: that will
>> overcome all of my objections: particularly if it's bidirectional! :)
>>
>>
>> On Thursday, December 19, 2013 7:19:06 PM UTC-8, Andy Konwinski wrote:
>>
>>> Hey Mike,
>>>
>>> As you probably noticed when you CC'd spark-de...@googlegroups.com,
>>> that list has already be reconfigured so that it no longer allows posting
>>> (and bounces emails sent to it).
>>>
>>> We will be doing the same thing to the spark...@googlegroups.com list
>>> too (we'll announce a date for that soon).
>>>
>>> That may sound very frustrating, and you are *not* alone feeling that
>>> way. We've had a long conversation with our mentors about this, and I've
>>> felt very similar to you, so I'd like to give you background.
>>>
>>> As I'm coming to see it, part of becoming an Apache project is moving
>>> the community *fully* over to Apache infrastructure, and more generally the
>>> Apache way of organizing the community.
>>>
>>> This applies in both the nuts-and-bolts sense of being on apache infra,
>>> but possibly more importantly, it is also a guiding principle and way of
>>> thinking.
>>>
>>> In various ways, moving to apache Infra can be a painful process, and
>>> IMO the loss of all the great mailing list functionality that comes with
>>> using Google Groups is perhaps the most painful step. But basically, the de
>>> facto mailing lists need to be the Apache ones, and not Google Groups. The
>>> underlying reason is that Apache needs to take full accountability for
>>> recording and publishing the mailing lists, it has to be able to
>>> institutionally guarantee this. This is because discussion on mailing lists
>>> is one of the core things that defines an Apache community. So at a minimum
>>> this means Apache owning the master copy of the bits.
>>>
>>> All that said, we are discussing the possibility of having a google
>>> group that subscribes to each list that would provide an easier to use and
>>> prettier archive for each list (so far we haven't gotten that to work).
>>>
>>> I hope this was helpful. It has taken me a few years now, and a lot of
>>> conversations with experienced (and patient!) Apache mentors, to
>>> internalize some of the nuance about "the Apache way". That's why I wanted
>>> to share.
>>>
>>> Andy
>>>
>>> On Thu, Dec 19, 2013 at 6:28 PM, Mike Potts  wrote:
>>>
 I notice that there are still a lot of active topics in this group: and
 also activity on the apache mailing list (which is a really horrible
 experience!).  Is it a firm policy on apache's front to disallow external
 groups?  I'm going to be ramping up on spark, and I really hate the idea of
 having to rely on the apache archives and my mail client.  Also: having to
 search for topics/keywords both in old threads (here) as well as new
 threads in apache's (clunky) archive, is going to be a pain!  I almost feel
 like I must be missing something because the current solution seems
 unfeasibly awkward!

  --
 You received this message because you are subscribed to the Google
 Groups "Spark Use

Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-19 Thread Ted Yu
You may have noticed that the counter of searchable items for the last 7 days on 
search-hadoop.com is 0, and the counter for the last 30 days is declining quickly. 

Cheers

On Dec 19, 2013, at 10:10 PM, Nick Pentreath  wrote:

> One option that is 3rd party that works nicely for the Hadoop project and 
> it's related projects is http://search-hadoop.com - managed by sematext. 
> Perhaps we can plead with Otis to add Spark lists to search-spark.com, or the 
> existing site?
> 
> Just throwing it out there as a potential solution to at least searching and 
> navigating the Apache lists
> 
> Sent from my iPad
> 
>> On 20 Dec 2013, at 6:46 AM, Aaron Davidson  wrote:
>> 
>> I'd be fine with one-way mirrors here (Apache threads being reflected in
>> Google groups) -- I have no idea how one is supposed to navigate the Apache
>> list to look for historic threads.
>> 
>> 
>>> On Thu, Dec 19, 2013 at 7:58 PM, Mike Potts  wrote:
>>> 
>>> Thanks very much for the prompt and comprehensive reply!  I appreciate the
>>> overarching desire to integrate with apache: I'm very happy to hear that
>>> there's a move to use the existing groups as mirrors: that will overcome
>>> all of my objections: particularly if it's bidirectional! :)
>>> 
>>> 
 On Thursday, December 19, 2013 7:19:06 PM UTC-8, Andy Konwinski wrote:
 
 Hey Mike,
 
 As you probably noticed when you CC'd spark-de...@googlegroups.com, that
 list has already be reconfigured so that it no longer allows posting (and
 bounces emails sent to it).
 
 We will be doing the same thing to the spark...@googlegroups.com list
 too (we'll announce a date for that soon).
 
 That may sound very frustrating, and you are *not* alone feeling that
 way. We've had a long conversation with our mentors about this, and I've
 felt very similar to you, so I'd like to give you background.
 
 As I'm coming to see it, part of becoming an Apache project is moving the
 community *fully* over to Apache infrastructure, and more generally the
 Apache way of organizing the community.
 
 This applies in both the nuts-and-bolts sense of being on apache infra,
 but possibly more importantly, it is also a guiding principle and way of
 thinking.
 
 In various ways, moving to apache Infra can be a painful process, and IMO
 the loss of all the great mailing list functionality that comes with using
 Google Groups is perhaps the most painful step. But basically, the de facto
 mailing lists need to be the Apache ones, and not Google Groups. The
 underlying reason is that Apache needs to take full accountability for
 recording and publishing the mailing lists, it has to be able to
 institutionally guarantee this. This is because discussion on mailing lists
 is one of the core things that defines an Apache community. So at a minimum
 this means Apache owning the master copy of the bits.
 
 All that said, we are discussing the possibility of having a google group
 that subscribes to each list that would provide an easier to use and
 prettier archive for each list (so far we haven't gotten that to work).
 
 I hope this was helpful. It has taken me a few years now, and a lot of
 conversations with experienced (and patient!) Apache mentors, to
 internalize some of the nuance about "the Apache way". That's why I wanted
 to share.
 
 Andy
 
> On Thu, Dec 19, 2013 at 6:28 PM, Mike Potts  wrote:
> 
> I notice that there are still a lot of active topics in this group: and
> also activity on the apache mailing list (which is a really horrible
> experience!).  Is it a firm policy on apache's front to disallow external
> groups?  I'm going to be ramping up on spark, and I really hate the idea 
> of
> having to rely on the apache archives and my mail client.  Also: having to
> search for topics/keywords both in old threads (here) as well as new
> threads in apache's (clunky) archive, is going to be a pain!  I almost 
> feel
> like I must be missing something because the current solution seems
> unfeasibly awkward!
> 
> --
> You received this message because you are subscribed to the Google
> Groups "Spark Users" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to spark-users...@googlegroups.com.
> 
> For more options, visit https://groups.google.com/groups/opt_out.
 
 


Re: Spark development for undergraduate project

2013-12-19 Thread Nick Pentreath
A few other options would be:
1. Add another recommendation model based on mrec's SGD-based model: 
https://github.com/mendeley/mrec
2. Look at the streaming k-means from Mahout and see if it might be 
integrated or adapted into MLlib
3. Work on adding to or refactoring the existing linear model framework, for 
example adaptive learning rate schedules or the adaptive-norm work from John 
Langford et al. (a rough sketch of that idea follows below)
4. Add sparse vector/matrix support to MLlib?
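
A minimal sketch of the per-coordinate adaptive learning-rate idea behind
option 3 (AdaGrad-style, in the spirit of the Langford et al. line of work).
The object and method names are illustrative only and are not MLlib code:

// Illustrative sketch, not MLlib code: each weight's effective step size
// shrinks as that coordinate accumulates squared gradient mass.
object AdaptiveStepSketch {
  def update(
      weights: Array[Double],
      gradient: Array[Double],
      sumSqGrad: Array[Double],   // running sum of squared gradients
      baseStep: Double = 0.1,
      eps: Double = 1e-8): Unit = {
    var i = 0
    while (i < weights.length) {
      sumSqGrad(i) += gradient(i) * gradient(i)
      weights(i) -= (baseStep / (math.sqrt(sumSqGrad(i)) + eps)) * gradient(i)
      i += 1
    }
  }
}

The actual project work would be wiring a rule like this into MLlib's existing
gradient-descent optimizer rather than keeping it as a standalone object.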

Sent from my iPad

> On 20 Dec 2013, at 3:46 AM, Tathagata Das  wrote:
> 
> +1 to that (assuming by 'online' Andrew meant MLLib algorithm from Spark
> Streaming)
> 
> Something you can look into is implementing a streaming KMeans. Maybe you
> can re-use a lot of the offline KMeans code in MLLib.
> 
> TD
> 
> 
>> On Thu, Dec 19, 2013 at 5:33 PM, Andrew Ash  wrote:
>> 
>> Sounds like a great choice.  It would be particularly impressive if you
>> could add the first online learning algorithm (all the current ones are
>> offline I believe) to pave the way for future contributions.
>> 
>> 
>> On Thu, Dec 19, 2013 at 8:27 PM, Matthew Cheah 
>> wrote:
>> 
>>> Thanks a lot everyone! I'm looking into adding an algorithm to MLib for
>> the
>>> project. Nice and self-contained.
>>> 
>>> -Matt Cheah
>>> 
>>> 
>>> On Thu, Dec 19, 2013 at 12:52 PM, Christopher Nguyen 
>>> wrote:
>>> 
 +1 to most of Andrew's suggestions here, and while we're in that
 neighborhood, how about generalizing something like "wtf-spark" (from
>> the
 Bizo team (http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may not be of
>> high
 academic interest, but it's something people would use many times a
 debugging day.
 
 Or am I behind and something like that is already there in 0.8?
 
 --
 Christopher T. Nguyen
 Co-founder & CEO, Adatao 
 linkedin.com/in/ctnguyen
 
 
 
> On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash 
 wrote:
 
> I think there are also some improvements that could be made to
> deployability in an enterprise setting.  From my experience:
> 
> 1. Most places I deploy Spark in don't have internet access.  So I
>>> can't
> build from source, compile against a different version of Hadoop, etc
> without doing it locally and then getting that onto my servers
>>> manually.
> This is less a problem with Spark now that there are binary
 distributions,
> but it's still a problem for using Mesos with Spark.
> 2. Configuration of Spark is confusing -- you can make configuration
>> in
> Java system properties, environment variables, command line
>> parameters,
 and
> for the standalone cluster deployment mode you need to worry about
 whether
> these need to be set on the master, the worker, the executor, or the
> application/driver program.  Also because spark-shell automatically
> instantiates a SparkContext you have to set up any system properties
>> in
 the
> init scripts or on the command line with
> JAVA_OPTS="-Dspark.executor.memory=8g" etc.  I'm not sure what needs
>> to
 be
> done, but it feels that there are gains to be made in configuration
 options
> here.  Ideally, I would have one configuration file that can be used
>> in
 all
> 4 places and that's the only place to make configuration changes.
> 3. Standalone cluster mode could use improved resiliency for
>> starting,
> stopping, and keeping alive a service -- there are custom init
>> scripts
 that
> call each other in a mess of ways: spark-shell, spark-daemon.sh,
> spark-daemons.sh, spark-config.sh, spark-env.sh,
>> compute-classpath.sh,
> spark-executor, spark-class, run-example, and several others in the
>>> bin/
> directory.  I would love it if Spark used the Tanuki Service Wrapper,
 which
> is widely-used for Java service daemons, supports retries,
>> installation
 as
> init scripts that can be chkconfig'd, etc.  Let's not re-solve the
>> "how
 do
> I keep a service running?" problem when it's been done so well by
>>> Tanuki
 --
> we use it at my day job for all our services, plus it's used by
> Elasticsearch.  This would help solve the problem where a quick
>> bounce
>>> of
> the master causes all the workers to self-destruct.
> 4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this
>>> is
> entirely an Akka bug based on previous mailing list discussion with
 Matei,
> but it'd be awesome if you could use either the hostname or the FQDN
>> or
 the
> IP address in the Spark URL and not have Akka barf at you.
> 
> I've been telling myself I'd look into these at some point but just
 haven't
> gotten around to them myself yet.  Some day!  I would prioritize
>> these
> requests from most- to least-important as 3, 2, 4, 1.
> 
> Andrew
> 
> 
> On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <
 nick.pentre...@gmail

Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-19 Thread Nick Pentreath
One third-party option that works nicely for the Hadoop project and its 
related projects is http://search-hadoop.com, managed by Sematext. Perhaps we 
can plead with Otis to add the Spark lists to a search-spark.com, or to the 
existing site?

Just throwing it out there as a potential solution for at least searching and 
navigating the Apache lists.

Sent from my iPad

> On 20 Dec 2013, at 6:46 AM, Aaron Davidson  wrote:
> 
> I'd be fine with one-way mirrors here (Apache threads being reflected in
> Google groups) -- I have no idea how one is supposed to navigate the Apache
> list to look for historic threads.
> 
> 
>> On Thu, Dec 19, 2013 at 7:58 PM, Mike Potts  wrote:
>> 
>> Thanks very much for the prompt and comprehensive reply!  I appreciate the
>> overarching desire to integrate with apache: I'm very happy to hear that
>> there's a move to use the existing groups as mirrors: that will overcome
>> all of my objections: particularly if it's bidirectional! :)
>> 
>> 
>>> On Thursday, December 19, 2013 7:19:06 PM UTC-8, Andy Konwinski wrote:
>>> 
>>> Hey Mike,
>>> 
>>> As you probably noticed when you CC'd spark-de...@googlegroups.com, that
>>> list has already be reconfigured so that it no longer allows posting (and
>>> bounces emails sent to it).
>>> 
>>> We will be doing the same thing to the spark...@googlegroups.com list
>>> too (we'll announce a date for that soon).
>>> 
>>> That may sound very frustrating, and you are *not* alone feeling that
>>> way. We've had a long conversation with our mentors about this, and I've
>>> felt very similar to you, so I'd like to give you background.
>>> 
>>> As I'm coming to see it, part of becoming an Apache project is moving the
>>> community *fully* over to Apache infrastructure, and more generally the
>>> Apache way of organizing the community.
>>> 
>>> This applies in both the nuts-and-bolts sense of being on apache infra,
>>> but possibly more importantly, it is also a guiding principle and way of
>>> thinking.
>>> 
>>> In various ways, moving to apache Infra can be a painful process, and IMO
>>> the loss of all the great mailing list functionality that comes with using
>>> Google Groups is perhaps the most painful step. But basically, the de facto
>>> mailing lists need to be the Apache ones, and not Google Groups. The
>>> underlying reason is that Apache needs to take full accountability for
>>> recording and publishing the mailing lists, it has to be able to
>>> institutionally guarantee this. This is because discussion on mailing lists
>>> is one of the core things that defines an Apache community. So at a minimum
>>> this means Apache owning the master copy of the bits.
>>> 
>>> All that said, we are discussing the possibility of having a google group
>>> that subscribes to each list that would provide an easier to use and
>>> prettier archive for each list (so far we haven't gotten that to work).
>>> 
>>> I hope this was helpful. It has taken me a few years now, and a lot of
>>> conversations with experienced (and patient!) Apache mentors, to
>>> internalize some of the nuance about "the Apache way". That's why I wanted
>>> to share.
>>> 
>>> Andy
>>> 
 On Thu, Dec 19, 2013 at 6:28 PM, Mike Potts  wrote:
 
 I notice that there are still a lot of active topics in this group: and
 also activity on the apache mailing list (which is a really horrible
 experience!).  Is it a firm policy on apache's front to disallow external
 groups?  I'm going to be ramping up on spark, and I really hate the idea of
 having to rely on the apache archives and my mail client.  Also: having to
 search for topics/keywords both in old threads (here) as well as new
 threads in apache's (clunky) archive, is going to be a pain!  I almost feel
 like I must be missing something because the current solution seems
 unfeasibly awkward!
 
 --
 You received this message because you are subscribed to the Google
 Groups "Spark Users" group.
 To unsubscribe from this group and stop receiving emails from it, send
 an email to spark-users...@googlegroups.com.
 
 For more options, visit https://groups.google.com/groups/opt_out.
>>> 
>>> 


Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-19 Thread Aaron Davidson
I'd be fine with one-way mirrors here (Apache threads being reflected in
Google groups) -- I have no idea how one is supposed to navigate the Apache
list to look for historic threads.


On Thu, Dec 19, 2013 at 7:58 PM, Mike Potts  wrote:

> Thanks very much for the prompt and comprehensive reply!  I appreciate the
> overarching desire to integrate with apache: I'm very happy to hear that
> there's a move to use the existing groups as mirrors: that will overcome
> all of my objections: particularly if it's bidirectional! :)
>
>
> On Thursday, December 19, 2013 7:19:06 PM UTC-8, Andy Konwinski wrote:
>
>> Hey Mike,
>>
>> As you probably noticed when you CC'd spark-de...@googlegroups.com, that
>> list has already be reconfigured so that it no longer allows posting (and
>> bounces emails sent to it).
>>
>> We will be doing the same thing to the spark...@googlegroups.com list
>> too (we'll announce a date for that soon).
>>
>> That may sound very frustrating, and you are *not* alone feeling that
>> way. We've had a long conversation with our mentors about this, and I've
>> felt very similar to you, so I'd like to give you background.
>>
>> As I'm coming to see it, part of becoming an Apache project is moving the
>> community *fully* over to Apache infrastructure, and more generally the
>> Apache way of organizing the community.
>>
>> This applies in both the nuts-and-bolts sense of being on apache infra,
>> but possibly more importantly, it is also a guiding principle and way of
>> thinking.
>>
>> In various ways, moving to apache Infra can be a painful process, and IMO
>> the loss of all the great mailing list functionality that comes with using
>> Google Groups is perhaps the most painful step. But basically, the de facto
>> mailing lists need to be the Apache ones, and not Google Groups. The
>> underlying reason is that Apache needs to take full accountability for
>> recording and publishing the mailing lists, it has to be able to
>> institutionally guarantee this. This is because discussion on mailing lists
>> is one of the core things that defines an Apache community. So at a minimum
>> this means Apache owning the master copy of the bits.
>>
>> All that said, we are discussing the possibility of having a google group
>> that subscribes to each list that would provide an easier to use and
>> prettier archive for each list (so far we haven't gotten that to work).
>>
>> I hope this was helpful. It has taken me a few years now, and a lot of
>> conversations with experienced (and patient!) Apache mentors, to
>> internalize some of the nuance about "the Apache way". That's why I wanted
>> to share.
>>
>> Andy
>>
>> On Thu, Dec 19, 2013 at 6:28 PM, Mike Potts  wrote:
>>
>>> I notice that there are still a lot of active topics in this group: and
>>> also activity on the apache mailing list (which is a really horrible
>>> experience!).  Is it a firm policy on apache's front to disallow external
>>> groups?  I'm going to be ramping up on spark, and I really hate the idea of
>>> having to rely on the apache archives and my mail client.  Also: having to
>>> search for topics/keywords both in old threads (here) as well as new
>>> threads in apache's (clunky) archive, is going to be a pain!  I almost feel
>>> like I must be missing something because the current solution seems
>>> unfeasibly awkward!
>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "Spark Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to spark-users...@googlegroups.com.
>>>
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>
>>


Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-19 Thread Mike Potts
Thanks very much for the prompt and comprehensive reply!  I appreciate the 
overarching desire to integrate with Apache. I'm very happy to hear that 
there's a move to use the existing groups as mirrors; that will overcome 
all of my objections, particularly if it's bidirectional! :)

On Thursday, December 19, 2013 7:19:06 PM UTC-8, Andy Konwinski wrote:
>
> Hey Mike,
>
> As you probably noticed when you CC'd 
> spark-de...@googlegroups.com, 
> that list has already be reconfigured so that it no longer allows posting 
> (and bounces emails sent to it).
>
> We will be doing the same thing to the 
> spark...@googlegroups.com list too (we'll announce a date for 
> that soon).
>
> That may sound very frustrating, and you are *not* alone feeling that way. 
> We've had a long conversation with our mentors about this, and I've felt 
> very similar to you, so I'd like to give you background.
>
> As I'm coming to see it, part of becoming an Apache project is moving the 
> community *fully* over to Apache infrastructure, and more generally the 
> Apache way of organizing the community.
>
> This applies in both the nuts-and-bolts sense of being on apache infra, 
> but possibly more importantly, it is also a guiding principle and way of 
> thinking.
>
> In various ways, moving to apache Infra can be a painful process, and IMO 
> the loss of all the great mailing list functionality that comes with using 
> Google Groups is perhaps the most painful step. But basically, the de facto 
> mailing lists need to be the Apache ones, and not Google Groups. The 
> underlying reason is that Apache needs to take full accountability for 
> recording and publishing the mailing lists, it has to be able to 
> institutionally guarantee this. This is because discussion on mailing lists 
> is one of the core things that defines an Apache community. So at a minimum 
> this means Apache owning the master copy of the bits. 
>
> All that said, we are discussing the possibility of having a google group 
> that subscribes to each list that would provide an easier to use and 
> prettier archive for each list (so far we haven't gotten that to work).
>
> I hope this was helpful. It has taken me a few years now, and a lot of 
> conversations with experienced (and patient!) Apache mentors, to 
> internalize some of the nuance about "the Apache way". That's why I wanted 
> to share.
>
> Andy
>
> On Thu, Dec 19, 2013 at 6:28 PM, Mike Potts 
> > wrote:
>
>> I notice that there are still a lot of active topics in this group: and 
>> also activity on the apache mailing list (which is a really horrible 
>> experience!).  Is it a firm policy on apache's front to disallow external 
>> groups?  I'm going to be ramping up on spark, and I really hate the idea of 
>> having to rely on the apache archives and my mail client.  Also: having to 
>> search for topics/keywords both in old threads (here) as well as new 
>> threads in apache's (clunky) archive, is going to be a pain!  I almost feel 
>> like I must be missing something because the current solution seems 
>> unfeasibly awkward!
>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Spark Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to spark-users...@googlegroups.com .
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>
>

Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-19 Thread Andy Konwinski
Hey Mike,

As you probably noticed when you CC'd spark-develop...@googlegroups.com,
that list has already been reconfigured so that it no longer allows posting
(and bounces emails sent to it).

We will be doing the same thing to the spark-us...@googlegroups.com list
too (we'll announce a date for that soon).

That may sound very frustrating, and you are *not* alone in feeling that way.
We've had a long conversation with our mentors about this, and I've felt
very similarly to you, so I'd like to give you some background.

As I'm coming to see it, part of becoming an Apache project is moving the
community *fully* over to Apache infrastructure, and more generally the
Apache way of organizing the community.

This applies in the nuts-and-bolts sense of being on Apache infra, but
possibly more importantly, it is also a guiding principle and way of
thinking.

In various ways, moving to Apache infra can be a painful process, and IMO
the loss of all the great mailing list functionality that comes with using
Google Groups is perhaps the most painful step. But basically, the de facto
mailing lists need to be the Apache ones, not Google Groups. The
underlying reason is that Apache needs to take full accountability for
recording and publishing the mailing lists; it has to be able to
guarantee this institutionally. This is because discussion on mailing lists
is one of the core things that defines an Apache community. So at a minimum
this means Apache owning the master copy of the bits.

All that said, we are discussing the possibility of having a google group
that subscribes to each list that would provide an easier to use and
prettier archive for each list (so far we haven't gotten that to work).

I hope this was helpful. It has taken me a few years now, and a lot of
conversations with experienced (and patient!) Apache mentors, to
internalize some of the nuance about "the Apache way". That's why I wanted
to share.

Andy

On Thu, Dec 19, 2013 at 6:28 PM, Mike Potts  wrote:

> I notice that there are still a lot of active topics in this group: and
> also activity on the apache mailing list (which is a really horrible
> experience!).  Is it a firm policy on apache's front to disallow external
> groups?  I'm going to be ramping up on spark, and I really hate the idea of
> having to rely on the apache archives and my mail client.  Also: having to
> search for topics/keywords both in old threads (here) as well as new
> threads in apache's (clunky) archive, is going to be a pain!  I almost feel
> like I must be missing something because the current solution seems
> unfeasibly awkward!
>
>  --
> You received this message because you are subscribed to the Google Groups
> "Spark Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to spark-users+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>


Re: Spark 0.8.1 Released

2013-12-19 Thread Matei Zaharia
Thanks Patrick for coordinating this release!

Matei

On Dec 19, 2013, at 5:15 PM, Patrick Wendell  wrote:

> Hi everyone,
> 
> We've just posted Spark 0.8.1, a new maintenance release that contains
> some bug fixes and improvements to the 0.8 branch. The full release
> notes are available at [1]. Apart from various bug fixes, 0.8.1
> includes support for YARN 2.2, a high availability mode for the
> standalone scheduler, and optimizations to the shuffle. We recommend
> that current users update to this release. You can grab the release at
> [2].
> 
> [1] http://spark.incubator.apache.org/releases/spark-release-0-8-1.html
> [2] http://spark.incubator.apache.org/downloads
> 
> Thanks to the following people who contributed to this release:
> 
> Michael Armbrust, Pierre Borckmans, Evan Chan, Ewen Cheslack, Mosharaf
> Chowdhury, Frank Dai, Aaron Davidson, Tathagata Das, Ankur Dave,
> Harvey Feng, Ali Ghodsi, Thomas Graves, Li Guoqiang, Stephen Haberman,
> Haidar Hadi, Nathan Howell, Holden Karau, Du Li, Raymond Liu, Xi Liu,
> David McCauley, Michael (wannabeast), Fabrizio Milo, Mridul
> Muralidharan, Sundeep Narravula, Kay Ousterhout, Nick Pentreath, Imran
> Rashid, Ahir Reddy, Josh Rosen, Henry Saputra, Jerry Shao, Mingfei
> Shi, Andre Schumacher, Karthik Tunga, Patrick Wendell, Neal Wiggins,
> Andrew Xia, Reynold Xin, Matei Zaharia, and Wu Zeming
> 
> - Patrick



Re: Spark development for undergraduate project

2013-12-19 Thread Tathagata Das
+1 to that (assuming by 'online' Andrew meant an MLlib algorithm driven from
Spark Streaming).

Something you can look into is implementing a streaming k-means. Maybe you
can re-use a lot of the offline KMeans code in MLlib.

TD
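
To make that suggestion concrete, here is a minimal, illustrative sketch of the
usual sequential (streaming) k-means update: each incoming point pulls its
nearest center toward it with a step size that decays as that center absorbs
more points. Names and structure are placeholders, not MLlib code:

// Illustrative sketch, not MLlib code: sequential k-means over mini-batches.
object StreamingKMeansSketch {
  final case class Model(centers: Array[Array[Double]], counts: Array[Long])

  def update(model: Model, batch: Seq[Array[Double]]): Model = {
    for (point <- batch) {
      // Find the nearest center, bump its count, and move it toward the point.
      val j = model.centers.indices.minBy(i => squaredDistance(model.centers(i), point))
      model.counts(j) += 1
      val step = 1.0 / model.counts(j)          // decaying step size per center
      val center = model.centers(j)
      for (d <- center.indices) center(d) += step * (point(d) - center(d))
    }
    model
  }

  private def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { val diff = a(i) - b(i); s += diff * diff; i += 1 }
    s
  }
}

Hooked up to Spark Streaming, update would be called once per batch, which is
where re-using the offline KMeans code (for example, to initialize the centers)
would come in.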


On Thu, Dec 19, 2013 at 5:33 PM, Andrew Ash  wrote:

> Sounds like a great choice.  It would be particularly impressive if you
> could add the first online learning algorithm (all the current ones are
> offline I believe) to pave the way for future contributions.
>
>
> On Thu, Dec 19, 2013 at 8:27 PM, Matthew Cheah 
> wrote:
>
> > Thanks a lot everyone! I'm looking into adding an algorithm to MLib for
> the
> > project. Nice and self-contained.
> >
> > -Matt Cheah
> >
> >
> > On Thu, Dec 19, 2013 at 12:52 PM, Christopher Nguyen 
> > wrote:
> >
> > > +1 to most of Andrew's suggestions here, and while we're in that
> > > neighborhood, how about generalizing something like "wtf-spark" (from
> the
> > > Bizo team (http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may not be of
> high
> > > academic interest, but it's something people would use many times a
> > > debugging day.
> > >
> > > Or am I behind and something like that is already there in 0.8?
> > >
> > > --
> > > Christopher T. Nguyen
> > > Co-founder & CEO, Adatao 
> > > linkedin.com/in/ctnguyen
> > >
> > >
> > >
> > > On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash 
> > wrote:
> > >
> > > > I think there are also some improvements that could be made to
> > > > deployability in an enterprise setting.  From my experience:
> > > >
> > > > 1. Most places I deploy Spark in don't have internet access.  So I
> > can't
> > > > build from source, compile against a different version of Hadoop, etc
> > > > without doing it locally and then getting that onto my servers
> > manually.
> > > >  This is less a problem with Spark now that there are binary
> > > distributions,
> > > > but it's still a problem for using Mesos with Spark.
> > > > 2. Configuration of Spark is confusing -- you can make configuration
> in
> > > > Java system properties, environment variables, command line
> parameters,
> > > and
> > > > for the standalone cluster deployment mode you need to worry about
> > > whether
> > > > these need to be set on the master, the worker, the executor, or the
> > > > application/driver program.  Also because spark-shell automatically
> > > > instantiates a SparkContext you have to set up any system properties
> in
> > > the
> > > > init scripts or on the command line with
> > > > JAVA_OPTS="-Dspark.executor.memory=8g" etc.  I'm not sure what needs
> to
> > > be
> > > > done, but it feels that there are gains to be made in configuration
> > > options
> > > > here.  Ideally, I would have one configuration file that can be used
> in
> > > all
> > > > 4 places and that's the only place to make configuration changes.
> > > > 3. Standalone cluster mode could use improved resiliency for
> starting,
> > > > stopping, and keeping alive a service -- there are custom init
> scripts
> > > that
> > > > call each other in a mess of ways: spark-shell, spark-daemon.sh,
> > > > spark-daemons.sh, spark-config.sh, spark-env.sh,
> compute-classpath.sh,
> > > > spark-executor, spark-class, run-example, and several others in the
> > bin/
> > > > directory.  I would love it if Spark used the Tanuki Service Wrapper,
> > > which
> > > > is widely-used for Java service daemons, supports retries,
> installation
> > > as
> > > > init scripts that can be chkconfig'd, etc.  Let's not re-solve the
> "how
> > > do
> > > > I keep a service running?" problem when it's been done so well by
> > Tanuki
> > > --
> > > > we use it at my day job for all our services, plus it's used by
> > > > Elasticsearch.  This would help solve the problem where a quick
> bounce
> > of
> > > > the master causes all the workers to self-destruct.
> > > > 4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this
> > is
> > > > entirely an Akka bug based on previous mailing list discussion with
> > > Matei,
> > > > but it'd be awesome if you could use either the hostname or the FQDN
> or
> > > the
> > > > IP address in the Spark URL and not have Akka barf at you.
> > > >
> > > > I've been telling myself I'd look into these at some point but just
> > > haven't
> > > > gotten around to them myself yet.  Some day!  I would prioritize
> these
> > > > requests from most- to least-important as 3, 2, 4, 1.
> > > >
> > > > Andrew
> > > >
> > > >
> > > > On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <
> > > nick.pentre...@gmail.com
> > > > >wrote:
> > > >
> > > > > Or if you're extremely ambitious work in implementing Spark
> Streaming
> > > in
> > > > > Python—
> > > > > Sent from Mailbox for iPhone
> > > > >
> > > > > On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <
> > > matei.zaha...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Matt,
> > > > > > If you want to get started looking at Spark, I recommend the
> > > following
> > > > > resources:
> > > > > > - Our issue tracker a

Re: Spark development for undergraduate project

2013-12-19 Thread Andrew Ash
Sounds like a great choice.  It would be particularly impressive if you
could add the first online learning algorithm (all the current ones are
offline, I believe) to pave the way for future contributions.


On Thu, Dec 19, 2013 at 8:27 PM, Matthew Cheah  wrote:

> Thanks a lot everyone! I'm looking into adding an algorithm to MLib for the
> project. Nice and self-contained.
>
> -Matt Cheah
>
>
> On Thu, Dec 19, 2013 at 12:52 PM, Christopher Nguyen 
> wrote:
>
> > +1 to most of Andrew's suggestions here, and while we're in that
> > neighborhood, how about generalizing something like "wtf-spark" (from the
> > Bizo team (http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may not be of high
> > academic interest, but it's something people would use many times a
> > debugging day.
> >
> > Or am I behind and something like that is already there in 0.8?
> >
> > --
> > Christopher T. Nguyen
> > Co-founder & CEO, Adatao 
> > linkedin.com/in/ctnguyen
> >
> >
> >
> > On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash 
> wrote:
> >
> > > I think there are also some improvements that could be made to
> > > deployability in an enterprise setting.  From my experience:
> > >
> > > 1. Most places I deploy Spark in don't have internet access.  So I
> can't
> > > build from source, compile against a different version of Hadoop, etc
> > > without doing it locally and then getting that onto my servers
> manually.
> > >  This is less a problem with Spark now that there are binary
> > distributions,
> > > but it's still a problem for using Mesos with Spark.
> > > 2. Configuration of Spark is confusing -- you can make configuration in
> > > Java system properties, environment variables, command line parameters,
> > and
> > > for the standalone cluster deployment mode you need to worry about
> > whether
> > > these need to be set on the master, the worker, the executor, or the
> > > application/driver program.  Also because spark-shell automatically
> > > instantiates a SparkContext you have to set up any system properties in
> > the
> > > init scripts or on the command line with
> > > JAVA_OPTS="-Dspark.executor.memory=8g" etc.  I'm not sure what needs to
> > be
> > > done, but it feels that there are gains to be made in configuration
> > options
> > > here.  Ideally, I would have one configuration file that can be used in
> > all
> > > 4 places and that's the only place to make configuration changes.
> > > 3. Standalone cluster mode could use improved resiliency for starting,
> > > stopping, and keeping alive a service -- there are custom init scripts
> > that
> > > call each other in a mess of ways: spark-shell, spark-daemon.sh,
> > > spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh,
> > > spark-executor, spark-class, run-example, and several others in the
> bin/
> > > directory.  I would love it if Spark used the Tanuki Service Wrapper,
> > which
> > > is widely-used for Java service daemons, supports retries, installation
> > as
> > > init scripts that can be chkconfig'd, etc.  Let's not re-solve the "how
> > do
> > > I keep a service running?" problem when it's been done so well by
> Tanuki
> > --
> > > we use it at my day job for all our services, plus it's used by
> > > Elasticsearch.  This would help solve the problem where a quick bounce
> of
> > > the master causes all the workers to self-destruct.
> > > 4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this
> is
> > > entirely an Akka bug based on previous mailing list discussion with
> > Matei,
> > > but it'd be awesome if you could use either the hostname or the FQDN or
> > the
> > > IP address in the Spark URL and not have Akka barf at you.
> > >
> > > I've been telling myself I'd look into these at some point but just
> > haven't
> > > gotten around to them myself yet.  Some day!  I would prioritize these
> > > requests from most- to least-important as 3, 2, 4, 1.
> > >
> > > Andrew
> > >
> > >
> > > On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <
> > nick.pentre...@gmail.com
> > > >wrote:
> > >
> > > > Or if you're extremely ambitious work in implementing Spark Streaming
> > in
> > > > Python—
> > > > Sent from Mailbox for iPhone
> > > >
> > > > On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <
> > matei.zaha...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Matt,
> > > > > If you want to get started looking at Spark, I recommend the
> > following
> > > > resources:
> > > > > - Our issue tracker at http://spark-project.atlassian.net contains
> > > some
> > > > issues marked “Starter” that are good places to jump into. You might
> be
> > > > able to take one of those and extend it into a bigger project.
> > > > > - The “contributing to Spark” wiki page covers how to send patches
> > and
> > > > set up development:
> > > >
> > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> > > > > - This talk has an intro to Spark internals (video and slides are
> in
> > > the
> > > > comments): http://www.meetup.co

Re: Spark development for undergraduate project

2013-12-19 Thread Matthew Cheah
Thanks a lot, everyone! I'm looking into adding an algorithm to MLlib for the
project. Nice and self-contained.

-Matt Cheah


On Thu, Dec 19, 2013 at 12:52 PM, Christopher Nguyen  wrote:

> +1 to most of Andrew's suggestions here, and while we're in that
> neighborhood, how about generalizing something like "wtf-spark" (from the
> Bizo team (http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may not be of high
> academic interest, but it's something people would use many times a
> debugging day.
>
> Or am I behind and something like that is already there in 0.8?
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao 
> linkedin.com/in/ctnguyen
>
>
>
> On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash  wrote:
>
> > I think there are also some improvements that could be made to
> > deployability in an enterprise setting.  From my experience:
> >
> > 1. Most places I deploy Spark in don't have internet access.  So I can't
> > build from source, compile against a different version of Hadoop, etc
> > without doing it locally and then getting that onto my servers manually.
> >  This is less a problem with Spark now that there are binary
> distributions,
> > but it's still a problem for using Mesos with Spark.
> > 2. Configuration of Spark is confusing -- you can make configuration in
> > Java system properties, environment variables, command line parameters,
> and
> > for the standalone cluster deployment mode you need to worry about
> whether
> > these need to be set on the master, the worker, the executor, or the
> > application/driver program.  Also because spark-shell automatically
> > instantiates a SparkContext you have to set up any system properties in
> the
> > init scripts or on the command line with
> > JAVA_OPTS="-Dspark.executor.memory=8g" etc.  I'm not sure what needs to
> be
> > done, but it feels that there are gains to be made in configuration
> options
> > here.  Ideally, I would have one configuration file that can be used in
> all
> > 4 places and that's the only place to make configuration changes.
> > 3. Standalone cluster mode could use improved resiliency for starting,
> > stopping, and keeping alive a service -- there are custom init scripts
> that
> > call each other in a mess of ways: spark-shell, spark-daemon.sh,
> > spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh,
> > spark-executor, spark-class, run-example, and several others in the bin/
> > directory.  I would love it if Spark used the Tanuki Service Wrapper,
> which
> > is widely-used for Java service daemons, supports retries, installation
> as
> > init scripts that can be chkconfig'd, etc.  Let's not re-solve the "how
> do
> > I keep a service running?" problem when it's been done so well by Tanuki
> --
> > we use it at my day job for all our services, plus it's used by
> > Elasticsearch.  This would help solve the problem where a quick bounce of
> > the master causes all the workers to self-destruct.
> > 4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this is
> > entirely an Akka bug based on previous mailing list discussion with
> Matei,
> > but it'd be awesome if you could use either the hostname or the FQDN or
> the
> > IP address in the Spark URL and not have Akka barf at you.
> >
> > I've been telling myself I'd look into these at some point but just
> haven't
> > gotten around to them myself yet.  Some day!  I would prioritize these
> > requests from most- to least-important as 3, 2, 4, 1.
> >
> > Andrew
> >
> >
> > On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <
> nick.pentre...@gmail.com
> > >wrote:
> >
> > > Or if you're extremely ambitious work in implementing Spark Streaming
> in
> > > Python—
> > > Sent from Mailbox for iPhone
> > >
> > > On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia <
> matei.zaha...@gmail.com>
> > > wrote:
> > >
> > > > Hi Matt,
> > > > If you want to get started looking at Spark, I recommend the
> following
> > > resources:
> > > > - Our issue tracker at http://spark-project.atlassian.net contains
> > some
> > > issues marked “Starter” that are good places to jump into. You might be
> > > able to take one of those and extend it into a bigger project.
> > > > - The “contributing to Spark” wiki page covers how to send patches
> and
> > > set up development:
> > >
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> > > > - This talk has an intro to Spark internals (video and slides are in
> > the
> > > comments): http://www.meetup.com/spark-users/events/94101942/
> > > > For a longer project, here are some possible ones:
> > > > - Create a tool that automatically checks which Scala API methods are
> > > missing in Python. We had a similar one for Java that was very useful.
> > Even
> > > better would be to automatically create wrappers for the Scala ones.
> > > > - Extend the Spark monitoring UI with profiling information (to
> sample
> > > the workers and say where they’re spending time, or what data
> structures
> > > consum

Spark 0.8.1 Released

2013-12-19 Thread Patrick Wendell
Hi everyone,

We've just posted Spark 0.8.1, a new maintenance release that contains
some bug fixes and improvements to the 0.8 branch. The full release
notes are available at [1]. Apart from various bug fixes, 0.8.1
includes support for YARN 2.2, a high availability mode for the
standalone scheduler, and optimizations to the shuffle. We recommend
that current users update to this release. You can grab the release at
[2].

[1] http://spark.incubator.apache.org/releases/spark-release-0-8-1.html
[2] http://spark.incubator.apache.org/downloads

Thanks to the following people who contributed to this release:

Michael Armbrust, Pierre Borckmans, Evan Chan, Ewen Cheslack, Mosharaf
Chowdhury, Frank Dai, Aaron Davidson, Tathagata Das, Ankur Dave,
Harvey Feng, Ali Ghodsi, Thomas Graves, Li Guoqiang, Stephen Haberman,
Haidar Hadi, Nathan Howell, Holden Karau, Du Li, Raymond Liu, Xi Liu,
David McCauley, Michael (wannabeast), Fabrizio Milo, Mridul
Muralidharan, Sundeep Narravula, Kay Ousterhout, Nick Pentreath, Imran
Rashid, Ahir Reddy, Josh Rosen, Henry Saputra, Jerry Shao, Mingfei
Shi, Andre Schumacher, Karthik Tunga, Patrick Wendell, Neal Wiggins,
Andrew Xia, Reynold Xin, Matei Zaharia, and Wu Zeming

- Patrick


Re: How to contribute to the spark project

2013-12-19 Thread Azuryy Yu
Hi Gill,
please read here:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 On Dec 20, 2013 5:39 AM, "Gill"  wrote:

> Hi,
>
> I attended the spark summit and have been curios to know that how can I
> contribute to the spark project. I'm working on query engine optimizations
> so can help with spark query engine optimizations or with other query
> engine features.
>
> Thanks
> Gurbir
> 510 410 5108
>


How to contribute to the spark project

2013-12-19 Thread Gill
Hi,

I attended the Spark Summit and have been curious to know how I can
contribute to the Spark project. I'm working on query engine optimizations,
so I can help with Spark query engine optimizations or with other query
engine features.

Thanks
Gurbir
510 410 5108


Re: Spark development for undergraduate project

2013-12-19 Thread Christopher Nguyen
+1 to most of Andrew's suggestions here, and while we're in that
neighborhood, how about generalizing something like "wtf-spark" from the
Bizo team (http://youtu.be/6Sn1xs5DN1Y?t=38m36s)? It may not be of high
academic interest, but it's something people would use many times a day
while debugging.

Or am I behind and something like that is already there in 0.8?

--
Christopher T. Nguyen
Co-founder & CEO, Adatao 
linkedin.com/in/ctnguyen



On Thu, Dec 19, 2013 at 10:56 AM, Andrew Ash  wrote:

> I think there are also some improvements that could be made to
> deployability in an enterprise setting.  From my experience:
>
> 1. Most places I deploy Spark in don't have internet access.  So I can't
> build from source, compile against a different version of Hadoop, etc
> without doing it locally and then getting that onto my servers manually.
>  This is less a problem with Spark now that there are binary distributions,
> but it's still a problem for using Mesos with Spark.
> 2. Configuration of Spark is confusing -- you can make configuration in
> Java system properties, environment variables, command line parameters, and
> for the standalone cluster deployment mode you need to worry about whether
> these need to be set on the master, the worker, the executor, or the
> application/driver program.  Also because spark-shell automatically
> instantiates a SparkContext you have to set up any system properties in the
> init scripts or on the command line with
> JAVA_OPTS="-Dspark.executor.memory=8g" etc.  I'm not sure what needs to be
> done, but it feels that there are gains to be made in configuration options
> here.  Ideally, I would have one configuration file that can be used in all
> 4 places and that's the only place to make configuration changes.
> 3. Standalone cluster mode could use improved resiliency for starting,
> stopping, and keeping alive a service -- there are custom init scripts that
> call each other in a mess of ways: spark-shell, spark-daemon.sh,
> spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh,
> spark-executor, spark-class, run-example, and several others in the bin/
> directory.  I would love it if Spark used the Tanuki Service Wrapper, which
> is widely-used for Java service daemons, supports retries, installation as
> init scripts that can be chkconfig'd, etc.  Let's not re-solve the "how do
> I keep a service running?" problem when it's been done so well by Tanuki --
> we use it at my day job for all our services, plus it's used by
> Elasticsearch.  This would help solve the problem where a quick bounce of
> the master causes all the workers to self-destruct.
> 4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this is
> entirely an Akka bug based on previous mailing list discussion with Matei,
> but it'd be awesome if you could use either the hostname or the FQDN or the
> IP address in the Spark URL and not have Akka barf at you.
>
> I've been telling myself I'd look into these at some point but just haven't
> gotten around to them myself yet.  Some day!  I would prioritize these
> requests from most- to least-important as 3, 2, 4, 1.
>
> Andrew
>
>
> On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath  >wrote:
>
> > Or if you're extremely ambitious work in implementing Spark Streaming in
> > Python—
> > Sent from Mailbox for iPhone
> >
> > On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia 
> > wrote:
> >
> > > Hi Matt,
> > > If you want to get started looking at Spark, I recommend the following
> > resources:
> > > - Our issue tracker at http://spark-project.atlassian.net contains
> some
> > issues marked “Starter” that are good places to jump into. You might be
> > able to take one of those and extend it into a bigger project.
> > > - The “contributing to Spark” wiki page covers how to send patches and
> > set up development:
> > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> > > - This talk has an intro to Spark internals (video and slides are in
> the
> > comments): http://www.meetup.com/spark-users/events/94101942/
> > > For a longer project, here are some possible ones:
> > > - Create a tool that automatically checks which Scala API methods are
> > missing in Python. We had a similar one for Java that was very useful.
> Even
> > better would be to automatically create wrappers for the Scala ones.
> > > - Extend the Spark monitoring UI with profiling information (to sample
> > the workers and say where they’re spending time, or what data structures
> > consume the most memory).
> > > - Pick and implement a new machine learning algorithm for MLlib.
> > > Matei
> > > On Dec 17, 2013, at 10:43 AM, Matthew Cheah 
> > wrote:
> > >> Hi everyone,
> > >>
> > >> During my most recent internship, I worked extensively with Apache
> > Spark,
> > >> integrating it into a company's data analytics platform. I've now
> become
> > >> interested in contributing to Apache Spark.
> > >>
> > >> I'm returning to undergraduate studies 

Re: Spark development for undergraduate project

2013-12-19 Thread Andrew Ash
Wow yes, that PR#230 looks like exactly what I outlined in #2!  I'll leave
some comments on there.

Anything going on for service reliability (#3) since apparently someone is
reading my mind?


On Thu, Dec 19, 2013 at 2:02 PM, Nick Pentreath wrote:

> Some good things to look at though hopefully #2 will be largely addressed
> by: https://github.com/apache/incubator-spark/pull/230—
> Sent from Mailbox for iPhone
>
> On Thu, Dec 19, 2013 at 8:57 PM, Andrew Ash  wrote:
>
> > I think there are also some improvements that could be made to
> > deployability in an enterprise setting.  From my experience:
> > 1. Most places I deploy Spark in don't have internet access.  So I can't
> > build from source, compile against a different version of Hadoop, etc
> > without doing it locally and then getting that onto my servers manually.
> >  This is less a problem with Spark now that there are binary
> distributions,
> > but it's still a problem for using Mesos with Spark.
> > 2. Configuration of Spark is confusing -- you can make configuration in
> > Java system properties, environment variables, command line parameters,
> and
> > for the standalone cluster deployment mode you need to worry about
> whether
> > these need to be set on the master, the worker, the executor, or the
> > application/driver program.  Also because spark-shell automatically
> > instantiates a SparkContext you have to set up any system properties in
> the
> > init scripts or on the command line with
> > JAVA_OPTS="-Dspark.executor.memory=8g" etc.  I'm not sure what needs to
> be
> > done, but it feels that there are gains to be made in configuration
> options
> > here.  Ideally, I would have one configuration file that can be used in
> all
> > 4 places and that's the only place to make configuration changes.
> > 3. Standalone cluster mode could use improved resiliency for starting,
> > stopping, and keeping alive a service -- there are custom init scripts
> that
> > call each other in a mess of ways: spark-shell, spark-daemon.sh,
> > spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh,
> > spark-executor, spark-class, run-example, and several others in the bin/
> > directory.  I would love it if Spark used the Tanuki Service Wrapper,
> which
> > is widely-used for Java service daemons, supports retries, installation
> as
> > init scripts that can be chkconfig'd, etc.  Let's not re-solve the "how
> do
> > I keep a service running?" problem when it's been done so well by Tanuki
> --
> > we use it at my day job for all our services, plus it's used by
> > Elasticsearch.  This would help solve the problem where a quick bounce of
> > the master causes all the workers to self-destruct.
> > 4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this is
> > entirely an Akka bug based on previous mailing list discussion with
> Matei,
> > but it'd be awesome if you could use either the hostname or the FQDN or
> the
> > IP address in the Spark URL and not have Akka barf at you.
> > I've been telling myself I'd look into these at some point but just
> haven't
> > gotten around to them myself yet.  Some day!  I would prioritize these
> > requests from most- to least-important as 3, 2, 4, 1.
> > Andrew
> > On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath <
> nick.pentre...@gmail.com>wrote:
> >> Or if you're extremely ambitious work in implementing Spark Streaming in
> >> Python—
> >> Sent from Mailbox for iPhone
> >>
> >> On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia  >
> >> wrote:
> >>
> >> > Hi Matt,
> >> > If you want to get started looking at Spark, I recommend the following
> >> resources:
> >> > - Our issue tracker at http://spark-project.atlassian.net contains
> some
> >> issues marked “Starter” that are good places to jump into. You might be
> >> able to take one of those and extend it into a bigger project.
> >> > - The “contributing to Spark” wiki page covers how to send patches and
> >> set up development:
> >> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> >> > - This talk has an intro to Spark internals (video and slides are in
> the
> >> comments): http://www.meetup.com/spark-users/events/94101942/
> >> > For a longer project, here are some possible ones:
> >> > - Create a tool that automatically checks which Scala API methods are
> >> missing in Python. We had a similar one for Java that was very useful.
> Even
> >> better would be to automatically create wrappers for the Scala ones.
> >> > - Extend the Spark monitoring UI with profiling information (to sample
> >> the workers and say where they’re spending time, or what data structures
> >> consume the most memory).
> >> > - Pick and implement a new machine learning algorithm for MLlib.
> >> > Matei
> >> > On Dec 17, 2013, at 10:43 AM, Matthew Cheah 
> >> wrote:
> >> >> Hi everyone,
> >> >>
> >> >> During my most recent internship, I worked extensively with Apache
> >> Spark,
> >> >> integrating it into a company's data analytics platform. I'

Re: Spark development for undergraduate project

2013-12-19 Thread Nick Pentreath
Some good things to look at, though hopefully #2 will be largely addressed by: 
https://github.com/apache/incubator-spark/pull/230—
Sent from Mailbox for iPhone

On Thu, Dec 19, 2013 at 8:57 PM, Andrew Ash  wrote:

> I think there are also some improvements that could be made to
> deployability in an enterprise setting.  From my experience:
> 1. Most places I deploy Spark in don't have internet access.  So I can't
> build from source, compile against a different version of Hadoop, etc
> without doing it locally and then getting that onto my servers manually.
>  This is less a problem with Spark now that there are binary distributions,
> but it's still a problem for using Mesos with Spark.
> 2. Configuration of Spark is confusing -- you can make configuration in
> Java system properties, environment variables, command line parameters, and
> for the standalone cluster deployment mode you need to worry about whether
> these need to be set on the master, the worker, the executor, or the
> application/driver program.  Also because spark-shell automatically
> instantiates a SparkContext you have to set up any system properties in the
> init scripts or on the command line with
> JAVA_OPTS="-Dspark.executor.memory=8g" etc.  I'm not sure what needs to be
> done, but it feels that there are gains to be made in configuration options
> here.  Ideally, I would have one configuration file that can be used in all
> 4 places and that's the only place to make configuration changes.
> 3. Standalone cluster mode could use improved resiliency for starting,
> stopping, and keeping alive a service -- there are custom init scripts that
> call each other in a mess of ways: spark-shell, spark-daemon.sh,
> spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh,
> spark-executor, spark-class, run-example, and several others in the bin/
> directory.  I would love it if Spark used the Tanuki Service Wrapper, which
> is widely-used for Java service daemons, supports retries, installation as
> init scripts that can be chkconfig'd, etc.  Let's not re-solve the "how do
> I keep a service running?" problem when it's been done so well by Tanuki --
> we use it at my day job for all our services, plus it's used by
> Elasticsearch.  This would help solve the problem where a quick bounce of
> the master causes all the workers to self-destruct.
> 4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this is
> entirely an Akka bug based on previous mailing list discussion with Matei,
> but it'd be awesome if you could use either the hostname or the FQDN or the
> IP address in the Spark URL and not have Akka barf at you.
> I've been telling myself I'd look into these at some point but just haven't
> gotten around to them myself yet.  Some day!  I would prioritize these
> requests from most- to least-important as 3, 2, 4, 1.
> Andrew
> On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath 
> wrote:
>> Or if you're extremely ambitious work in implementing Spark Streaming in
>> Python—
>> Sent from Mailbox for iPhone
>>
>> On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia 
>> wrote:
>>
>> > Hi Matt,
>> > If you want to get started looking at Spark, I recommend the following
>> resources:
>> > - Our issue tracker at http://spark-project.atlassian.net contains some
>> issues marked “Starter” that are good places to jump into. You might be
>> able to take one of those and extend it into a bigger project.
>> > - The “contributing to Spark” wiki page covers how to send patches and
>> set up development:
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>> > - This talk has an intro to Spark internals (video and slides are in the
>> comments): http://www.meetup.com/spark-users/events/94101942/
>> > For a longer project, here are some possible ones:
>> > - Create a tool that automatically checks which Scala API methods are
>> missing in Python. We had a similar one for Java that was very useful. Even
>> better would be to automatically create wrappers for the Scala ones.
>> > - Extend the Spark monitoring UI with profiling information (to sample
>> the workers and say where they’re spending time, or what data structures
>> consume the most memory).
>> > - Pick and implement a new machine learning algorithm for MLlib.
>> > Matei
>> > On Dec 17, 2013, at 10:43 AM, Matthew Cheah 
>> wrote:
>> >> Hi everyone,
>> >>
>> >> During my most recent internship, I worked extensively with Apache
>> Spark,
>> >> integrating it into a company's data analytics platform. I've now become
>> >> interested in contributing to Apache Spark.
>> >>
>> >> I'm returning to undergraduate studies in January and there is an
>> academic
>> >> course which is simply a standalone software engineering project. I was
>> >> thinking that some contribution to Apache Spark would satisfy my
>> curiosity,
>> >> help continue support the company I interned at, and give me academic
>> >> credits required to graduate, all at the same time. It seems like too

Re: [PySpark]: reading arbitrary Hadoop InputFormats

2013-12-19 Thread Nick Pentreath
Hi


I managed to find the time to put together a PR on this:
https://github.com/apache/incubator-spark/pull/263

Josh has had a look over it - if anyone else with an interest could give some
feedback, that would be great.

As mentioned in the PR, it's more of an RFC and certainly still needs a bit of
clean-up work, and I need to add the concept of "wrapper functions" to
deserialize classes that MsgPack can't handle out of the box.

N
—
Sent from Mailbox for iPhone
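
To make the "wrapper functions" idea above concrete, here is a minimal sketch of
what one could look like on the Scala side. It is only an illustration built on
the assumption, taken from earlier in this thread, that a wrapper is simply a
T => String applied to each key and value; the object and method names below are
made up and are not the API in PR 263.

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.spark.SparkContext

    object WrapperFunctionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "wrapper-sketch")

        // "Wrapper functions": plain T => String converters supplied by the caller
        // for key/value classes that a generic serializer can't handle directly.
        val keyWrapper: Text => String = _.toString
        val valueWrapper: IntWritable => String = w => w.get.toString

        // Read a SequenceFile of (Text, IntWritable) and apply the wrappers,
        // producing an RDD[(String, String)] that is easy to hand over to Python.
        val asStrings = sc
          .sequenceFile(args(0), classOf[Text], classOf[IntWritable])
          .map { case (k, v) => (keyWrapper(k), valueWrapper(v)) }

        asStrings.take(5).foreach(println)
        sc.stop()
      }
    }

The same shape would work for any InputFormat: the caller supplies the two
conversion functions and everything downstream only ever sees strings.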

On Fri, Nov 8, 2013 at 12:20 PM, Nick Pentreath 
wrote:

> Wow Josh, that looks great. I've been a bit swamped this week but as soon
> as I get a chance I'll test out the PR in more detail and port over the
> InputFormat stuff to use the new framework (including the changes you
> suggested).
> I can then look deeper into the MsgPack functionality to see if it can be
> made to work in a generic enough manner without requiring huge amounts of
> custom Templates to be written by users.
> Will feed back asap.
> N
> On Thu, Nov 7, 2013 at 5:03 AM, Josh Rosen  wrote:
>> I opened a pull request to add custom serializer support to PySpark:
>> https://github.com/apache/incubator-spark/pull/146
>>
>> My pull request adds the plumbing for transferring data from Java to Python
>> using formats other than Pickle.  For example, look at how textFile() uses
>> MUTF8Deserializer to read strings from Java.  Hopefully this provides all
>> of the functionality needed to support MsgPack.
>>
>> - Josh
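
As a side note on the framing Josh mentions above (strings read from Java via a
modified-UTF-8 deserializer), the sketch below shows the general idea using only
the standard library. It is not the PySpark code itself, just an illustration of
the length-prefixed modified UTF-8 records that DataOutputStream.writeUTF produces.

    import java.io.{ByteArrayOutputStream, DataOutputStream}

    // Each writeUTF call emits a 2-byte length prefix followed by the string in
    // modified UTF-8; a Python-side reader can mirror this by reading the length
    // and then decoding that many bytes, one record at a time.
    object Mutf8FramingSketch {
      def frame(strings: Seq[String]): Array[Byte] = {
        val buffer = new ByteArrayOutputStream()
        val out = new DataOutputStream(buffer)
        strings.foreach(out.writeUTF)
        out.flush()
        buffer.toByteArray
      }

      def main(args: Array[String]): Unit = {
        val framed = frame(Seq("hello", "pyspark"))
        println("framed " + framed.length + " bytes")  // (2 + 5) + (2 + 7) = 16
      }
    }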
>>
>>
>> On Thu, Oct 31, 2013 at 11:11 AM, Josh Rosen  wrote:
>>
>> > Hi Nick,
>> >
>> > This is a nice start.  I'd prefer to keep the Java sequenceFileAsText()
>> > and newHadoopFileAsText() methods inside PythonRDD instead of adding them
>> > to JavaSparkContext, since I think these methods are unlikely to be used
>> > directly by Java users (you can add these methods to the PythonRDD
>> > companion object, which is how readRDDFromPickleFile is implemented:
>> >
>> https://github.com/apache/incubator-spark/blob/branch-0.8/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala#L255
>> > )
>> >
>> > For MsgPack, the UnpicklingError is because the Python worker expects to
>> > receive its input in a pickled format.  In my prototype of custom
>> > serializers, I modified the PySpark worker to receive its
>> > serialization/deserialization function as input (
>> >
>> https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/worker.py#L41
>> )
>> > and added logic to pass the appropriate serializers based on each stage's
>> > input and output formats (
>> >
>> https://github.com/JoshRosen/spark/blob/59b6b43916dc84fc8b83f22eb9ce13a27bc51ec0/python/pyspark/rdd.py#L42
>> > ).
>> >
>> > At some point, I'd like to port my custom serializers code to PySpark; if
>> > anyone's interested in helping, I'd be glad to write up some additional
>> > notes on how this should work.
>> >
>> > - Josh
>> >
>> > On Wed, Oct 30, 2013 at 2:25 PM, Nick Pentreath <
>> nick.pentre...@gmail.com>wrote:
>> >
>> >> Thanks Josh, Patrick for the feedback.
>> >>
>> >> Based on Josh's pointers I have something working for JavaPairRDD ->
>> >> PySpark RDD[(String, String)]. This just calls the toString method on
>> each
>> >> key and value as before, but without the need for a delimiter. For
>> >> SequenceFile, it uses SequenceFileAsTextInputFormat which itself calls
>> >> toString to convert to Text for keys and values. We then call toString
>> >> (again) ourselves to get Strings to feed to writeAsPickle.
>> >>
>> >> Details here: https://gist.github.com/MLnick/7230588
>> >>
>> >> This also illustrates where the "wrapper function" api would fit in. All
>> >> that is required is to define a T => String for key and value.
>> >>
>> >> I started playing around with MsgPack and can sort of get things to work
>> >> in
>> >> Scala, but am struggling with getting the raw bytes to be written
>> properly
>> >> in PythonRDD (I think it is treating them as pickled byte arrays when
>> they
>> >> are not, but when I removed the 'stripPickle' calls and amended the
>> length
>> >> (-6) I got "UnpicklingError: invalid load key, ' '. ").
>> >>
>> >> Another issue is that MsgPack does well at writing "structures" - like
>> >> Java
>> >> classes with public fields that are fairly simple - but for example the
>> >> Writables have private fields so you end up with nothing being written.
>> >> This looks like it would require custom "Templates" (serialization
>> >> functions effectively) for many classes, which means a lot of custom
>> code
>> >> for a user to write to use it. Fortunately for most of the common
>> >> Writables
>> >> a toString does the job. Will keep looking into it though.
>> >>
>> >> Anyway, Josh if you have ideas or examples on the "Wrapper API from
>> >> Python"
>> >> that you mentioned, I'd be interested to hear them.
>> >>
>> >> If you think this is worth working up as a Pull Request covering
>> >> SequenceFiles and custom InputFormats with default to

Re: Spark development for undergraduate project

2013-12-19 Thread Andrew Ash
I think there are also some improvements that could be made to
deployability in an enterprise setting.  From my experience:

1. Most places I deploy Spark in don't have internet access.  So I can't
build from source, compile against a different version of Hadoop, etc
without doing it locally and then getting that onto my servers manually.
This is less of a problem with Spark now that there are binary distributions,
but it's still a problem for using Mesos with Spark.
2. Configuration of Spark is confusing -- you can set configuration via
Java system properties, environment variables, command line parameters, and
for the standalone cluster deployment mode you need to worry about whether
these need to be set on the master, the worker, the executor, or the
application/driver program.  Also because spark-shell automatically
instantiates a SparkContext you have to set up any system properties in the
init scripts or on the command line with
JAVA_OPTS="-Dspark.executor.memory=8g" etc.  I'm not sure what needs to be
done, but it feels that there are gains to be made in configuration options
here.  Ideally, I would have one configuration file that can be used in all
4 places and that's the only place to make configuration changes.
3. Standalone cluster mode could use improved resiliency for starting,
stopping, and keeping alive a service -- there are custom init scripts that
call each other in a mess of ways: spark-shell, spark-daemon.sh,
spark-daemons.sh, spark-config.sh, spark-env.sh, compute-classpath.sh,
spark-executor, spark-class, run-example, and several others in the bin/
directory.  I would love it if Spark used the Tanuki Service Wrapper, which
is widely used for Java service daemons, supports retries, installation as
init scripts that can be chkconfig'd, etc.  Let's not re-solve the "how do
I keep a service running?" problem when it's been done so well by Tanuki --
we use it at my day job for all our services, plus it's used by
Elasticsearch.  This would help solve the problem where a quick bounce of
the master causes all the workers to self-destruct.
4. Sensitivity to hostname vs FQDN vs IP address in spark URL -- this is
entirely an Akka bug based on previous mailing list discussion with Matei,
but it'd be awesome if you could use either the hostname or the FQDN or the
IP address in the Spark URL and not have Akka barf at you.

I've been telling myself I'd look into these at some point but just haven't
gotten around to them myself yet.  Some day!  I would prioritize these
requests from most- to least-important as 3, 2, 4, 1.

Andrew
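
On the configuration point (#2) in the message above, here is a minimal sketch of
the "one configuration file" idea. It assumes only what the message itself relies
on - that Spark picks up spark.* Java system properties set in the driver JVM -
and the file name, keys, and helper object are hypothetical examples, not an
existing Spark feature; PR 230 referenced elsewhere in this thread is the fuller
answer.

    import java.io.FileInputStream
    import java.util.Properties

    import scala.collection.JavaConverters._

    import org.apache.spark.SparkContext

    object SingleFileConfig {
      // Read spark.* keys from one properties file and publish them as Java system
      // properties before the SparkContext is created, so driver-side settings live
      // in a single place instead of being scattered across JAVA_OPTS, environment
      // variables, and init scripts.
      def loadInto(path: String): Unit = {
        val props = new Properties()
        val in = new FileInputStream(path)
        try props.load(in) finally in.close()
        props.stringPropertyNames().asScala
          .filter(_.startsWith("spark."))
          .foreach(key => System.setProperty(key, props.getProperty(key)))
      }

      def main(args: Array[String]): Unit = {
        loadInto("conf/spark-settings.properties")  // hypothetical path
        val sc = new SparkContext("local", "single-file-config-demo")
        println("spark.executor.memory = " + System.getProperty("spark.executor.memory"))
        sc.stop()
      }
    }

The same file could then be shipped unchanged to the master, workers, and driver,
which is the "one place to make configuration changes" being asked for.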


On Thu, Dec 19, 2013 at 1:38 PM, Nick Pentreath wrote:

> Or if you're extremely ambitious work in implementing Spark Streaming in
> Python—
> Sent from Mailbox for iPhone
>
> On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia 
> wrote:
>
> > Hi Matt,
> > If you want to get started looking at Spark, I recommend the following
> resources:
> > - Our issue tracker at http://spark-project.atlassian.net contains some
> issues marked “Starter” that are good places to jump into. You might be
> able to take one of those and extend it into a bigger project.
> > - The “contributing to Spark” wiki page covers how to send patches and
> set up development:
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> > - This talk has an intro to Spark internals (video and slides are in the
> comments): http://www.meetup.com/spark-users/events/94101942/
> > For a longer project, here are some possible ones:
> > - Create a tool that automatically checks which Scala API methods are
> missing in Python. We had a similar one for Java that was very useful. Even
> better would be to automatically create wrappers for the Scala ones.
> > - Extend the Spark monitoring UI with profiling information (to sample
> the workers and say where they’re spending time, or what data structures
> consume the most memory).
> > - Pick and implement a new machine learning algorithm for MLlib.
> > Matei
> > On Dec 17, 2013, at 10:43 AM, Matthew Cheah 
> wrote:
> >> Hi everyone,
> >>
> >> During my most recent internship, I worked extensively with Apache
> Spark,
> >> integrating it into a company's data analytics platform. I've now become
> >> interested in contributing to Apache Spark.
> >>
> >> I'm returning to undergraduate studies in January and there is an
> academic
> >> course which is simply a standalone software engineering project. I was
> >> thinking that some contribution to Apache Spark would satisfy my
> curiosity,
> >> help continue support the company I interned at, and give me academic
> >> credits required to graduate, all at the same time. It seems like too
> good
> >> an opportunity to pass up.
> >>
> >> With that in mind, I have the following questions:
> >>
> >>   1. At this point, is there any self-contained project that I could
> work
> >>   on within Spark? Ideally, I would work on it independently, in about a
> >>   three month time frame. This time also needs to accommodate r

Re: Spark development for undergraduate project

2013-12-19 Thread Nick Pentreath
Or, if you're extremely ambitious, work on implementing Spark Streaming in Python—
Sent from Mailbox for iPhone

On Thu, Dec 19, 2013 at 8:30 PM, Matei Zaharia 
wrote:

> Hi Matt,
> If you want to get started looking at Spark, I recommend the following 
> resources:
> - Our issue tracker at http://spark-project.atlassian.net contains some 
> issues marked “Starter” that are good places to jump into. You might be able 
> to take one of those and extend it into a bigger project.
> - The “contributing to Spark” wiki page covers how to send patches and set up 
> development: 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> - This talk has an intro to Spark internals (video and slides are in the 
> comments): http://www.meetup.com/spark-users/events/94101942/
> For a longer project, here are some possible ones:
> - Create a tool that automatically checks which Scala API methods are missing 
> in Python. We had a similar one for Java that was very useful. Even better 
> would be to automatically create wrappers for the Scala ones.
> - Extend the Spark monitoring UI with profiling information (to sample the 
> workers and say where they’re spending time, or what data structures consume 
> the most memory).
> - Pick and implement a new machine learning algorithm for MLlib.
> Matei
> On Dec 17, 2013, at 10:43 AM, Matthew Cheah  wrote:
>> Hi everyone,
>> 
>> During my most recent internship, I worked extensively with Apache Spark,
>> integrating it into a company's data analytics platform. I've now become
>> interested in contributing to Apache Spark.
>> 
>> I'm returning to undergraduate studies in January and there is an academic
>> course which is simply a standalone software engineering project. I was
>> thinking that some contribution to Apache Spark would satisfy my curiosity,
>> help continue support the company I interned at, and give me academic
>> credits required to graduate, all at the same time. It seems like too good
>> an opportunity to pass up.
>> 
>> With that in mind, I have the following questions:
>> 
>>   1. At this point, is there any self-contained project that I could work
>>   on within Spark? Ideally, I would work on it independently, in about a
>>   three month time frame. This time also needs to accommodate ramping up on
>>   the Spark codebase and adjusting to the Scala programming language and
>>   paradigms. The company I worked at primarily used the Java APIs. The output
>>   needs to be a technical report describing the project requirements, and the
>>   design process I took to engineer the solution for the requirements. In
>>   particular, it cannot just be a series of haphazard patches.
>>   2. How can I get started with contributing to Spark?
>>   3. Is there a high-level UML or some other design specification for the
>>   Spark architecture?
>> 
>> Thanks! I hope to be of some help =)
>> 
>> -Matt Cheah

Re: View bound deprecation (Scala 2.11+)

2013-12-19 Thread Matei Zaharia
We can open a JIRA but let’s wait to see what the Scala guys decide. I’m sure 
they’ll recommend some alternatives.

Matei
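
For readers who haven't hit this before, the usual migration is mechanical: a view
bound is just sugar for an implicit conversion parameter. The sketch below (a
made-up maxElem example, not Spark code) shows the before and after.

    object ViewBoundExample {
      // View-bound form, slated for deprecation: T must be viewable as Ordered[T].
      def maxElemOld[T <% Ordered[T]](xs: Seq[T]): T =
        xs.reduceLeft((a, b) => if (a < b) b else a)

      // Equivalent form without the view bound: the implicit conversion becomes an
      // ordinary implicit parameter, so the signature keeps compiling after the
      // <% syntax goes away.
      def maxElemNew[T](xs: Seq[T])(implicit view: T => Ordered[T]): T =
        xs.reduceLeft((a, b) => if (view(a) < b) b else a)

      def main(args: Array[String]): Unit = {
        println(maxElemOld(Seq(3, 1, 2)))  // 3
        println(maxElemNew(Seq(3, 1, 2)))  // 3
      }
    }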

On Dec 19, 2013, at 9:27 AM, Marek Kolodziej  wrote:

> All,
> 
> Apparently view bounds will be deprecated going forward. Hopefully they'll
> be around for a while after deprecation, but I wanted to raise this issue
> for consideration. Here's the SIP:
> https://issues.scala-lang.org/browse/SI-7629
> 
> Shall I file a Jira for that?
> 
> Thanks!
> 
> Marek



Re: Spark development for undergraduate project

2013-12-19 Thread Matei Zaharia
Hi Matt,

If you want to get started looking at Spark, I recommend the following 
resources:

- Our issue tracker at http://spark-project.atlassian.net contains some issues 
marked “Starter” that are good places to jump into. You might be able to take 
one of those and extend it into a bigger project.

- The “contributing to Spark” wiki page covers how to send patches and set up 
development: 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 

- This talk has an intro to Spark internals (video and slides are in the 
comments): http://www.meetup.com/spark-users/events/94101942/

For a longer project, here are some possible ones:

- Create a tool that automatically checks which Scala API methods are missing 
in Python. We had a similar one for Java that was very useful. Even better 
would be to automatically create wrappers for the Scala ones.

- Extend the Spark monitoring UI with profiling information (to sample the 
workers and say where they’re spending time, or what data structures consume 
the most memory).

- Pick and implement a new machine learning algorithm for MLlib.

Matei
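
The first long-project idea above (checking which Scala API methods are missing in
Python) can be prototyped with plain reflection. A rough sketch follows; the
PySpark method list is hard-coded here purely for illustration, and in practice it
could be dumped from Python with dir(pyspark.rdd.RDD) and read from a file.

    import org.apache.spark.rdd.RDD

    object PySparkCoverageSketch {
      def main(args: Array[String]): Unit = {
        // Public methods of the Scala RDD class, minus compiler-generated names.
        // Inherited Object methods will show up too; good enough for a first pass.
        val scalaMethods = classOf[RDD[_]].getMethods
          .map(_.getName)
          .filterNot(_.contains("$"))
          .toSet

        // Illustrative subset of PySpark RDD methods; not a complete list.
        val pythonMethods = Set("map", "flatMap", "filter", "collect", "count",
          "reduce", "cache", "distinct", "union", "take")

        val missing = (scalaMethods -- pythonMethods).toSeq.sorted
        println(missing.size + " Scala RDD methods with no obvious PySpark counterpart:")
        missing.foreach(name => println("  " + name))
      }
    }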

On Dec 17, 2013, at 10:43 AM, Matthew Cheah  wrote:

> Hi everyone,
> 
> During my most recent internship, I worked extensively with Apache Spark,
> integrating it into a company's data analytics platform. I've now become
> interested in contributing to Apache Spark.
> 
> I'm returning to undergraduate studies in January and there is an academic
> course which is simply a standalone software engineering project. I was
> thinking that some contribution to Apache Spark would satisfy my curiosity,
> help continue support the company I interned at, and give me academic
> credits required to graduate, all at the same time. It seems like too good
> an opportunity to pass up.
> 
> With that in mind, I have the following questions:
> 
>   1. At this point, is there any self-contained project that I could work
>   on within Spark? Ideally, I would work on it independently, in about a
>   three month time frame. This time also needs to accommodate ramping up on
>   the Spark codebase and adjusting to the Scala programming language and
>   paradigms. The company I worked at primarily used the Java APIs. The output
>   needs to be a technical report describing the project requirements, and the
>   design process I took to engineer the solution for the requirements. In
>   particular, it cannot just be a series of haphazard patches.
>   2. How can I get started with contributing to Spark?
>   3. Is there a high-level UML or some other design specification for the
>   Spark architecture?
> 
> Thanks! I hope to be of some help =)
> 
> -Matt Cheah



View bound deprecation (Scala 2.11+)

2013-12-19 Thread Marek Kolodziej
All,

Apparently view bounds will be deprecated going forward. Hopefully they'll
be around for a while after deprecation, but I wanted to raise this issue
for consideration. Here's the SIP:
https://issues.scala-lang.org/browse/SI-7629

Shall I file a Jira for that?

Thanks!

Marek