Re: Propose to Re-organize the scripts and configurations

2013-09-25 Thread Evan Chan
Shane, and others,

Let's work together on the configuration thing. I had proposed in a
separate thread to use Typesafe Config to hold all configuration
(essentially a configuration class, but one which can read from both JSON
files as well as -D java command line args).

Typesafe Config works much, much better than a simple config class, and also
better than Hadoop configs. It also has advantages over JSON (it's more
readable and supports comments). It would also be the easiest transition
from the current scheme, since the current java system properties can be
seamlessly integrated.
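
To make this concrete, here is a minimal sketch of the layering I have in
mind, assuming the Typesafe Config library is on the classpath (the key
name and fallback value below are made up for illustration, not actual
Spark options):

    import com.typesafe.config.{Config, ConfigFactory}

    object ConfigSketch {
      def main(args: Array[String]) {
        // load() layers -D system properties on top of application.conf
        // (HOCON/JSON on the classpath), which in turn falls back to any
        // reference.conf defaults -- so -Dspark.executor.memory=2g on the
        // java command line overrides the value in the file.
        val conf: Config = ConfigFactory.load()

        val executorMemory =
          if (conf.hasPath("spark.executor.memory"))
            conf.getString("spark.executor.memory")
          else
            "512m"  // illustrative fallback, not necessarily Spark's real default

        println("spark.executor.memory = " + executorMemory)
      }
    }

The nice part is that existing -Dspark.* flags keep working unchanged; a
config file just adds another, lower-priority layer underneath them.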

I would be happy to contribute this back soon because it is also a big pain
point for us.  I also have extensive experience with both Typesafe Config
and other config systems.

I would definitely start with SparkContext and work our way out from there.
In fact, I can submit a patch for everyone to test out fairly quickly, just
for SparkContext.

-Evan



On Tue, Sep 24, 2013 at 10:26 PM, Shane Huang wrote:

> I think it's good to have Bigtop package Spark. But in this track we're
> just aiming to enhance the usability of Spark itself, without Bigtop.
> After all, few of our customers use Bigtop.
>
>
> On Wed, Sep 25, 2013 at 1:20 PM, Shane Huang wrote:
>
> > I think it's good to have Bigtop to package Spark. But I
> >
> >
> > On Wed, Sep 25, 2013 at 1:16 PM, Konstantin Boudnik wrote:
> >
> >> Late to the game, but... Bigtop is packaging Spark now as part of the
> >> standard distribution - our release 0.7.0 is around the corner. And we
> >> do it in the same way that has been done for Hadoop. Perhaps it's worth
> >> looking into...
> >>
> >> Lemme know if you have any questions,
> >>   Cos
> >>
> >> On Sun, Sep 22, 2013 at 12:07PM, Shane Huang wrote:
> >> > And I created a new issue SPARK-915 to track the re-org of scripts as
> >> > SPARK-544 only talks about Config.
> >> > https://spark-project.atlassian.net/browse/SPARK-915
> >> >
> >> >
> >> > On Wed, Sep 18, 2013 at 1:42 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >> >
> >> > > Hi Shane,
> >> > >
> >> > > I agree with all these points. Improving the configuration system
> >> > > is one of the main things I'd like to have in the next release.
> >> > >
> >> > > > 1) Usually the application developers/users and platform
> >> > > > administrators belong to two teams. So it's better to separate
> >> > > > the scripts used by administrators and application users, e.g.
> >> > > > put them in sbin and bin folders respectively.
> >> > >
> >> > > Yup, right now we don't have any attempt to install on standard
> >> > > system paths.
> >> > >
> >> > > > 3) If there are multiple ways to specify an option, an overriding
> >> > > > rule should be present and should not be error-prone.
> >> > >
> >> > > Yes, I think this should always be Configuration class in code >
> >> > > system properties > env vars. Over time we will deprecate the env
> >> > > vars and maybe even system properties.
> >> > >
> >> > > > 4) Currently the options are set and read using system
> >> > > > properties. It's hard to manage and inconvenient for users. It's
> >> > > > good to gather the options into one file using a format like XML
> >> > > > or JSON.
> >> > >
> >> > > I think this is the main thing to do first -- pick one
> >> > > configuration class and change the code to use this.
> >> > >
> >> > > > Our rough proposal:
> >> > > >
> >> > > >   - Scripts
> >> > > >
> >> > > >   1. make an "sbin" folder containing all the scripts for
> >> > > >   administrators, specifically,
> >> > > >  - all service administration scripts, i.e. start-*, stop-*,
> >> > > >  slaves.sh, *-daemons, *-daemon scripts
> >> > > >  - low-level or internally used utility scripts, i.e.
> >> > > >  compute-classpath, spark-config, spark-class, spark-executor
> >> > > >   2. make a "bin" folder containing all the scripts for
> >> > > >   application developers/users, specifically,
> >> > > >  - user-level app running scripts, i.e. pyspark, spark-shell,
> >> > > >  and we propose to add a script "spark" for users to run
> >> > > >  applications (very much like spark-class but may add some
> >> > > >  more control or convenience utilities)
> >> > > >  - scripts for status checking, e.g. spark and hadoop version
> >> > > >  checking, running applications checking, etc. We can make
> >> > > >  this a separate script or add functionality to the "spark"
> >> > > >  script.
> >> > > >   3. No wandering scripts outside the sbin and bin folders
> >> > >
> >> > > Makes sense.
> >> > >
> >> > > >   - Configurations/Options and overriding rule
> >> > > >
> >> > > >   1. Define a Configuration class which contains all the options
> >> > > >   available for a Spark application. A Configuration instance can
> >> > > >   be de-/serialized from/to a JSON-formatted file.
> >> > > >   2. Each application (SparkC

Re: Spark 0.8.0: bits need to come from ASF infrastructure

2013-09-25 Thread Patrick Wendell
Yep, we definitely need to just directly point people to the location at
apache.org where they can find the hashes. I just updated the release
notes and downloads page to point to that site.

I just wanted to point out that mirroring these through a CDN seems
philosophically the same as mirroring through Apache, since in neither
case do we expect the users to trust the artifact they download. We
just need to be more explicit that we are, indeed, mirroring, and
explain that the trusted root is at apache.org.
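
For what it's worth, the user-side check is simple: grab the tarball from
wherever (mirror or CDN), then compare its digest against the value
published under http://www.apache.org/dist/incubator/spark/. A rough
sketch of computing the digests locally (exactly which digest files end up
published there -- .md5, .sha, etc. -- is something to double-check, so
both are shown):

    import java.io.FileInputStream
    import java.security.MessageDigest

    object ChecksumSketch {
      // Stream the file through a MessageDigest and return lowercase hex.
      def hexDigest(path: String, algorithm: String): String = {
        val md = MessageDigest.getInstance(algorithm)
        val in = new FileInputStream(path)
        try {
          val buf = new Array[Byte](8192)
          var n = in.read(buf)
          while (n != -1) { md.update(buf, 0, n); n = in.read(buf) }
        } finally {
          in.close()
        }
        md.digest().map(b => "%02x".format(b & 0xff)).mkString
      }

      def main(args: Array[String]) {
        // e.g. spark-0.8.0-incubating.tgz downloaded from a mirror/CDN
        val path = args(0)
        println("MD5:  " + hexDigest(path, "MD5"))
        println("SHA1: " + hexDigest(path, "SHA-1"))
      }
    }

The point being: the tarball can come from anywhere, but the comparison
only means something if the reference value comes from the apache.org root.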

- Patrick

On Wed, Sep 25, 2013 at 3:56 PM, Roman Shaposhnik  wrote:
> On Wed, Sep 25, 2013 at 3:48 PM, Patrick Wendell  wrote:
>> Hey we've actually distributed our artifacts through amazon cloudfront
>> in the past (and that is where the website links redirect to).
>>
>> Since the apache mirrors don't distribute signatures anyways,
>
> True, but apache dist does. IOW, it is not uncommon for those
> with automated build/fetching systems to get bits from
> one of the mirrors and then get the hashes directly from dist.
>
> In your current case, I don't think I know of a way to do that.
>
> Now, you may say that the current CDN you guys are using
> is functioning like a mirror -- well, I'd say that it needs to be
> called out as one then.
>
> Otherwise, as a naive user I *really* have to guess where
> to get the hashes.
>
>> what is the difference between linking to an apache mirror vs using a more
>> robust CDN? If people want to verify the downloads they need to go to
>> the apache root in either case.
>>
>> Is this just a cultural thing or is there some security reason?
>
> A bit of both I guess.
>
> Thanks,
> Roman.


Re: Spark 0.8.0: bits need to come from ASF infrastructure

2013-09-25 Thread Roman Shaposhnik
On Wed, Sep 25, 2013 at 3:48 PM, Patrick Wendell  wrote:
> Hey we've actually distributed our artifacts through amazon cloudfront
> in the past (and that is where the website links redirect to).
>
> Since the apache mirrors don't distribute signatures anyways,

True, but apache dist does. IOW, it is not uncommon for those
with automated build/fetching systems to get bits from
one of the mirrors and then get the hashes directly from dist.

In your current case, I don't think I know of a way to do that.

Now, you may say that the current CDN you guys are using
is functioning like a mirror -- well, I'd say that it needs to be
called out as one then.

Otherwise, as a naive user I *really* have to guess where
to get the hashes.

> what is the difference between linking to an apache mirror vs using a more
> robust CDN? If people want to verify the downloads they need to go to
> the apache root in either case.
>
> Is this just a cultural thing or is there some security reason?

A bit of both I guess.

Thanks,
Roman.


Re: Spark 0.8.0: bits need to come from ASF infrastructure

2013-09-25 Thread Patrick Wendell
Hey we've actually distributed our artifacts through amazon cloudfront
in the past (and that is where the website links redirect to).

Since the apache mirrors don't distribute signatures anyways, what is
the difference between linking to an apache mirror vs using a more
robust CDN? If people want to verify the downloads they need to go to
the apache root in either case.

Is this just a cultural thing or is there some security reason?

- Patrick

On Wed, Sep 25, 2013 at 3:45 PM, Roman Shaposhnik  wrote:
> On Wed, Sep 25, 2013 at 3:40 PM, Henry Saputra wrote:
>> Was there an announcement that the 0.8 artifact had been pushed to
>> http://www.apache.org/dist/incubator/spark ?
>>
>> I thought the link should point to
>> http://www.apache.org/dist/incubator/spark/spark-0.8.0-incubating/spark-0.8.0-incubating.tgz
>
> For the freshly released bits it is typically better to point to 
> dyn/closer.cgi
> unless you want to start building negative karma with ASF infra ;-)
>
> Thanks,
> Roman.


Re: Spark 0.8.0: bits need to come from ASF infrastructure

2013-09-25 Thread Roman Shaposhnik
On Wed, Sep 25, 2013 at 3:40 PM, Henry Saputra  wrote:
> Was there an announcement that the 0.8 artifact had been pushed to
> http://www.apache.org/dist/incubator/spark ?
>
> I thought the link should point to
> http://www.apache.org/dist/incubator/spark/spark-0.8.0-incubating/spark-0.8.0-incubating.tgz

For the freshly released bits it is typically better to point to dyn/closer.cgi
unless you want to start building negative karma with ASF infra ;-)

Thanks,
Roman.


Re: Spark 0.8.0: bits need to come from ASF infrastructure

2013-09-25 Thread Henry Saputra
Was there an announcement that the 0.8 artifact had been pushed to
http://www.apache.org/dist/incubator/spark ?

I thought the link should point to
http://www.apache.org/dist/incubator/spark/spark-0.8.0-incubating/spark-0.8.0-incubating.tgz


- Henry

On Wed, Sep 25, 2013 at 3:24 PM, Roman Shaposhnik  wrote:
> Hi!
>
> I see that the current download link published here:
>   http://spark.incubator.apache.org/releases/spark-release-0-8-0.html
> leads to:
>   http://spark-project.org/download/spark-0.8.0-incubating.tgz
>
> This needs to be corrected to be (roughly):
>
> http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.8.0-incubating/spark-0.8.0-incubating.tgz
>
> In fact, at some point it may be worth auditing your website
> source and eliminating references to spark-project.org that
> should really be pointing back to ASF.
>
> Thanks,
> Roman.


Spark 0.8.0: bits need to come from ASF infrastructure

2013-09-25 Thread Roman Shaposhnik
Hi!

I see that the current download link published here:
  http://spark.incubator.apache.org/releases/spark-release-0-8-0.html
leads to:
  http://spark-project.org/download/spark-0.8.0-incubating.tgz

This needs to be corrected to be (roughly):
http://www.apache.org/dyn/closer.cgi/incubator/spark/spark-0.8.0-incubating/spark-0.8.0-incubating.tgz

In fact, at some point it may be worth auditing your website
source and eliminating references to spark-project.org that
should really be pointing back to ASF.

Thanks,
Roman.


Re: Spark Streaming threading model

2013-09-25 Thread Tathagata Das
On Wed, Sep 25, 2013 at 12:30 PM, Gerard Maas  wrote:

> Hi Tathagata,
>
> Many thanks for the extended answer and the clarifications on the Kafka
> data distribution in the cluster.
>
> There are many points to handle, so, to start somewhere:
>
> > Case (ii) could have been implemented as an actor as it just inserts a
> > record on an arraybuffer (i.e. a very small task). However, with rates
> > of more than 100K records received per second, I was unsure what the
> > overhead of sending each record as a message through the actor library
> > would be like.
> >
> I'm personally curious about this point. I could investigate by creating
> a simplified test scenario that isolates the data accumulator case and
> compares the performance of both models (actors vs threads with proper
> locking) under different levels of concurrency.
> Do you think this could be helpful for the project? I'm looking to
> contribute and this could be an interesting starting point.
>
>
Yes! Actors vs threads with locking is a great test to do, since for Kafka
(and who knows what other sources in the future) the block generator has to
support multi-threaded ingestion. I think one also needs to compare with a
single thread without locking (the current model). If a single thread
without locking is the fastest and a thread with locking is not so bad
compared to actors, then it may be better to leave the ingestion without
locks for maximum throughput for single-threaded sources (e.g. Socket, and
most others) and add a lock for multi-threaded sources like Kafka.
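
As a rough starting point for that comparison, something like the following
would do -- a toy stand-in for the block generator, not the actual Spark
Streaming code, with the actor-based variant left out and the record count
picked arbitrarily (and any numbers from it would need JVM warm-up runs to
mean much):

    import scala.collection.mutable.ArrayBuffer

    object IngestionSketch {
      val records = 1000000

      def time(body: => Unit): Long = {
        val start = System.nanoTime()
        body
        (System.nanoTime() - start) / 1000000  // millis
      }

      // Current model: a single receiver thread appending with no locking.
      def singleThreadNoLock(): Long = time {
        val buf = new ArrayBuffer[Int]()
        var i = 0
        while (i < records) { buf += i; i += 1 }
      }

      // Kafka-style: several threads appending to one buffer under a lock.
      def multiThreadWithLock(threads: Int): Long = time {
        val buf = new ArrayBuffer[Int]()
        val ts = (1 to threads).map { _ =>
          new Thread(new Runnable {
            def run() {
              var i = 0
              while (i < records / threads) {
                buf.synchronized { buf += i }
                i += 1
              }
            }
          })
        }
        ts.foreach(_.start())
        ts.foreach(_.join())
      }

      def main(args: Array[String]) {
        println("single thread, no lock: " + singleThreadNoLock() + " ms")
        println("4 threads, with lock:   " + multiThreadWithLock(4) + " ms")
      }
    }

An actor-based version would send each record as a message to a single
accumulating actor and measure the same end-to-end insertion time.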



> >> I probably went into more detail than you wanted to know. :)
> Absolutely not. The more, the better :-)
>
> -kr, Gerard.
>


Re: Spark Streaming threading model

2013-09-25 Thread Gerard Maas
Hi Tathagata,

Many thanks for the extended answer and the clarifications on the Kafka
data distribution in the cluster.

There are many points to handle, so, to start somewhere:

> Case (ii) could have been implemented as an actor as it just inserts a
> record on an arraybuffer (i.e. a very small task). However, with rates of
> more than 100K records received per second, I was unsure what the overhead
> of sending each record as a message through the actor library would be
> like.

I'm personally curious about this point. I could investigate by creating a
simplified test scenario that isolates the data accumulator case and
compares the performance of both models (actors vs threads with proper
locking) under different levels of concurrency.
Do you think this could be helpful for the project? I'm looking to
contribute and this could be an interesting starting point.

>> I probably went into more detail than you wanted to know. :)
Absolutely not. The more, the better :-)

-kr, Gerard.