Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-02-23 Thread Aaron Kern
+1


Re: [VOTE] Release Apache Spark 1.6.0 (RC4)

2015-12-22 Thread Aaron Davidson
+1

On Tue, Dec 22, 2015 at 7:01 PM, Josh Rosen 
wrote:

> +1
>
> On Tue, Dec 22, 2015 at 7:00 PM, Jeff Zhang  wrote:
>
>> +1
>>
>> On Wed, Dec 23, 2015 at 7:36 AM, Mark Hamstra 
>> wrote:
>>
>>> +1
>>>
>>> On Tue, Dec 22, 2015 at 12:10 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
 version 1.6.0!

 The vote is open until Friday, December 25, 2015 at 18:00 UTC and
 passes if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.6.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is *v1.6.0-rc4
 (4062cda3087ae42c6c3cb24508fc1d3a931accdf)
 *

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1176/

 The test repository (versioned as v1.6.0-rc4) for this release can be
 found at:
 https://repository.apache.org/content/repositories/orgapachespark-1175/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc4-docs/

 ===
 == How can I help test this release? ==
 ===
 If you are a Spark user, you can help us test this release by taking an
 existing Spark workload and running on this release candidate, then
 reporting any regressions.
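
 A minimal sketch of one way to do that with sbt, assuming a project built
 against the staged artifacts (the version string 1.6.0 is assumed to match
 what is published in the staging repository listed above):

   // build.sbt -- point a test build at the RC staging repository (testing only)
   resolvers += "Spark 1.6.0 RC4 staging" at
     "https://repository.apache.org/content/repositories/orgapachespark-1176/"
   libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"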

 
 == What justifies a -1 vote for this release? ==
 
 This vote is happening towards the end of the 1.6 QA period, so -1
 votes should only occur for significant regressions from 1.5. Bugs already
 present in 1.5, minor regressions, or bugs related to new features will not
 block this release.

 ===
 == What should happen to JIRA tickets still targeting 1.6.0? ==
 ===
 1. It is OK for documentation patches to target 1.6.0 and still go into
 branch-1.6, since documentation will be published separately from the
 release.
 2. New features for non-alpha-modules should target 1.7+.
 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
 target version.


 ==
 == Major changes to help you focus your testing ==
 ==

 Notable changes since 1.6 RC3

   - SPARK-12404 - Fix serialization error for Datasets with
 Timestamps/Arrays/Decimal
   - SPARK-12218 - Fix incorrect pushdown of filters to parquet
   - SPARK-12395 - Fix join columns of outer join for DataFrame using
   - SPARK-12413 - Fix mesos HA

 Notable changes since 1.6 RC2
 - SPARK_VERSION has been set correctly
 - SPARK-12199 ML Docs are publishing correctly
 - SPARK-12345 Mesos cluster mode has been fixed

 Notable changes since 1.6 RC1
 Spark Streaming

- SPARK-2629  
trackStateByKey has been renamed to mapWithState

 Spark SQL

- SPARK-12165 
SPARK-12189  Fix
bugs in eviction of storage memory by execution.
- SPARK-12258  
 correct
passing null into ScalaUDF

 Notable Features Since 1.5
 Spark SQL

- SPARK-11787  
 Parquet
Performance - Improve Parquet scan performance when using flat
schemas.
- SPARK-10810 
Session Management - Isolated default database (i.e. USE mydb) even
on shared clusters.
- SPARK-   Dataset
API - A type-safe API (similar to RDDs) that performs many
operations on serialized binary data and code generation (i.e. Project
Tungsten).
- SPARK-1  
 Unified
Memory Management - Shared memory for execution and caching instead
of exclusive division of the regions.

Re: Update to Spark Mesos docs possibly? LIBPROCESS_IP needs to be set for client mode

2015-12-16 Thread Aaron
Regarding the PR, sure, let me update the documentation; I'll send it out
shortly.  My fork is on GitHub. Is a PR from there OK?

Cheers,
Aaron

On Wed, Dec 16, 2015 at 11:33 AM, Timothy Chen  wrote:
> Yes, if you want to manually override which IP you are contacted on by the
> master, you can set LIBPROCESS_IP and LIBPROCESS_PORT.
>
> It is a Mesos-specific setting. We can definitely update the docs.
>
> Note that in the future as we move to use the new Mesos Http API these
> configurations won't be needed (also libmesos!).
>
> Tim
>
> On Dec 16, 2015, at 8:09 AM, Iulian Dragoș 
> wrote:
>
> LIBPROCESS_IP has zero hits in the Spark code base. This seems to be a
> Mesos-specific setting.
>
> Have you tried setting SPARK_LOCAL_IP?
>
> On Wed, Dec 16, 2015 at 5:07 PM, Aaron  wrote:
>>
>> Found this thread that talked about it to help understand it better:
>>
>>
>> https://mail-archives.apache.org/mod_mbox/mesos-user/201507.mbox/%3ccajq68qf9pejgnwomasm2dqchyaxpcaovnfkfgggxxpzj2jo...@mail.gmail.com%3E
>>
>> >
>> > When you run Spark on Mesos it needs to run
>> >
>> > spark driver
>> > mesos scheduler
>> >
>> > and both need to be visible to outside world on public iface IP
>> >
>> > you need to tell Spark and Mesos on which interface to bind - by default
>> > they resolve node hostname to ip - this is loopback address in your case
>> >
>> > Possible solutions - on slave node with public IP 192.168.56.50
>> >
>> > 1. Set
>> >
>> >export LIBPROCESS_IP=192.168.56.50
>> >export SPARK_LOCAL_IP=192.168.56.50
>> >
>> > 2. Ensure your hostname resolves to public iface IP - (for testing) edit
>> > /etc/hosts to resolve your domain name to 192.168.56.50
>> > 3. Set correct hostname/ip in mesos configuration - see Nikolaos answer
>> >
>>
>> Cheers,
>> Aaron
>>
>> On Wed, Dec 16, 2015 at 11:00 AM, Iulian Dragoș
>>  wrote:
>> > Hi Aaron,
>> >
>> > I never had to use that variable. What is it for?
>> >
>> > On Wed, Dec 16, 2015 at 2:00 PM, Aaron  wrote:
>> >>
>> >> In going through running various Spark jobs, both Spark 1.5.2 and the
>> >> new Spark 1.6 SNAPSHOTs, on a Mesos cluster (currently 0.25), we
>> >> noticed that in order to run the Spark shells (both Python and
>> >> Scala), we needed to set the LIBPROCESS_IP environment variable before
>> >> running.
>> >>
>> >> Was curious if the Spark on Mesos docs should be updated, under the
>> >> Client Mode section, to include setting this environment variable?
>> >>
>> >> Cheers
>> >> Aaron
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: dev-h...@spark.apache.org
>> >>
>> >
>> >
>> >
>> > --
>> >
>> > --
>> > Iulian Dragos
>> >
>> > --
>> > Reactive Apps on the JVM
>> > www.typesafe.com
>> >
>
>
>
>
> --
>
> --
> Iulian Dragos
>
> --
> Reactive Apps on the JVM
> www.typesafe.com
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Update to Spark Mesos docs possibly? LIBPROCESS_IP needs to be set for client mode

2015-12-16 Thread Aaron
Basically, my hostname doesn't resolve to an "accessible" IP
address, which isn't a big deal; I normally set SPARK_LOCAL_IP when I
am doing things on a YARN cluster.  But we've moved to a Mesos
cluster recently and had to track down why it wasn't working. I
assumed (badly, obviously) that setting SPARK_LOCAL_IP was
sufficient; it turns out the Mesos scheduler needs to be told as well.

Not sure if it would be a good idea to put this in actual code,
something like:  "when SPARK_LOCAL_IP is set and mesos:// is used as
the master, also set LIBPROCESS_IP," but some kind of documentation about
this possible issue would have saved me some time.
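
For reference, a minimal sketch of what such a doc note might look like for
Mesos client mode (the IP address and master URL below are placeholders):

  # On the host running the driver/shell, when its hostname does not resolve
  # to an address reachable from the Mesos master:
  export SPARK_LOCAL_IP=192.168.56.50   # address Spark should bind to
  export LIBPROCESS_IP=192.168.56.50    # address the Mesos library should bind to
  ./bin/spark-shell --master mesos://mesos-master:5050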

Cheers,
Aaron

On Wed, Dec 16, 2015 at 11:07 AM, Aaron  wrote:
> Found this thread that talked about it to help understand it better:
>
> https://mail-archives.apache.org/mod_mbox/mesos-user/201507.mbox/%3ccajq68qf9pejgnwomasm2dqchyaxpcaovnfkfgggxxpzj2jo...@mail.gmail.com%3E
>
>>
>> When you run Spark on Mesos it needs to run
>>
>> spark driver
>> mesos scheduler
>>
>> and both need to be visible to outside world on public iface IP
>>
>> you need to tell Spark and Mesos on which interface to bind - by default
>> they resolve node hostname to ip - this is loopback address in your case
>>
>> Possible solutions - on slave node with public IP 192.168.56.50
>>
>> 1. Set
>>
>>export LIBPROCESS_IP=192.168.56.50
>>export SPARK_LOCAL_IP=192.168.56.50
>>
>> 2. Ensure your hostname resolves to public iface IP - (for testing) edit
>> /etc/hosts to resolve your domain name to 192.168.56.50
>> 3. Set correct hostname/ip in mesos configuration - see Nikolaos answer
>>
>
> Cheers,
> Aaron
>
> On Wed, Dec 16, 2015 at 11:00 AM, Iulian Dragoș
>  wrote:
>> Hi Aaron,
>>
>> I never had to use that variable. What is it for?
>>
>> On Wed, Dec 16, 2015 at 2:00 PM, Aaron  wrote:
>>>
>>> In going through running various Spark jobs, both Spark 1.5.2 and the
>>> new Spark 1.6 SNAPSHOTs, on a Mesos cluster (currently 0.25), we
>>> noticed that in order to run the Spark shells (both Python and
>>> Scala), we needed to set the LIBPROCESS_IP environment variable before
>>> running.
>>>
>>> Was curious if the Spark on Mesos docs should be updated, under the
>>> Client Mode section, to include setting this environment variable?
>>>
>>> Cheers
>>> Aaron
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>
>>
>>
>> --
>>
>> --
>> Iulian Dragos
>>
>> --
>> Reactive Apps on the JVM
>> www.typesafe.com
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Update to Spark Mesos docs possibly? LIBPROCESS_IP needs to be set for client mode

2015-12-16 Thread Aaron
Found this thread that talked about it to help understand it better:

https://mail-archives.apache.org/mod_mbox/mesos-user/201507.mbox/%3ccajq68qf9pejgnwomasm2dqchyaxpcaovnfkfgggxxpzj2jo...@mail.gmail.com%3E

>
> When you run Spark on Mesos it needs to run
>
> spark driver
> mesos scheduler
>
> and both need to be visible to outside world on public iface IP
>
> you need to tell Spark and Mesos on which interface to bind - by default
> they resolve node hostname to ip - this is loopback address in your case
>
> Possible solutions - on slave node with public IP 192.168.56.50
>
> 1. Set
>
>export LIBPROCESS_IP=192.168.56.50
>export SPARK_LOCAL_IP=192.168.56.50
>
> 2. Ensure your hostname resolves to public iface IP - (for testing) edit
> /etc/hosts to resolve your domain name to 192.168.56.50
> 3. Set correct hostname/ip in mesos configuration - see Nikolaos answer
>

Cheers,
Aaron

On Wed, Dec 16, 2015 at 11:00 AM, Iulian Dragoș
 wrote:
> Hi Aaron,
>
> I never had to use that variable. What is it for?
>
> On Wed, Dec 16, 2015 at 2:00 PM, Aaron  wrote:
>>
>> In going through running various Spark jobs, both Spark 1.5.2 and the
>> new Spark 1.6 SNAPSHOTs, on a Mesos cluster (currently 0.25), we
>> noticed that in order to run the Spark shells (both Python and
>> Scala), we needed to set the LIBPROCESS_IP environment variable before
>> running.
>>
>> Was curious if the Spark on Mesos docs should be updated, under the
>> Client Mode section, to include setting this environment variable?
>>
>> Cheers
>> Aaron
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
>
>
> --
>
> --
> Iulian Dragos
>
> --
> Reactive Apps on the JVM
> www.typesafe.com
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Update to Spark Mesos docs possibly? LIBPROCESS_IP needs to be set for client mode

2015-12-16 Thread Aaron
The more I read/look into this, the clearer it is that the Spark Mesos scheduler
resolves to something that can't be reached (e.g. localhost).  So maybe
this just needs to be added to the docs, for the case where your host cannot or does
not resolve to the IP address you want.  Maybe just a footnote or
something?



On Wed, Dec 16, 2015 at 8:00 AM, Aaron  wrote:
> In going through running various Spark jobs, both Spark 1.5.2 and the
> new Spark 1.6 SNAPSHOTs, on a Mesos cluster (currently 0.25), we
> noticed that in order to run the Spark shells (both Python and
> Scala), we needed to set the LIBPROCESS_IP environment variable before
> running.
>
> Was curious if the Spark on Mesos docs should be updated, under the
> Client Mode section, to include setting this environment variable?
>
> Cheers
> Aaron

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Update to Spark Mesos docs possibly? LIBPROCESS_IP needs to be set for client mode

2015-12-16 Thread Aaron
In going through running various Spark jobs, both Spark 1.5.2 and the
new Spark 1.6 SNAPSHOTs, on a Mesos cluster (currently 0.25), we
noticed that in order to run the Spark shells (both Python and
Scala), we needed to set the LIBPROCESS_IP environment variable before
running.

Was curious if the Spark on Mesos docs should be updated, under the
Client Mode section, to include setting this environment variable?

Cheers
Aaron

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



-Phive-thriftserver when compiling for use in pyspark and JDBC connections

2015-07-21 Thread Aaron
I compile/make a distribution, with either the 1.4 branch or master,
using -Phive-thriftserver, and attempt a JDBC connection to a
MySQL DB using the latest connector (5.1.36) jar.

When I set up the pyspark shell with:

bin/pyspark --jars mysql-connection...jar --driver-class-path
mysql-connector..jar

and I make a DataFrame from sqlContext.read.jdbc("jdbc://...)

and perhaps call df.show(),

things seem to work; but if I compile without
-Phive-thriftserver, the same Python code, in the same settings (spark-env,
spark-defaults), just hangs, never to return.

I am curious how the hive-thriftserver module plays into this type of
interaction.
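
For concreteness, a rough sketch of the kind of JDBC read involved, written in
Scala via spark-shell (the jar name, URL, and table below are placeholders; the
pyspark calls above have the same shape):

  // launched with: bin/spark-shell --jars mysql-connector-java-5.1.36.jar \
  //   --driver-class-path mysql-connector-java-5.1.36.jar
  val df = sqlContext.read.format("jdbc").options(Map(
    "url"     -> "jdbc:mysql://dbhost:3306/mydb",
    "dbtable" -> "some_table",
    "driver"  -> "com.mysql.jdbc.Driver")).load()
  df.show()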

Thanks in advance.

Cheers,
Aaron

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: External Shuffle service over yarn

2015-06-26 Thread Aaron Davidson
A second advantage is that it allows individual Executors to go into a GC
pause (or even crash) while still allowing other Executors to read their shuffle
data and make progress, which tends to improve the stability of memory-intensive
jobs.
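
For anyone looking for the concrete setup, a rough sketch of the usual
configuration (property and class names as I recall them; please double-check
against the docs for your version):

  # yarn-site.xml on each NodeManager: add "spark_shuffle" to
  #   yarn.nodemanager.aux-services, and set
  #   yarn.nodemanager.aux-services.spark_shuffle.class =
  #     org.apache.spark.network.yarn.YarnShuffleService
  # spark-defaults.conf for the application:
  spark.shuffle.service.enabled     true
  spark.dynamicAllocation.enabled   true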

On Thu, Jun 25, 2015 at 11:42 PM, Sandy Ryza 
wrote:

> Hi Yash,
>
> One of the main advantages is that, if you turn dynamic allocation on, and
> executors are discarded, your application is still able to get at the
> shuffle data that they wrote out.
>
> -Sandy
>
> On Thu, Jun 25, 2015 at 11:08 PM, yash datta  wrote:
>
>> Hi devs,
>>
>> Can someone point out if there are any distinct advantages of using
>> the external shuffle service over YARN (it runs on the node manager as an auxiliary
>> service,
>>
>> https://issues.apache.org/jira/browse/SPARK-3797)  instead of the
>> default execution in the executor containers?
>>
>> Please also mention if you have seen any differences having used both
>> ways?
>>
>> Thanks and Best Regards
>> Yash
>>
>> --
>> When events unfold with calm and ease
>> When the winds that blow are merely breeze
>> Learn from nature, from birds and bees
>> Live your life in love, and let joy not cease.
>>
>
>


Re: hadoop input/output format advanced control

2015-03-25 Thread Aaron Davidson
Should we mention that you should synchronize
on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK to avoid a possible race
condition in cloning Hadoop Configuration objects prior to Hadoop 2.7.0? :)
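
To make the clone-the-conf idea quoted below concrete, a minimal Scala sketch
(the path and split size are placeholders; this uses the new Hadoop API so the
cloned conf can be passed per call):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  // Clone the SparkContext's Hadoop conf and override settings for just this RDD.
  // (Prior to Hadoop 2.7.0, cloning a Configuration concurrently can race.)
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("mapred.min.split.size", (512L * 1024 * 1024).toString)

  val lines = sc.newAPIHadoopFile("/data/input",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    .map(_._2.toString)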

On Wed, Mar 25, 2015 at 7:16 PM, Patrick Wendell  wrote:

> Great - that's even easier. Maybe we could have a simple example in the
> doc.
>
> On Wed, Mar 25, 2015 at 7:06 PM, Sandy Ryza 
> wrote:
> > Regarding Patrick's question, you can just do "new
> Configuration(oldConf)"
> > to get a cloned Configuration object and add any new properties to it.
> >
> > -Sandy
> >
> > On Wed, Mar 25, 2015 at 4:42 PM, Imran Rashid 
> wrote:
> >
> >> Hi Nick,
> >>
> >> I don't remember the exact details of these scenarios, but I think the
> user
> >> wanted a lot more control over how the files got grouped into
> partitions,
> >> to group the files together by some arbitrary function.  I didn't think
> >> that was possible w/ CombineFileInputFormat, but maybe there is a way?
> >>
> >> thanks
> >>
> >> On Tue, Mar 24, 2015 at 1:50 PM, Nick Pentreath <
> nick.pentre...@gmail.com>
> >> wrote:
> >>
> >> > Imran, on your point to read multiple files together in a partition,
> is
> >> it
> >> > not simpler to use the approach of copy Hadoop conf and set per-RDD
> >> > settings for min split to control the input size per partition,
> together
> >> > with something like CombineFileInputFormat?
> >> >
> >> > On Tue, Mar 24, 2015 at 5:28 PM, Imran Rashid 
> >> > wrote:
> >> >
> >> > > I think this would be a great addition, I totally agree that you
> need
> >> to
> >> > be
> >> > > able to set these at a finer context than just the SparkContext.
> >> > >
> >> > > Just to play devil's advocate, though -- the alternative is for you
> >> just
> >> > > subclass HadoopRDD yourself, or make a totally new RDD, and then you
> >> > could
> >> > > expose whatever you need.  Why is this solution better?  IMO the
> >> criteria
> >> > > are:
> >> > > (a) common operations
> >> > > (b) error-prone / difficult to implement
> >> > > (c) non-obvious, but important for performance
> >> > >
> >> > > I think this case fits (a) & (c), so I think its still worthwhile.
> But
> >> > its
> >> > > also worth asking whether or not its too difficult for a user to
> extend
> >> > > HadoopRDD right now.  There have been several cases in the past week
> >> > where
> >> > > we've suggested that a user should read from hdfs themselves (eg.,
> to
> >> > read
> >> > > multiple files together in one partition) -- with*out* reusing the
> code
> >> > in
> >> > > HadoopRDD, though they would lose things like the metric tracking &
> >> > > preferred locations you get from HadoopRDD.  Does HadoopRDD need to
> >> some
> >> > > refactoring to make that easier to do?  Or do we just need a good
> >> > example?
> >> > >
> >> > > Imran
> >> > >
> >> > > (sorry for hijacking your thread, Koert)
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Mar 23, 2015 at 3:52 PM, Koert Kuipers 
> >> > wrote:
> >> > >
> >> > > > see email below. reynold suggested i send it to dev instead of
> user
> >> > > >
> >> > > > -- Forwarded message --
> >> > > > From: Koert Kuipers 
> >> > > > Date: Mon, Mar 23, 2015 at 4:36 PM
> >> > > > Subject: hadoop input/output format advanced control
> >> > > > To: "u...@spark.apache.org" 
> >> > > >
> >> > > >
> >> > > > currently its pretty hard to control the Hadoop Input/Output
> formats
> >> > used
> >> > > > in Spark. The conventions seems to be to add extra parameters to
> all
> >> > > > methods and then somewhere deep inside the code (for example in
> >> > > > PairRDDFunctions.saveAsHadoopFile) all these parameters get
> >> translated
> >> > > into
> >> > > > settings on the Hadoop Configuration object.
> >> > > >
> >> > > > for example for compression i see "codec: Option[Class[_ <:
> >> > > > CompressionCodec]] = None" added to a bunch of methods.
> >> > > >
> >> > > > how scalable is this solution really?
> >> > > >
> >> > > > for example i need to read from a hadoop dataset and i dont want
> the
> >> > > input
> >> > > > (part) files to get split up. the way to do this is to set
> >> > > > "mapred.min.split.size". now i dont want to set this at the level
> of
> >> > the
> >> > > > SparkContext (which can be done), since i dont want it to apply to
> >> > input
> >> > > > formats in general. i want it to apply to just this one specific
> >> input
> >> > > > dataset i need to read. which leaves me with no options
> currently. i
> >> > > could
> >> > > > go add yet another input parameter to all the methods
> >> > > > (SparkContext.textFile, SparkContext.hadoopFile,
> >> > SparkContext.objectFile,
> >> > > > etc.). but that seems ineffective.
> >> > > >
> >> > > > why can we not expose a Map[String, String] or some other generic
> way
> >> > to
> >> > > > manipulate settings for hadoop input/output formats? it would
> require
> >> > > > adding one more parameter to all methods to deal with hadoop
> >> > input/output
> >> > > > formats, but after that

Re: enum-like types in Spark

2015-03-23 Thread Aaron Davidson
The only issue I knew of with Java enums was that it does not appear in the
Scala documentation.

On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen  wrote:

> Yeah the fully realized #4, which gets back the ability to use it in
> switch statements (? in Scala but not Java?) does end up being kind of
> huge.
>
> I confess I'm swayed a bit back to Java enums, seeing what it
> involves. The hashCode() issue can be 'solved' with the hash of the
> String representation.
>
> On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid 
> wrote:
> > I've just switched some of my code over to the new format, and I just
> want
> > to make sure everyone realizes what we are getting into.  I went from 10
> > lines as java enums
> >
> >
> https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
> >
> > to 30 lines with the new format:
> >
> >
> https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250
> >
> > its not just that its verbose.  each name has to be repeated 4 times,
> with
> > potential typos in some locations that won't be caught by the compiler.
> > Also, you have to manually maintain the "values" as you update the set of
> > enums, the compiler won't do it for you.
> >
> > The only downside I've heard for java enums is enum.hashcode().  OTOH,
> the
> > downsides for this version are: maintainability / verbosity, no values(),
> > more cumbersome to use from java, no enum map / enumset.
> >
> > I did put together a little util to at least get back the equivalent of
> > enum.valueOf() with this format
> >
> >
> https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala
> >
> > I'm not trying to prevent us from moving forward on this, its fine if
> this
> > is still what everyone wants, but I feel pretty strongly java enums make
> > more sense.
> >
> > thanks,
> > Imran
>


Re: Block Transfer Service encryption support

2015-03-16 Thread Aaron Davidson
Out of curiosity, why could we not use Netty's SslHandler injected into the
TransportContext pipeline?
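
For context, a rough sketch of what injecting Netty's SslHandler usually looks
like (the SSLContext setup is omitted and assumed to exist; this is not the
actual Spark transport code):

  import io.netty.channel.socket.SocketChannel
  import io.netty.handler.ssl.SslHandler
  import javax.net.ssl.SSLContext

  // Put an SslHandler at the front of the channel pipeline so traffic is encrypted.
  def addSsl(ch: SocketChannel, sslContext: SSLContext, client: Boolean): Unit = {
    val engine = sslContext.createSSLEngine()
    engine.setUseClientMode(client)
    ch.pipeline().addFirst("ssl", new SslHandler(engine))
  }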

On Mon, Mar 16, 2015 at 7:56 PM, turp1twin  wrote:

> Hey Patrick,
>
> Sorry for the delay, I was at Elastic{ON} last week and well, my day job
> has
> been keeping me busy... I went ahead and opened a Jira feature request,
> https://issues.apache.org/jira/browse/SPARK-6373. In it I reference a
> commit
> I made in my fork which is a "rough" implementation, definitely still a
> WIP.
> Would like to iterate the design if possible, as there are some performance
> trade offs for using SSL for sure.. Zero copy will not be possible with
> SSL,
> so there will definitely be a hit there.. That being said, for my use case,
> which is health care related and involves processing personal health
> information, I have no choice, as all data must be encrypted in transit and
> at rest... Cheers!
>
> Jeff
>
>
> https://github.com/turp1twin/spark/commit/024b559f27945eb63068d1badf7f82e4e7c3621c
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Block-Transfer-Service-encryption-support-tp10934p11089.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: enum-like types in Spark

2015-03-16 Thread Aaron Davidson
It's unrelated to the proposal, but Enum#ordinal() should be much faster,
assuming it's not serialized to JVMs with different versions of the enum :)

On Mon, Mar 16, 2015 at 12:12 PM, Kevin Markey 
wrote:

> In some applications, I have rather heavy use of Java enums which are
> needed for related Java APIs that the application uses.  And unfortunately,
> they are also used as keys.  As such, using the native hashcodes makes any
> function over keys unstable and unpredictable, so we now use Enum.name() as
> the key instead.  Oh well.  But it works and seems to work well.
>
> Kevin
>
>
> On 03/05/2015 09:49 PM, Mridul Muralidharan wrote:
>
>>I have a strong dislike for java enum's due to the fact that they
>> are not stable across JVM's - if it undergoes serde, you end up with
>> unpredictable results at times [1].
>> One of the reasons why we prevent enum's from being key : though it is
>> highly possible users might depend on it internally and shoot
>> themselves in the foot.
>>
>> Would be better to keep away from them in general and use something more
>> stable.
>>
>> Regards,
>> Mridul
>>
>> [1] Having had to debug this issue for 2 weeks - I really really hate it.
>>
>>
>> On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid 
>> wrote:
>>
>>> I have a very strong dislike for #1 (scala enumerations).   I'm ok with
>>> #4
>>> (with Xiangrui's final suggestion, especially making it sealed &
>>> available
>>> in Java), but I really think #2, java enums, are the best option.
>>>
>>> Java enums actually have some very real advantages over the other
>>> approaches -- you get values(), valueOf(), EnumSet, and EnumMap.  There
>>> has
>>> been endless debate in the Scala community about the problems with the
>>> approaches in Scala.  Very smart, level-headed Scala gurus have
>>> complained
>>> about their short-comings (Rex Kerr's name is coming to mind, though I'm
>>> not positive about that); there have been numerous well-thought out
>>> proposals to give Scala a better enum.  But the powers-that-be in Scala
>>> always reject them.  IIRC the explanation for rejecting is basically that
>>> (a) enums aren't important enough for introducing some new special
>>> feature,
>>> scala's got bigger things to work on and (b) if you really need a good
>>> enum, just use java's enum.
>>>
>>> I doubt it really matters that much for Spark internals, which is why I
>>> think #4 is fine.  But I figured I'd give my spiel, because every
>>> developer
>>> loves language wars :)
>>>
>>> Imran
>>>
>>>
>>>
>>> On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng  wrote:
>>>
>>>  `case object` inside an `object` doesn't show up in Java. This is the
>>>> minimal code I found to make everything show up correctly in both
>>>> Scala and Java:
>>>>
>>>> sealed abstract class StorageLevel // cannot be a trait
>>>>
>>>> object StorageLevel {
>>>>private[this] case object _MemoryOnly extends StorageLevel
>>>>final val MemoryOnly: StorageLevel = _MemoryOnly
>>>>
>>>>private[this] case object _DiskOnly extends StorageLevel
>>>>final val DiskOnly: StorageLevel = _DiskOnly
>>>> }
>>>>
>>>> On Wed, Mar 4, 2015 at 8:10 PM, Patrick Wendell 
>>>> wrote:
>>>>
>>>>> I like #4 as well and agree with Aaron's suggestion.
>>>>>
>>>>> - Patrick
>>>>>
>>>>> On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson 
>>>>>
>>>> wrote:
>>>>
>>>>> I'm cool with #4 as well, but make sure we dictate that the values
>>>>>>
>>>>> should
>>>>
>>>>> be defined within an object with the same name as the enumeration (like
>>>>>>
>>>>> we
>>>>
>>>>> do for StorageLevel). Otherwise we may pollute a higher namespace.
>>>>>>
>>>>>> e.g. we SHOULD do:
>>>>>>
>>>>>> trait StorageLevel
>>>>>> object StorageLevel {
>>>>>>case object MemoryOnly extends StorageLevel
>>>>>>case object DiskOnly extends StorageLevel
>>>>>> }
>>>>>>
>>>>>> On Wed, Mar

Re: enum-like types in Spark

2015-03-09 Thread Aaron Davidson
Perhaps the problem with Java enums that was brought up was actually that
their hashCode is not stable across JVMs, as it depends on the memory
location of the enum itself.

On Mon, Mar 9, 2015 at 6:15 PM, Imran Rashid  wrote:

> Can you expand on the serde issues w/ java enum's at all?  I haven't heard
> of any problems specific to enums.  The java object serialization rules
> seem very clear and it doesn't seem like different jvms should have a
> choice on what they do:
>
>
> http://docs.oracle.com/javase/6/docs/platform/serialization/spec/serial-arch.html#6469
>
> (in a nutshell, serialization must use enum.name())
>
> of course there are plenty of ways the user could screw this up(eg. rename
> the enums, or change their meaning, or remove them).  But then again, all
> of java serialization has issues w/ serialization the user has to be aware
> of.  Eg., if we go with case objects, than java serialization blows up if
> you add another helper method, even if that helper method is completely
> compatible.
>
> Some prior debate in the scala community:
>
> https://groups.google.com/d/msg/scala-internals/8RWkccSRBxQ/AN5F_ZbdKIsJ
>
> SO post on which version to use in scala:
>
>
> http://stackoverflow.com/questions/1321745/how-to-model-type-safe-enum-types
>
> SO post about the macro-craziness people try to add to scala to make them
> almost as good as a simple java enum:
> (NB: the accepted answer doesn't actually work in all cases ...)
>
>
> http://stackoverflow.com/questions/20089920/custom-scala-enum-most-elegant-version-searched
>
> Another proposal to add better enums built into scala ... but seems to be
> dormant:
>
> https://groups.google.com/forum/#!topic/scala-sips/Bf82LxK02Kk
>
>
>
> On Thu, Mar 5, 2015 at 10:49 PM, Mridul Muralidharan 
> wrote:
>
> >   I have a strong dislike for java enum's due to the fact that they
> > are not stable across JVM's - if it undergoes serde, you end up with
> > unpredictable results at times [1].
> > One of the reasons why we prevent enum's from being key : though it is
> > highly possible users might depend on it internally and shoot
> > themselves in the foot.
> >
> > Would be better to keep away from them in general and use something more
> > stable.
> >
> > Regards,
> > Mridul
> >
> > [1] Having had to debug this issue for 2 weeks - I really really hate it.
> >
> >
> > On Thu, Mar 5, 2015 at 1:08 PM, Imran Rashid 
> wrote:
> > > I have a very strong dislike for #1 (scala enumerations).   I'm ok with
> > #4
> > > (with Xiangrui's final suggestion, especially making it sealed &
> > available
> > > in Java), but I really think #2, java enums, are the best option.
> > >
> > > Java enums actually have some very real advantages over the other
> > > approaches -- you get values(), valueOf(), EnumSet, and EnumMap.  There
> > has
> > > been endless debate in the Scala community about the problems with the
> > > approaches in Scala.  Very smart, level-headed Scala gurus have
> > complained
> > > about their short-comings (Rex Kerr's name is coming to mind, though
> I'm
> > > not positive about that); there have been numerous well-thought out
> > > proposals to give Scala a better enum.  But the powers-that-be in Scala
> > > always reject them.  IIRC the explanation for rejecting is basically
> that
> > > (a) enums aren't important enough for introducing some new special
> > feature,
> > > scala's got bigger things to work on and (b) if you really need a good
> > > enum, just use java's enum.
> > >
> > > I doubt it really matters that much for Spark internals, which is why I
> > > think #4 is fine.  But I figured I'd give my spiel, because every
> > developer
> > > loves language wars :)
> > >
> > > Imran
> > >
> > >
> > >
> > > On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng 
> wrote:
> > >
> > >> `case object` inside an `object` doesn't show up in Java. This is the
> > >> minimal code I found to make everything show up correctly in both
> > >> Scala and Java:
> > >>
> > >> sealed abstract class StorageLevel // cannot be a trait
> > >>
> > >> object StorageLevel {
> > >>   private[this] case object _MemoryOnly extends StorageLevel
> > >>   final val MemoryOnly: StorageLevel = _MemoryOnly
> > >>
> > >>   private[this] case object _DiskOnly extends StorageLev

Re: Which OutputCommitter to use for S3?

2015-03-05 Thread Aaron Davidson
Yes, unfortunately that direct dependency makes this injection much more
difficult for saveAsParquetFile.
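
For reference, a rough sketch of wiring a direct committer into the old
Hadoop API path (the committer class name is a placeholder for whatever
DirectOutputCommitter implementation you use; as noted, this does not help
saveAsParquetFile):

  import org.apache.spark.SparkConf

  // Only affects old-Hadoop-API save paths such as saveAsHadoopFile.
  val conf = new SparkConf()
    .set("spark.hadoop.mapred.output.committer.class",
      "com.example.DirectOutputCommitter")  // placeholder class name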

On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee  wrote:

> Thanks for the DirectOutputCommitter example.
> However I found it only works for saveAsHadoopFile. What about
> saveAsParquetFile?
> It looks like SparkSQL is using ParquetOutputCommitter, which is subclass
> of FileOutputCommitter.
>
> On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor <
> thomas.dem...@amplidata.com>
> wrote:
>
> > FYI. We're currently addressing this at the Hadoop level in
> > https://issues.apache.org/jira/browse/HADOOP-9565
> >
> >
> > Thomas Demoor
> >
> > On Mon, Feb 23, 2015 at 10:16 PM, Darin McBeath <
> > ddmcbe...@yahoo.com.invalid> wrote:
> >
> >> Just to close the loop in case anyone runs into the same problem I had.
> >>
> >> By setting --hadoop-major-version=2 when using the ec2 scripts,
> >> everything worked fine.
> >>
> >> Darin.
> >>
> >>
> >> - Original Message -
> >> From: Darin McBeath 
> >> To: Mingyu Kim ; Aaron Davidson 
> >> Cc: "u...@spark.apache.org" 
> >> Sent: Monday, February 23, 2015 3:16 PM
> >> Subject: Re: Which OutputCommitter to use for S3?
> >>
> >> Thanks.  I think my problem might actually be the other way around.
> >>
> >> I'm compiling with hadoop 2,  but when I startup Spark, using the ec2
> >> scripts, I don't specify a
> >> -hadoop-major-version and the default is 1.   I'm guessing that if I
> make
> >> that a 2 that it might work correctly.  I'll try it and post a response.
> >>
> >>
> >> - Original Message -
> >> From: Mingyu Kim 
> >> To: Darin McBeath ; Aaron Davidson <
> >> ilike...@gmail.com>
> >> Cc: "u...@spark.apache.org" 
> >> Sent: Monday, February 23, 2015 3:06 PM
> >> Subject: Re: Which OutputCommitter to use for S3?
> >>
> >> Cool, we will start from there. Thanks Aaron and Josh!
> >>
> >> Darin, it¹s likely because the DirectOutputCommitter is compiled with
> >> Hadoop 1 classes and you¹re running it with Hadoop 2.
> >> org.apache.hadoop.mapred.JobContext used to be a class in Hadoop 1, and
> it
> >> became an interface in Hadoop 2.
> >>
> >> Mingyu
> >>
> >>
> >>
> >>
> >>
> >> On 2/23/15, 11:52 AM, "Darin McBeath" 
> >> wrote:
> >>
> >> >Aaron.  Thanks for the class. Since I'm currently writing Java based
> >> >Spark applications, I tried converting your class to Java (it seemed
> >> >pretty straightforward).
> >> >
> >> >I set up the use of the class as follows:
> >> >
> >> >SparkConf conf = new SparkConf()
> >> >.set("spark.hadoop.mapred.output.committer.class",
> >> >"com.elsevier.common.DirectOutputCommitter");
> >> >
> >> >And I then try and save a file to S3 (which I believe should use the
> old
> >> >hadoop apis).
> >> >
> >> >JavaPairRDD newBaselineRDDWritable =
> >> >reducedhsfPairRDD.mapToPair(new ConvertToWritableTypes());
> >> >newBaselineRDDWritable.saveAsHadoopFile(baselineOutputBucketFile,
> >> >Text.class, Text.class, SequenceFileOutputFormat.class,
> >> >org.apache.hadoop.io.compress.GzipCodec.class);
> >> >
> >> >But, I get the following error message.
> >> >
> >> >Exception in thread "main" java.lang.IncompatibleClassChangeError:
> Found
> >> >class org.apache.hadoop.mapred.JobContext, but interface was expected
> >> >at
> >>
> >>
> >com.elsevier.common.DirectOutputCommitter.commitJob(DirectOutputCommitter.
> >> >java:68)
> >> >at
> >>
> >org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:127)
> >> >at
> >>
> >>
> >org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions
> >> >.scala:1075)
> >> >at
> >>
> >>
> >org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.sc
> >> >ala:940)
> >> >at
> >>
> >>
> >org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.sc
> >> >ala:902)
> >> >at
> >>
> >>
> >org.apache.spark.api.ja

Re: enum-like types in Spark

2015-03-04 Thread Aaron Davidson
That's kinda annoying, but it's just a little extra boilerplate. Can you
call it as StorageLevel.DiskOnly() from Java? Would it also work if they
were case classes with empty constructors, without the field?

On Wed, Mar 4, 2015 at 11:35 PM, Xiangrui Meng  wrote:

> `case object` inside an `object` doesn't show up in Java. This is the
> minimal code I found to make everything show up correctly in both
> Scala and Java:
>
> sealed abstract class StorageLevel // cannot be a trait
>
> object StorageLevel {
>   private[this] case object _MemoryOnly extends StorageLevel
>   final val MemoryOnly: StorageLevel = _MemoryOnly
>
>   private[this] case object _DiskOnly extends StorageLevel
>   final val DiskOnly: StorageLevel = _DiskOnly
> }
>
> On Wed, Mar 4, 2015 at 8:10 PM, Patrick Wendell 
> wrote:
> > I like #4 as well and agree with Aaron's suggestion.
> >
> > - Patrick
> >
> > On Wed, Mar 4, 2015 at 6:07 PM, Aaron Davidson 
> wrote:
> >> I'm cool with #4 as well, but make sure we dictate that the values
> should
> >> be defined within an object with the same name as the enumeration (like
> we
> >> do for StorageLevel). Otherwise we may pollute a higher namespace.
> >>
> >> e.g. we SHOULD do:
> >>
> >> trait StorageLevel
> >> object StorageLevel {
> >>   case object MemoryOnly extends StorageLevel
> >>   case object DiskOnly extends StorageLevel
> >> }
> >>
> >> On Wed, Mar 4, 2015 at 5:37 PM, Michael Armbrust <
> mich...@databricks.com>
> >> wrote:
> >>
> >>> #4 with a preference for CamelCaseEnums
> >>>
> >>> On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley 
> >>> wrote:
> >>>
> >>> > another vote for #4
> >>> > People are already used to adding "()" in Java.
> >>> >
> >>> >
> >>> > On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch 
> >>> wrote:
> >>> >
> >>> > > #4 but with MemoryOnly (more scala-like)
> >>> > >
> >>> > > http://docs.scala-lang.org/style/naming-conventions.html
> >>> > >
> >>> > > Constants, Values, Variable and Methods
> >>> > >
> >>> > > Constant names should be in upper camel case. That is, if the
> member is
> >>> > > final, immutable and it belongs to a package object or an object,
> it
> >>> may
> >>> > be
> >>> > > considered a constant (similar to Java'sstatic final members):
> >>> > >
> >>> > >
> >>> > >1. object Container {
> >>> > >2. val MyConstant = ...
> >>> > >3. }
> >>> > >
> >>> > >
> >>> > > 2015-03-04 17:11 GMT-08:00 Xiangrui Meng :
> >>> > >
> >>> > > > Hi all,
> >>> > > >
> >>> > > > There are many places where we use enum-like types in Spark, but
> in
> >>> > > > different ways. Every approach has both pros and cons. I wonder
> >>> > > > whether there should be an "official" approach for enum-like
> types in
> >>> > > > Spark.
> >>> > > >
> >>> > > > 1. Scala's Enumeration (e.g., SchedulingMode, WorkerState, etc)
> >>> > > >
> >>> > > > * All types show up as Enumeration.Value in Java.
> >>> > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html
> >>> > > >
> >>> > > > 2. Java's Enum (e.g., SaveMode, IOMode)
> >>> > > >
> >>> > > > * Implementation must be in a Java file.
> >>> > > > * Values doesn't show up in the ScalaDoc:
> >>> > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode
> >>> > > >
> >>> > > > 3. Static fields in Java (e.g., TripletFields)
> >>> > > >
> >>> > > > * Implementation must be in a Java file.
> >>> > > > * Doesn't need "()" in Java code.
> >>> > > > * Values don't show up in the ScalaDoc:
> >>> > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields
> >>> > > >
> >>> > > > 4. Objects in Scala. (e.g., StorageLevel)
> >>> > > >
> >>> > > > * Needs "()" in Java code.
> >>> > > > * Values show up in both ScalaDoc and JavaDoc:
> >>> > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
> >>> > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html
> >>> > > >
> >>> > > > It would be great if we have an "official" approach for this as
> well
> >>> > > > as the naming convention for enum-like values ("MEMORY_ONLY" or
> >>> > > > "MemoryOnly"). Personally, I like 4) with "MEMORY_ONLY". Any
> >>> thoughts?
> >>> > > >
> >>> > > > Best,
> >>> > > > Xiangrui
> >>> > > >
> >>> > > >
> -
> >>> > > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> > > > For additional commands, e-mail: dev-h...@spark.apache.org
> >>> > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
>


Re: enum-like types in Spark

2015-03-04 Thread Aaron Davidson
I'm cool with #4 as well, but make sure we dictate that the values should
be defined within an object with the same name as the enumeration (like we
do for StorageLevel). Otherwise we may pollute a higher namespace.

e.g. we SHOULD do:

trait StorageLevel
object StorageLevel {
  case object MemoryOnly extends StorageLevel
  case object DiskOnly extends StorageLevel
}

On Wed, Mar 4, 2015 at 5:37 PM, Michael Armbrust 
wrote:

> #4 with a preference for CamelCaseEnums
>
> On Wed, Mar 4, 2015 at 5:29 PM, Joseph Bradley 
> wrote:
>
> > another vote for #4
> > People are already used to adding "()" in Java.
> >
> >
> > On Wed, Mar 4, 2015 at 5:14 PM, Stephen Boesch 
> wrote:
> >
> > > #4 but with MemoryOnly (more scala-like)
> > >
> > > http://docs.scala-lang.org/style/naming-conventions.html
> > >
> > > Constants, Values, Variable and Methods
> > >
> > > Constant names should be in upper camel case. That is, if the member is
> > > final, immutable and it belongs to a package object or an object, it
> may
> > be
> > > considered a constant (similar to Java’sstatic final members):
> > >
> > >
> > >1. object Container {
> > >2. val MyConstant = ...
> > >3. }
> > >
> > >
> > > 2015-03-04 17:11 GMT-08:00 Xiangrui Meng :
> > >
> > > > Hi all,
> > > >
> > > > There are many places where we use enum-like types in Spark, but in
> > > > different ways. Every approach has both pros and cons. I wonder
> > > > whether there should be an “official” approach for enum-like types in
> > > > Spark.
> > > >
> > > > 1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc)
> > > >
> > > > * All types show up as Enumeration.Value in Java.
> > > >
> > > >
> > >
> >
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html
> > > >
> > > > 2. Java’s Enum (e.g., SaveMode, IOMode)
> > > >
> > > > * Implementation must be in a Java file.
> > > > * Values doesn’t show up in the ScalaDoc:
> > > >
> > > >
> > >
> >
> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode
> > > >
> > > > 3. Static fields in Java (e.g., TripletFields)
> > > >
> > > > * Implementation must be in a Java file.
> > > > * Doesn’t need “()” in Java code.
> > > > * Values don't show up in the ScalaDoc:
> > > >
> > > >
> > >
> >
> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields
> > > >
> > > > 4. Objects in Scala. (e.g., StorageLevel)
> > > >
> > > > * Needs “()” in Java code.
> > > > * Values show up in both ScalaDoc and JavaDoc:
> > > >
> > > >
> > >
> >
> http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$
> > > >
> > > >
> > >
> >
> http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html
> > > >
> > > > It would be great if we have an “official” approach for this as well
> > > > as the naming convention for enum-like values (“MEMORY_ONLY” or
> > > > “MemoryOnly”). Personally, I like 4) with “MEMORY_ONLY”. Any
> thoughts?
> > > >
> > > > Best,
> > > > Xiangrui
> > > >
> > > > -
> > > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > > For additional commands, e-mail: dev-h...@spark.apache.org
> > > >
> > > >
> > >
> >
>


Re: [Performance] Possible regression in rdd.take()?

2015-02-18 Thread Aaron Davidson
You might be seeing the result of this patch:

https://github.com/apache/spark/commit/d069c5d9d2f6ce06389ca2ddf0b3ae4db72c5797

which was introduced in 1.1.1. This patch disabled the ability for take()
to run without launching a Spark job, which means that the latency is
significantly increased for small jobs (but not for large ones). You can
try enabling local execution and seeing if your problem goes away.
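
If I recall correctly, the relevant flag is spark.localExecution.enabled
(please verify against the docs for your version), e.g.:

  # spark-defaults.conf, or --conf on spark-submit
  spark.localExecution.enabled  true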

On Wed, Feb 18, 2015 at 5:10 PM, Matt Cheah  wrote:

> I actually tested Spark 1.2.0 with the code in the rdd.take() method
> swapped out for what was in Spark 1.0.2. The run time was still slower,
> which indicates to me something at work lower in the stack.
>
> -Matt Cheah
>
> On 2/18/15, 4:54 PM, "Patrick Wendell"  wrote:
>
> >I believe the heuristic governing the way that take() decides to fetch
> >partitions changed between these versions. It could be that in certain
> >cases the new heuristic is worse, but it might be good to just look at
> >the source code and see, for your number of elements taken and number
> >of partitions, if there was any effective change in how aggressively
> >spark fetched partitions.
> >
> >This was quite a while ago, but I think the change was made because in
> >many cases the newer code works more efficiently.
> >
> >- Patrick
> >
> >On Wed, Feb 18, 2015 at 4:47 PM, Matt Cheah  wrote:
> >> Hi everyone,
> >>
> >> Between Spark 1.0.2 and Spark 1.1.1, I have noticed that rdd.take()
> >> consistently has a slower execution time on the later release. I was
> >> wondering if anyone else has had similar observations.
> >>
> >> I have two setups where this reproduces. The first is a local test. I
> >> launched a spark cluster with 4 worker JVMs on my Mac, and launched a
> >> Spark-Shell. I retrieved the text file and immediately called
> >>rdd.take(N) on
> >> it, where N varied. The RDD is a plaintext CSV, 4GB in size, split over
> >>8
> >> files, which ends up having 128 partitions, and a total of 8000
> >>rows.
> >> The numbers I discovered between Spark 1.0.2 and Spark 1.1.1 are, with
> >>all
> >> numbers being in seconds:
> >>
> >> 1 items
> >>
> >> Spark 1.0.2: 0.069281, 0.012261, 0.011083
> >>
> >> Spark 1.1.1: 0.11577, 0.097636, 0.11321
> >>
> >>
> >> 4 items
> >>
> >> Spark 1.0.2: 0.023751, 0.069365, 0.023603
> >>
> >> Spark 1.1.1: 0.224287, 0.229651, 0.158431
> >>
> >>
> >> 10 items
> >>
> >> Spark 1.0.2: 0.047019, 0.049056, 0.042568
> >>
> >> Spark 1.1.1: 0.353277, 0.288965, 0.281751
> >>
> >>
> >> 40 items
> >>
> >> Spark 1.0.2: 0.216048, 0.198049, 0.796037
> >>
> >> Spark 1.1.1: 1.865622, 2.224424, 2.037672
> >>
> >> This small test suite indicates a consistently reproducible performance
> >> regression.
> >>
> >>
> >> I also notice this on a larger scale test. The cluster used is on EC2:
> >>
> >> ec2 instance type: m2.4xlarge
> >> 10 slaves, 1 master
> >> ephemeral storage
> >> 70 cores, 50 GB/box
> >>
> >> In this case, I have a 100GB dataset split into 78 files totally 350
> >>million
> >> items, and I take the first 50,000 items from the RDD. In this case, I
> >>have
> >> tested this on different formats of the raw data.
> >>
> >> With plaintext files:
> >>
> >> Spark 1.0.2: 0.422s, 0.363s, 0.382s
> >>
> >> Spark 1.1.1: 4.54s, 1.28s, 1.221s, 1.13s
> >>
> >>
> >> With snappy-compressed Avro files:
> >>
> >> Spark 1.0.2: 0.73s, 0.395s, 0.426s
> >>
> >> Spark 1.1.1: 4.618s, 1.81s, 1.158s, 1.333s
> >>
> >> Again demonstrating a reproducible performance regression.
> >>
> >> I was wondering if anyone else observed this regression, and if so, if
> >> anyone would have any idea what could possibly have caused it between
> >>Spark
> >> 1.0.2 and Spark 1.1.1?
> >>
> >> Thanks,
> >>
> >> -Matt Cheah
>


Re: Custom Cluster Managers / Standalone Recovery Mode in Spark

2015-02-01 Thread Aaron Davidson
For the specific question of supplementing Standalone Mode with a custom
leader election protocol, this was actually already committed in master and
will be available in Spark 1.3:

https://github.com/apache/spark/pull/771/files

You can specify spark.deploy.recoveryMode = "CUSTOM"
and spark.deploy.recoveryMode.factory to a class which
implements StandaloneRecoveryModeFactory. See the current implementations
of FileSystemRecoveryModeFactory and ZooKeeperRecoveryModeFactory.
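
A minimal sketch of the corresponding configuration (the factory class name is
a placeholder for your Hazelcast-based implementation):

  # spark-defaults.conf / SparkConf on the standalone Masters
  spark.deploy.recoveryMode          CUSTOM
  spark.deploy.recoveryMode.factory  com.example.HazelcastRecoveryModeFactory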

I will update the JIRA you linked to be more current.

On Sat, Jan 31, 2015 at 12:55 AM, Anjana Fernando 
wrote:

> Hi everyone,
>
> I've been experimenting, and somewhat of a newbie for Spark. I was
> wondering, if there is any way, that I can use a custom cluster manager
> implementation with Spark. Basically, as I understood, at the moment, the
> inbuilt modes supported are with standalone, Mesos and  Yarn. My
> requirement is basically a simple clustering solution with high
> availability of the master. I don't want to use a separate Zookeeper
> cluster, since this would complicate my deployment, but rather, I would
> like to use something like Hazelcast, which has a peer-to-peer cluster
> coordination implementation.
>
> I found that, there is already this JIRA [1], which requests for a custom
> persistence engine, I guess for storing state information. So basically,
> what I would want to do is, use Hazelcast to use for leader election, to
> make an existing node the master, and to lookup the state information from
> the distributed memory. Appreciate any help on how to archive this. And if
> it useful for a wider audience, hopefully I can contribute this back to the
> project.
>
> [1] https://issues.apache.org/jira/browse/SPARK-1180
>
> Cheers,
> Anjana.
>


Re: Semantics of LGTM

2015-01-17 Thread Aaron Davidson
I think I've seen something like +2 = "strong LGTM" and +1 = "weak LGTM;
someone else should review" before. It's nice to have a shortcut which
isn't a sentence when talking about weaker forms of LGTM.

On Sat, Jan 17, 2015 at 6:59 PM,  wrote:

> I think clarifying these semantics is definitely worthwhile. Maybe this
> complicates the process with additional terminology, but the way I've used
> these has been:
>
> +1 - I think this is safe to merge and, barring objections from others,
> would merge it immediately.
>
> LGTM - I have no concerns about this patch, but I don't necessarily feel
> qualified to make a final call about it.  The TM part acknowledges the
> judgment as a little more subjective.
>
> I think having some concise way to express both of these is useful.
>
> -Sandy
>
> > On Jan 17, 2015, at 5:40 PM, Patrick Wendell  wrote:
> >
> > Hey All,
> >
> > Just wanted to ping about a minor issue - but one that ends up having
> > consequence given Spark's volume of reviews and commits. As much as
> > possible, I think that we should try and gear towards "Google Style"
> > LGTM on reviews. What I mean by this is that LGTM has the following
> > semantics:
> >
> > "I know this code well, or I've looked at it close enough to feel
> > confident it should be merged. If there are issues/bugs with this code
> > later on, I feel confident I can help with them."
> >
> > Here is an alternative semantic:
> >
> > "Based on what I know about this part of the code, I don't see any
> > show-stopper problems with this patch".
> >
> > The issue with the latter is that it ultimately erodes the
> > significance of LGTM, since subsequent reviewers need to reason about
> > what the person meant by saying LGTM. In contrast, having strong
> > semantics around LGTM can help streamline reviews a lot, especially as
> > reviewers get more experienced and gain trust from the comittership.
> >
> > There are several easy ways to give a more limited endorsement of a
> patch:
> > - "I'm not familiar with this code, but style, etc look good" (general
> > endorsement)
> > - "The build changes in this code LGTM, but I haven't reviewed the
> > rest" (limited LGTM)
> >
> > If people are okay with this, I might add a short note on the wiki.
> > I'm sending this e-mail first, though, to see whether anyone wants to
> > express agreement or disagreement with this approach.
> >
> > - Patrick
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: OutOfMemoryError on parquet SnappyDecompressor

2014-09-23 Thread Aaron Davidson
This may be related: https://github.com/Parquet/parquet-mr/issues/211

Perhaps if we change our configuration settings for Parquet it would get
better, but the performance characteristics of Snappy are pretty bad here
under some circumstances.
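
In the meantime, switching the codec back to gzip is a one-line setting; a
sketch (the SQLConf key is spark.sql.parquet.compression.codec):

  // Scala shell or application; equivalently: SET spark.sql.parquet.compression.codec=gzip
  sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")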

On Tue, Sep 23, 2014 at 10:13 AM, Cody Koeninger  wrote:

> Cool, that's pretty much what I was thinking as far as configuration goes.
>
> Running on Mesos.  Worker nodes are amazon xlarge, so 4 core / 15g.  I've
> tried executor memory sizes as high as 6G
> Default hdfs block size 64m, about 25G of total data written by a job with
> 128 partitions.  The exception comes when trying to read the data (all
> columns).
>
> Schema looks like this:
>
> case class A(
>   a: Long,
>   b: Long,
>   c: Byte,
>   d: Option[Long],
>   e: Option[Long],
>   f: Option[Long],
>   g: Option[Long],
>   h: Option[Int],
>   i: Long,
>   j: Option[Int],
>   k: Seq[Int],
>   l: Seq[Int],
>   m: Seq[Int]
> )
>
> We're just going back to gzip for now, but might be nice to help someone
> else avoid running into this.
>
> On Tue, Sep 23, 2014 at 11:18 AM, Michael Armbrust  >
> wrote:
>
> > I actually submitted a patch to do this yesterday:
> > https://github.com/apache/spark/pull/2493
> >
> > Can you tell us more about your configuration.  In particular how much
> > memory/cores do the executors have and what does the schema of your data
> > look like?
> >
> > On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger 
> > wrote:
> >
> >> So as a related question, is there any reason the settings in SQLConf
> >> aren't read from the spark context's conf?  I understand why the sql
> conf
> >> is mutable, but it's not particularly user friendly to have most spark
> >> configuration set via e.g. defaults.conf or --properties-file, but for
> >> spark sql to ignore those.
> >>
> >> On Mon, Sep 22, 2014 at 4:34 PM, Cody Koeninger 
> >> wrote:
> >>
> >> > After commit 8856c3d8 switched from gzip to snappy as default parquet
> >> > compression codec, I'm seeing the following when trying to read
> parquet
> >> > files saved using the new default (same schema and roughly same size
> as
> >> > files that were previously working):
> >> >
> >> > java.lang.OutOfMemoryError: Direct buffer memory
> >> > java.nio.Bits.reserveMemory(Bits.java:658)
> >> > java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
> >> > java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
> >> >
> >> >
> >>
> parquet.hadoop.codec.SnappyDecompressor.setInput(SnappyDecompressor.java:99)
> >> >
> >> >
> >>
> parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:43)
> >> > java.io.DataInputStream.readFully(DataInputStream.java:195)
> >> > java.io.DataInputStream.readFully(DataInputStream.java:169)
> >> >
> >> >
> >>
> parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:201)
> >> >
> >> >
> parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:521)
> >> >
> >> >
> >>
> parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:493)
> >> >
> >> >
> parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:546)
> >> >
> >> > parquet.column.impl.ColumnReaderImpl.(ColumnReaderImpl.java:339)
> >> >
> >> >
> >>
> parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:63)
> >> >
> >> >
> >>
> parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:58)
> >> >
> >> >
> >>
> parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:265)
> >> >
> >>  parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:60)
> >> >
> >>  parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:74)
> >> >
> >> >
> >>
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:110)
> >> >
> >> >
> >>
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:172)
> >> >
> >> >
> >>
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:130)
> >> >
> >> >
> >>
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)
> >> >
> >> >
> >>
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> >> > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >> > scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> >> > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >> > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> >> > scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
> >> > scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
> >> >
> >> >
> >>
> org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:220)
> >> >
> >> >
> >>
> org.apache.spark.sql.execution.ExistingRdd$$anonfun$productToRowRdd$1.apply(basicOperators.scala:219)
> >> > org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.
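
For anyone hitting the same direct-buffer OOM, here is a minimal sketch of pinning the Parquet codec back to gzip from user code. The property name is the one documented in later Spark SQL releases, so treat it as an assumption for the version discussed in this thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hedged sketch: ask Spark SQL to write Parquet with gzip instead of snappy.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("parquet-codec-sketch"))
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")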


Re: Configuring Spark Memory

2014-07-24 Thread Aaron Davidson
More documentation on this would undoubtedly be useful. Many of the
properties changed or were deprecated in Spark 1.0, and I'm not sure our
current informal documentation via the user lists is up to par, since many
of the earlier suggestions no longer apply.


On Thu, Jul 24, 2014 at 10:14 AM, Martin Goodson 
wrote:

> Great - thanks for the clarification Aaron. The offer stands for me to
> write some documentation and an example that covers this without leaving
> *any* room for ambiguity.
>
>
>
>
> --
> Martin Goodson  |  VP Data Science
> (0)20 3397 1240
> [image: Inline image 1]
>
>
> On Thu, Jul 24, 2014 at 6:09 PM, Aaron Davidson 
> wrote:
>
>> Whoops, I was mistaken in my original post last year. By default, there
>> is one executor per node per Spark Context, as you said.
>> "spark.executor.memory" is the amount of memory that the application
>> requests for each of its executors. SPARK_WORKER_MEMORY is the amount of
>> memory a Spark Worker is willing to allocate in executors.
>>
>> So if you were to set SPARK_WORKER_MEMORY to 8g everywhere on your
>> cluster, and spark.executor.memory to 4g, you would be able to run 2
>> simultaneous Spark Contexts who get 4g per node. Similarly, if
>> spark.executor.memory were 8g, you could only run 1 Spark Context at a time
>> on the cluster, but it would get all the cluster's memory.
>>
>>
>> On Thu, Jul 24, 2014 at 7:25 AM, Martin Goodson 
>> wrote:
>>
>>> Thank you Nishkam,
>>> I have read your code. So, for the sake of my understanding, it seems
>>> that for each spark context there is one executor per node? Can anyone
>>> confirm this?
>>>
>>>
>>> --
>>> Martin Goodson  |  VP Data Science
>>> (0)20 3397 1240
>>> [image: Inline image 1]
>>>
>>>
>>> On Thu, Jul 24, 2014 at 6:12 AM, Nishkam Ravi 
>>> wrote:
>>>
>>>> See if this helps:
>>>>
>>>> https://github.com/nishkamravi2/SparkAutoConfig/
>>>>
>>>> It's a very simple tool for auto-configuring default parameters in
>>>> Spark. Takes as input high-level parameters (like number of nodes, cores
>>>> per node, memory per node, etc) and spits out default configuration, user
>>>> advice and command line. Compile (javac SparkConfigure.java) and run (java
>>>> SparkConfigure).
>>>>
>>>> Also cc'ing dev in case others are interested in helping evolve this
>>>> over time (by refining the heuristics and adding more parameters).
>>>>
>>>>
>>>>  On Wed, Jul 23, 2014 at 8:31 AM, Martin Goodson 
>>>> wrote:
>>>>
>>>>> Thanks Andrew,
>>>>>
>>>>> So if there is only one SparkContext there is only one executor per
>>>>> machine? This seems to contradict Aaron's message from the link above:
>>>>>
>>>>> "If each machine has 16 GB of RAM and 4 cores, for example, you might
>>>>> set spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by
>>>>> Spark.)"
>>>>>
>>>>> Am I reading this incorrectly?
>>>>>
>>>>> Anyway our configuration is 21 machines (one master and 20 slaves)
>>>>> each with 60Gb. We would like to use 4 cores per machine. This is pyspark
>>>>> so we want to leave say 16Gb on each machine for python processes.
>>>>>
>>>>> Thanks again for the advice!
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Martin Goodson  |  VP Data Science
>>>>> (0)20 3397 1240
>>>>> [image: Inline image 1]
>>>>>
>>>>>
>>>>> On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash 
>>>>> wrote:
>>>>>
>>>>>> Hi Martin,
>>>>>>
>>>>>> In standalone mode, each SparkContext you initialize gets its own set
>>>>>> of executors across the cluster.  So for example if you have two shells
>>>>>> open, they'll each get two JVMs on each worker machine in the cluster.
>>>>>>
>>>>>> As far as the other docs, you can configure the total number of cores
>>>>>> requested for the SparkContext, the amount of memory for the executor JVM
>>>>>> on each machine, the amount of memory for the Master/Worker daemons 
>>>>>> (little
>>>>>> needed since work 

Re: Configuring Spark Memory

2014-07-24 Thread Aaron Davidson
Whoops, I was mistaken in my original post last year. By default, there is
one executor per node per Spark Context, as you said.
"spark.executor.memory" is the amount of memory that the application
requests for each of its executors. SPARK_WORKER_MEMORY is the amount of
memory a Spark Worker is willing to allocate in executors.

So if you were to set SPARK_WORKER_MEMORY to 8g everywhere on your cluster,
and spark.executor.memory to 4g, you would be able to run 2 simultaneous
Spark Contexts who get 4g per node. Similarly, if spark.executor.memory
were 8g, you could only run 1 Spark Context at a time on the cluster, but
it would get all the cluster's memory.
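
A minimal sketch of that split (all values and the master URL below are illustrative assumptions): SPARK_WORKER_MEMORY is an environment variable set in conf/spark-env.sh on each worker, while each application requests its per-executor size via spark.executor.memory.

import org.apache.spark.{SparkConf, SparkContext}

// Assumes conf/spark-env.sh on every worker contains: SPARK_WORKER_MEMORY=8g
// With each application asking for 4g executors, two such applications can
// run side by side on the same workers.
val conf = new SparkConf()
  .setMaster("spark://master:7077")      // illustrative standalone master URL
  .setAppName("memory-config-sketch")
  .set("spark.executor.memory", "4g")    // per-executor request, per application
val sc = new SparkContext(conf)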


On Thu, Jul 24, 2014 at 7:25 AM, Martin Goodson 
wrote:

> Thank you Nishkam,
> I have read your code. So, for the sake of my understanding, it seems that
> for each spark context there is one executor per node? Can anyone confirm
> this?
>
>
> --
> Martin Goodson  |  VP Data Science
> (0)20 3397 1240
> [image: Inline image 1]
>
>
> On Thu, Jul 24, 2014 at 6:12 AM, Nishkam Ravi  wrote:
>
>> See if this helps:
>>
>> https://github.com/nishkamravi2/SparkAutoConfig/
>>
>> It's a very simple tool for auto-configuring default parameters in Spark.
>> Takes as input high-level parameters (like number of nodes, cores per node,
>> memory per node, etc) and spits out default configuration, user advice and
>> command line. Compile (javac SparkConfigure.java) and run (java
>> SparkConfigure).
>>
>> Also cc'ing dev in case others are interested in helping evolve this over
>> time (by refining the heuristics and adding more parameters).
>>
>>
>>  On Wed, Jul 23, 2014 at 8:31 AM, Martin Goodson 
>> wrote:
>>
>>> Thanks Andrew,
>>>
>>> So if there is only one SparkContext there is only one executor per
>>> machine? This seems to contradict Aaron's message from the link above:
>>>
>>> "If each machine has 16 GB of RAM and 4 cores, for example, you might
>>> set spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by
>>> Spark.)"
>>>
>>> Am I reading this incorrectly?
>>>
>>> Anyway our configuration is 21 machines (one master and 20 slaves) each
>>> with 60Gb. We would like to use 4 cores per machine. This is pyspark so we
>>> want to leave say 16Gb on each machine for python processes.
>>>
>>> Thanks again for the advice!
>>>
>>>
>>>
>>> --
>>> Martin Goodson  |  VP Data Science
>>> (0)20 3397 1240
>>> [image: Inline image 1]
>>>
>>>
>>> On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash 
>>> wrote:
>>>
>>>> Hi Martin,
>>>>
>>>> In standalone mode, each SparkContext you initialize gets its own set
>>>> of executors across the cluster.  So for example if you have two shells
>>>> open, they'll each get two JVMs on each worker machine in the cluster.
>>>>
>>>> As far as the other docs, you can configure the total number of cores
>>>> requested for the SparkContext, the amount of memory for the executor JVM
>>>> on each machine, the amount of memory for the Master/Worker daemons (little
>>>> needed since work is done in executors), and several other settings.
>>>>
>>>> Which of those are you interested in?  What spec hardware do you have
>>>> and how do you want to configure it?
>>>>
>>>> Andrew
>>>>
>>>>
>>>> On Wed, Jul 23, 2014 at 6:10 AM, Martin Goodson 
>>>> wrote:
>>>>
>>>>> We are having difficulties configuring Spark, partly because we still
>>>>> don't understand some key concepts. For instance, how many executors are
>>>>> there per machine in standalone mode? This is after having closely
>>>>> read the documentation several times:
>>>>>
>>>>> *http://spark.apache.org/docs/latest/configuration.html
>>>>> <http://spark.apache.org/docs/latest/configuration.html>*
>>>>> *http://spark.apache.org/docs/latest/spark-standalone.html
>>>>> <http://spark.apache.org/docs/latest/spark-standalone.html>*
>>>>> *http://spark.apache.org/docs/latest/tuning.html
>>>>> <http://spark.apache.org/docs/latest/tuning.html>*
>>>>> *http://spark.apache.org/docs/latest/cluster-overview.html
>>>>> <http://spark.apache.org/docs/latest/cluster-overview.html>*
>>>>>
>>>>

Re: ec2 clusters launched at 9fe693b5b6 are broken (?)

2014-07-14 Thread Aaron Davidson
This one is typically due to a mismatch between the Hadoop versions --
i.e., Spark is compiled against 1.0.4 but is running with 2.3.0 in the
classpath, or something like that. Not certain why you're seeing this with
spark-ec2, but I'm assuming this is related to the issues you posted in a
separate thread.


On Mon, Jul 14, 2014 at 6:43 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Just launched an EC2 cluster from git hash
> 9fe693b5b6ed6af34ee1e800ab89c8a11991ea38. Calling take() on an RDD
> accessing data in S3 yields the following error output.
>
> I understand that NoClassDefFoundError errors may mean something in the
> deployment was messed up. Is that correct? When I launch a cluster using
> spark-ec2, I expect all critical deployment details to be taken care of by
> the script.
>
> So is something in the deployment executed by spark-ec2 borked?
>
> Nick
>
> java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
> at
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:224)
> at
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:214)
> at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:176)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
> at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
> at
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
> at org.apache.spark.ShuffleDependency.(Dependency.scala:71)
> at
> org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:79)
> at
> org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:190)
> at
> org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:188)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.dependencies(RDD.scala:188)
> at
> org.apache.spark.scheduler.DAGScheduler.getPreferredLocs(DAGScheduler.scala:1144)
> at
> org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:903)
> at
> org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
> at
> org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
> at
> org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:190)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
> at
> org.apache.spark.rdd.PartitionCoalescer$LocationIterator.<init>(CoalescedRDD.scala:185)
> at
> org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:236)
> at org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:337)
> at
> org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:83)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
> at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:201)
> at org.apache.spark.rdd.RDD.take(RDD.scala:1036)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
> at $iwC$$iwC$$iwC.<init>(<console>:31)
> at $iwC$$iwC.<init>(<console>:33)
> at $iwC.<init>(<console>:35)
>   

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Aaron Davidson
Would you mind filing a JIRA for this? That does sound like something bogus
happening on the JVM/YourKit level, but this sort of diagnosis is
sufficiently important that we should be resilient against it.


On Mon, Jul 14, 2014 at 6:01 PM, Will Benton  wrote:

> - Original Message -
> > From: "Aaron Davidson" 
> > To: dev@spark.apache.org
> > Sent: Monday, July 14, 2014 5:21:10 PM
> > Subject: Re: Profiling Spark tests with YourKit (or something else)
> >
> > Out of curiosity, what problems are you seeing with Utils.getCallSite?
>
> Aaron, if I enable call site tracking or CPU profiling in YourKit, many
> (but not all) Spark test cases will NPE on the line filtering out
> "getStackTrace" from the stack trace (this is Utils.scala:812 in the
> current master).  I'm not sure if this is a consequence of
> Thread#getStackTrace including bogus frames when running instrumented or if
> whatever instrumentation YourKit inserts relies on assumptions that don't
> always hold for Scala code.
>
>
> best,
> wb
>


Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Aaron Davidson
One of the core problems here is the number of open streams we have, which
is (# cores * # reduce partitions), which can easily climb into the tens of
thousands for large jobs. This is a more general problem that we are
planning on fixing for our largest shuffles, as even moderate buffer sizes
can explode to use huge amounts of memory at that scale.
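
For a rough sense of scale, a back-of-the-envelope sketch (every figure below is an illustrative assumption, not a measurement):

// Buffer memory held by concurrent shuffle streams on a single executor.
val coresPerExecutor = 16
val reducePartitions = 2000
val bufferBytesPerStream = 64 * 1024                  // assume a 64 KB buffer per open stream
val openStreams = coresPerExecutor * reducePartitions // 32,000 concurrent streams
val totalBytes = openStreams.toLong * bufferBytesPerStream
println(s"$openStreams streams -> ${totalBytes / (1024 * 1024)} MB of buffers") // ~2000 MB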


On Mon, Jul 14, 2014 at 4:53 PM, Jon Hartlaub  wrote:

> Is the held memory due to just instantiating the LZFOutputStream?  If so,
> I'm surprised, and I consider that a bug.
>
> I suspect the held memory may be due to a SoftReference - memory will be
> released with enough memory pressure.
>
> Finally, is it necessary to keep 1000 (or more) decoders active?  Would it
> be possible to keep an object pool of encoders and check them in and out as
> needed?  I admit I have not done much homework to determine if this is
> viable.
>
> -Jon
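
A generic sketch of the check-in/check-out pooling idea raised above (purely illustrative; the pool class and the byte-array "encoder" stand-in are not Spark or Ning code):

import java.util.concurrent.ConcurrentLinkedQueue

// Objects are checked out for the duration of one write and returned afterwards,
// so the number of live instances tracks concurrent use rather than the total
// number of open blocks.
class SimplePool[T](factory: () => T) {
  private val pool = new ConcurrentLinkedQueue[T]()
  def acquire(): T = Option(pool.poll()).getOrElse(factory())
  def release(obj: T): Unit = { pool.offer(obj); () }
}

val bufferPool = new SimplePool(() => new Array[Byte](64 * 1024)) // stand-in for an encoder and its buffer
val buf = bufferPool.acquire()
// ... compress one block using buf ...
bufferPool.release(buf)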
>
>
> On Mon, Jul 14, 2014 at 4:08 PM, Reynold Xin  wrote:
>
> > Copying Jon here since he worked on the lzf library at Ning.
> >
> > Jon - any comments on this topic?
> >
> >
> > On Mon, Jul 14, 2014 at 3:54 PM, Matei Zaharia 
> > wrote:
> >
> >> You can actually turn off shuffle compression by setting
> >> spark.shuffle.compress to false. Try that out, there will still be some
> >> buffers for the various OutputStreams, but they should be smaller.
> >>
> >> Matei
> >>
> >> On Jul 14, 2014, at 3:30 PM, Stephen Haberman <
> stephen.haber...@gmail.com>
> >> wrote:
> >>
> >> >
> >> > Just a comment from the peanut gallery, but these buffers are a real
> >> > PITA for us as well. Probably 75% of our non-user-error job failures
> >> > are related to them.
> >> >
> >> > Just naively, what about not doing compression on the fly? E.g. during
> >> > the shuffle just write straight to disk, uncompressed?
> >> >
> >> > For us, we always have plenty of disk space, and if you're concerned
> >> > about network transmission, you could add a separate compress step
> >> > after the blocks have been written to disk, but before being sent over
> >> > the wire.
> >> >
> >> > Granted, IANAE, so perhaps this is a bad idea; either way, awesome to
> >> > see work in this area!
> >> >
> >> > - Stephen
> >> >
> >>
> >>
> >
>


Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Aaron Davidson
The patch won't solve the problem where two people try to add a
configuration option at the same time, but I think there is currently an
issue where two people can try to initialize the Configuration at the same
time and still run into a ConcurrentModificationException. This at least
reduces (slightly) the scope of the exception although eliminating it may
not be possible.
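
A minimal sketch of the hazard and the style of mitigation being discussed (the wrapper and lock below are illustrative, not the actual code in the PR):

import org.apache.hadoop.conf.Configuration

// A Hadoop Configuration is backed by non-thread-safe collections, so two task
// threads calling set() on a shared instance can race and throw
// ConcurrentModificationException. Funnelling mutation through a single lock is
// the kind of narrowing described above.
object SharedConf {
  private val lock = new Object
  val conf = new Configuration()

  def setOption(key: String, value: String): Unit = lock.synchronized {
    conf.set(key, value)
  }
}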


On Mon, Jul 14, 2014 at 4:35 PM, Gary Malouf  wrote:

> We'll try to run a build tomorrow AM.
>
>
> On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell 
> wrote:
>
> > Andrew and Gary,
> >
> > Would you guys be able to test
> > https://github.com/apache/spark/pull/1409/files and see if it solves
> > your problem?
> >
> > - Patrick
> >
> > On Mon, Jul 14, 2014 at 4:18 PM, Andrew Ash 
> wrote:
> > > I observed a deadlock here when using the AvroInputFormat as well. The
> > > short of the issue is that there's one configuration object per JVM,
> but
> > > multiple threads, one for each task. If each thread attempts to add a
> > > configuration option to the Configuration object at once you get issues
> > > because HashMap isn't thread safe.
> > >
> > > More details to come tonight. Thanks!
> > > On Jul 14, 2014 4:11 PM, "Nishkam Ravi"  wrote:
> > >
> > >> Hi Patrick, I'm not aware of another place where the access happens,
> but
> > >> it's possible that it does. The original fix synchronized on the
> > >> broadcastConf object and someone reported the same exception.
> > >>
> > >>
> > >> On Mon, Jul 14, 2014 at 3:57 PM, Patrick Wendell 
> > >> wrote:
> > >>
> > >> > Hey Nishkam,
> > >> >
> > >> > Aaron's fix should prevent two concurrent accesses to getJobConf
> (and
> > >> > the Hadoop code therein). But if there is code elsewhere that tries
> to
> > >> > mutate the configuration, then I could see how we might still have
> the
> > >> > ConcurrentModificationException.
> > >> >
> > >> > I looked at your patch for HADOOP-10456 and the only example you
> give
> > >> > is of the data being accessed inside of getJobConf. Is it accessed
> > >> > somewhere else too from Spark that you are aware of?
> > >> >
> > >> > https://issues.apache.org/jira/browse/HADOOP-10456
> > >> >
> > >> > - Patrick
> > >> >
> > >> > On Mon, Jul 14, 2014 at 3:28 PM, Nishkam Ravi 
> > >> wrote:
> > >> > > Hi Aaron, I'm not sure if synchronizing on an arbitrary lock
> object
> > >> would
> > >> > > help. I suspect we will start seeing the
> > >> ConcurrentModificationException
> > >> > > again. The right fix has gone into Hadoop through 10456.
> > >> Unfortunately, I
> > >> > > don't have any bright ideas on how to synchronize this at the
> Spark
> > >> level
> > >> > > without the risk of deadlocks.
> > >> > >
> > >> > >
> > >> > > On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson <
> ilike...@gmail.com
> > >
> > >> > wrote:
> > >> > >
> > >> > >> The full jstack would still be useful, but our current working
> > theory
> > >> is
> > >> > >> that this is due to the fact that Configuration#loadDefaults goes
> > >> > through
> > >> > >> every Configuration object that was ever created (via
> > >> > >> Configuration.REGISTRY) and locks it, thus introducing a
> dependency
> > >> from
> > >> > >> new Configuration to old, otherwise unrelated, Configuration
> > objects
> > >> > that
> > >> > >> our locking did not anticipate.
> > >> > >>
> > >> > >> I have created https://github.com/apache/spark/pull/1409 to
> > hopefully
> > >> > fix
> > >> > >> this bug.
> > >> > >>
> > >> > >>
> > >> > >> On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell <
> > pwend...@gmail.com>
> > >> > >> wrote:
> > >> > >>
> > >> > >> > Hey Cody,
> > >> > >> >
> > >> > >> > This Jstack seems truncated, would you min

Re: Profiling Spark tests with YourKit (or something else)

2014-07-14 Thread Aaron Davidson
Out of curiosity, what problems are you seeing with Utils.getCallSite?


On Mon, Jul 14, 2014 at 2:59 PM, Will Benton  wrote:

> Thanks, Matei; I have also had some success with jmap and friends and will
> probably just stick with them!
>
>
> best,
> wb
>
>
> - Original Message -
> > From: "Matei Zaharia" 
> > To: dev@spark.apache.org
> > Sent: Monday, July 14, 2014 1:02:04 PM
> > Subject: Re: Profiling Spark tests with YourKit (or something else)
> >
> > I haven't seen issues using the JVM's own tools (jstack, jmap, hprof and
> > such), so maybe there's a problem in YourKit or in your release of the
> JVM.
> > Otherwise I'd suggest increasing the heap size of the unit tests a bit
> (you
> > can do this in the SBT build file). Maybe they are very close to full and
> > profiling pushes them over the edge.
> >
> > Matei
> >
> > On Jul 14, 2014, at 9:51 AM, Will Benton  wrote:
> >
> > > Hi all,
> > >
> > > I've been evaluating YourKit and would like to profile the heap and CPU
> > > usage of certain tests from the Spark test suite.  In particular, I'm
> very
> > > interested in tracking heap usage by allocation site.  Unfortunately, I
> > > get a lot of crashes running Spark tests with profiling (and thus
> > > allocation-site tracking) enabled in YourKit; just using the sampler
> works
> > > fine, but it appears that enabling the profiler breaks
> Utils.getCallSite.
> > >
> > > Is there a way to make this combination work?  If not, what are people
> > > using to understand the memory and CPU behavior of Spark and Spark
> apps?
> > >
> > >
> > > thanks,
> > > wb
> >
> >
>


Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

2014-07-14 Thread Aaron Davidson
The full jstack would still be useful, but our current working theory is
that this is due to the fact that Configuration#loadDefaults goes through
every Configuration object that was ever created (via
Configuration.REGISTRY) and locks it, thus introducing a dependency from
new Configuration to old, otherwise unrelated, Configuration objects that
our locking did not anticipate.

I have created https://github.com/apache/spark/pull/1409 to hopefully fix
this bug.
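
A contrived illustration of the lock-ordering cycle visible in the jstack quoted below (plain lock objects stand in for the Configuration instance and the class-level lock; this is not Spark or Hadoop code):

// Thread a takes lockX then wants lockY; thread b takes lockY then wants lockX.
val lockX = new Object // stands in for the java.lang.Class lock
val lockY = new Object // stands in for a particular Configuration object

val a = new Thread(new Runnable {
  def run(): Unit = lockX.synchronized { Thread.sleep(100); lockY.synchronized { () } }
})
val b = new Thread(new Runnable {
  def run(): Unit = lockY.synchronized { Thread.sleep(100); lockX.synchronized { () } }
})
a.start(); b.start() // with this timing, both threads block forever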


On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell  wrote:

> Hey Cody,
>
> This Jstack seems truncated, would you mind giving the entire stack
> trace? For the second thread, for instance, we can't see where the
> lock is being acquired.
>
> - Patrick
>
> On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
>  wrote:
> > Hi all, just wanted to give a heads up that we're seeing a reproducible
> > deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
> >
> > If jira is a better place for this, apologies in advance - figured
> talking
> > about it on the mailing list was friendlier than randomly (re)opening
> jira
> > tickets.
> >
> > I know Gary had mentioned some issues with 1.0.1 on the mailing list,
> once
> > we got a thread dump I wanted to follow up.
> >
> > The thread dump shows the deadlock occurs in the synchronized block of
> code
> > that was changed in HadoopRDD.scala, for the Spark-1097 issue
> >
> > Relevant portions of the thread dump are summarized below, we can provide
> > the whole dump if it's useful.
> >
> > Found one Java-level deadlock:
> > =
> > "Executor task launch worker-1":
> >   waiting to lock monitor 0x7f250400c520 (object 0xfae7dc30,
> > a org.apache.hadoop.conf.Configuration),
> >   which is held by "Executor task launch worker-0"
> > "Executor task launch worker-0":
> >   waiting to lock monitor 0x7f2520495620 (object 0xfaeb4fc8,
> > a java.lang.Class),
> >   which is held by "Executor task launch worker-1"
> >
> >
> > "Executor task launch worker-1":
> > at
> >
> org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
> > - waiting to lock <0xfae7dc30> (a
> > org.apache.hadoop.conf.Configuration)
> > at
> >
> org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
> > - locked <0xfaca6ff8> (a java.lang.Class for
> > org.apache.hadoop.conf.Configuration)
> > at
> >
> org.apache.hadoop.hdfs.HdfsConfiguration.<init>(HdfsConfiguration.java:34)
> > at
> >
> org.apache.hadoop.hdfs.DistributedFileSystem.<init>(DistributedFileSystem.java:110
> > )
> > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > Method)
> > at
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
> > java:57)
> > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > Method)
> > at
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
> > java:57)
> > at
> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAcces
> > sorImpl.java:45)
> > at
> java.lang.reflect.Constructor.newInstance(Constructor.java:525)
> > at java.lang.Class.newInstance0(Class.java:374)
> > at java.lang.Class.newInstance(Class.java:327)
> > at
> java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
> > at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
> > at
> > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
> > - locked <0xfaeb4fc8> (a java.lang.Class for
> > org.apache.hadoop.fs.FileSystem)
> > at
> > org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
> > at
> > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
> > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
> > at
> > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
> > at
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
> > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
> > at
> > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
> > at
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
> > at
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
> > at
> > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> > at
> > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> > at
> >
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
> >
> >
> >
> > ...elided...
> >
> >
> > "Executor task launch worker-0" daemon prio=10 tid=0x01e71800
> > nid=0x2d97 wai

Re: ExecutorState.LOADING?

2014-07-09 Thread Aaron Davidson
Agreed that the behavior of the Master killing off an Application when
Executors from the same set of nodes repeatedly die is silly. This can also
strike if a single node enters a state where any Executor created on it
quickly dies (e.g., a block device becomes faulty). This prevents the
Application from launching despite only one node being bad.


On Wed, Jul 9, 2014 at 3:08 PM, Mark Hamstra 
wrote:

> Actually, I'm thinking about re-purposing it.  There's a nasty behavior
> that I'll open a JIRA for soon, and that I'm thinking about addressing by
> introducing/using another ExecutorState transition.  The basic problem is
> that Master can be overly aggressive in calling removeApplication on
> ExecutorStateChanged.  For example, say you have a working, long-running
> Spark stand-alone-mode application and then try to add some more worker
> nodes, but manage to misconfigure the new nodes so that on the new nodes
> Executors never successfully start.  In that scenario, you will repeatedly
> end up in the !normalExit branch of Master's receive ExecutorStateChanged,
> quickly exceed ApplicationState.MAX_NUM_RETRY (a non-configurable 10, which
> is another irritation), and end up having your application killed off even
> though it is still running successfully on the old worker nodes.
>
>
>
> On Wed, Jul 9, 2014 at 2:49 PM, Kay Ousterhout 
> wrote:
>
> > Git history to the rescue!  It seems to have been added by Matei way back
> > in July 2012:
> >
> >
> https://github.com/apache/spark/commit/5d1a887bed8423bd6c25660910d18d91880e01fe
> >
> > and then was removed a few months later (replaced by RUNNING) by the same
> > Mr. Zaharia:
> >
> >
> https://github.com/apache/spark/commit/bb1bce79240da22c2677d9f8159683cdf73158c2#diff-776a630ac2b2ec5fe85c07ca20a58fc0
> >
> > So I'd say it's safe to delete it.
> >
> >
> > On Wed, Jul 9, 2014 at 2:36 PM, Mark Hamstra 
> > wrote:
> >
> > > Doesn't look to me like this is used.  Does anybody recall what it was
> > > intended for?
> > >
> >
>


Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-08 Thread Aaron Davidson
Shark's in-memory format is already serialized (it's compressed and
column-based).


On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan 
wrote:

> You are ignoring serde costs :-)
>
> - Mridul
>
> On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson  wrote:
> > Tachyon should only be marginally less performant than memory_only,
> because
> > we mmap the data from Tachyon's ramdisk. We do not have to, say, transfer
> > the data over a pipe from Tachyon; we can directly read from the buffers
> in
> > the same way that Shark reads from its in-memory columnar format.
> >
> >
> >
> > On Tue, Jul 8, 2014 at 1:18 AM, qingyang li 
> > wrote:
> >
> >> hi, when i create a table, i can point the cache strategy using
> >> shark.cache.
> >> i think "shark.cache=memory_only" means data are managed by spark and
> >> live in the same jvm as the executor, while "shark.cache=tachyon" means
> >> data are managed by tachyon, which is off heap, so data are not in the
> >> same jvm as the executor and spark will load data from tachyon for each
> >> sql query. so, is tachyon less efficient than the memory_only cache
> >> strategy?
> >> if yes, can we let spark load all data once from tachyon for all sql
> >> queries if i want to use the tachyon cache strategy, since tachyon is
> >> more HA than memory_only?
> >>
>


Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-08 Thread Aaron Davidson
Tachyon should only be marginally less performant than memory_only, because
we mmap the data from Tachyon's ramdisk. We do not have to, say, transfer
the data over a pipe from Tachyon; we can directly read from the buffers in
the same way that Shark reads from its in-memory columnar format.



On Tue, Jul 8, 2014 at 1:18 AM, qingyang li 
wrote:

> hi, when i create a table, i can point the cache strategy using
> shark.cache.
> i think "shark.cache=memory_only" means data are managed by spark and live
> in the same jvm as the executor, while "shark.cache=tachyon" means data are
> managed by tachyon, which is off heap, so data are not in the same jvm as
> the executor and spark will load data from tachyon for each sql query. so,
> is tachyon less efficient than the memory_only cache strategy?
> if yes, can we let spark load all data once from tachyon for all sql
> queries if i want to use the tachyon cache strategy, since tachyon is more
> HA than memory_only?
>
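
For reference, a hedged sketch of the kind of DDL the question refers to; the TBLPROPERTIES syntax is recalled from Shark's documentation, so verify it against the Shark version in use:

// Illustrative only: the table property selects the cache strategy discussed above.
val createTachyonCachedTable =
  """CREATE TABLE my_table_cached
    |TBLPROPERTIES ("shark.cache" = "tachyon")
    |AS SELECT * FROM my_table""".stripMargin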


Re: task always lost

2014-07-03 Thread Aaron Davidson
The issue you're seeing is not the same as the one you linked to -- your
serialized task sizes are very small, and Mesos fine-grained mode doesn't
use Akka anyway.

The error log you printed seems to be from some sort of Mesos logs, but do
you happen to have the logs from the actual executors themselves? These
should be Spark logs which hopefully show the actual Exception (or lack
thereof) before the executors die.

The tasks are dying very quickly, so this is probably either related to
your application logic throwing some sort of fatal JVM error or due to your
Mesos setup. I'm not sure if that "Failed to fetch URIs for container" is
fatal or not.


On Wed, Jul 2, 2014 at 2:44 AM, qingyang li 
wrote:

> executors always get removed.
>
> someone else encountered the same issue:
> https://groups.google.com/forum/#!topic/spark-users/-mYn6BF-Y5Y
>
> -
> 14/07/02 17:41:16 INFO storage.BlockManagerMasterActor: Trying to remove
> executor 20140616-104524-1694607552-5050-26919-1 from BlockManagerMaster.
> 14/07/02 17:41:16 INFO storage.BlockManagerMaster: Removed
> 20140616-104524-1694607552-5050-26919-1 successfully in removeExecutor
> 14/07/02 17:41:16 DEBUG spark.MapOutputTrackerMaster: Increasing epoch to
> 10
> 14/07/02 17:41:16 INFO scheduler.DAGScheduler: Host gained which was in
> lost list earlier: bigdata001
> 14/07/02 17:41:16 DEBUG scheduler.TaskSchedulerImpl: parentName: , name:
> TaskSet_0, runningTasks: 0
> 14/07/02 17:41:16 DEBUG scheduler.TaskSchedulerImpl: parentName: , name:
> TaskSet_0, runningTasks: 0
> 14/07/02 17:41:16 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID
> 12 on executor 20140616-143932-1694607552-5050-4080-3: bigdata004
> (NODE_LOCAL)
> 14/07/02 17:41:16 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as
> 10785 bytes in 1 ms
> 14/07/02 17:41:16 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID
> 13 on executor 20140616-104524-1694607552-5050-26919-3: bigdata002
> (NODE_LOCAL
>
>
> 2014-07-02 12:01 GMT+08:00 qingyang li :
>
> > also this one is in the warning log:
> >
> > E0702 11:35:08.869998 17840 slave.cpp:2310] Container
> > 'af557235-2d5f-4062-aaf3-a747cb3cd0d1' for executor
> > '20140616-104524-1694607552-5050-26919-1' of framework
> > '20140702-113428-1694607552-5050-17766-' failed to start: Failed to
> > fetch URIs for container 'af557235-2d5f-4062-aaf3-a747cb3cd0d1': exit
> > status 32512
> >
> >
> > 2014-07-02 11:46 GMT+08:00 qingyang li :
> >
> > Here is the log:
> >>
> >> E0702 10:32:07.599364 14915 slave.cpp:2686] Failed to unmonitor
> container
> >> for executor 20140616-104524-1694607552-5050-26919-1 of framework
> >> 20140702-102939-1694607552-5050-14846-: Not monitored
> >>
> >>
> >> 2014-07-02 1:45 GMT+08:00 Aaron Davidson :
> >>
> >> Can you post the logs from any of the dying executors?
> >>>
> >>>
> >>> On Tue, Jul 1, 2014 at 1:25 AM, qingyang li 
> >>> wrote:
> >>>
> >>> > i am using mesos0.19 and spark0.9.0 ,  the mesos cluster is started,
> >>> when I
> >>> > using spark-shell to submit one job, the tasks always lost.  here is
> >>> the
> >>> > log:
> >>> > --
> >>> > 14/07/01 16:24:27 INFO DAGScheduler: Host gained which was in lost
> list
> >>> > earlier: bigdata005
> >>> > 14/07/01 16:24:27 INFO TaskSetManager: Starting task 0.0:1 as TID
> 4042
> >>> on
> >>> > executor 20140616-143932-1694607552-5050-4080-2: bigdata005
> >>> (PROCESS_LOCAL)
> >>> > 14/07/01 16:24:27 INFO TaskSetManager: Serialized task 0.0:1 as 1570
> >>> bytes
> >>> > in 0 ms
> >>> > 14/07/01 16:24:28 INFO TaskSetManager: Re-queueing tasks for
> >>> > 20140616-104524-1694607552-5050-26919-1 from TaskSet 0.0
> >>> > 14/07/01 16:24:28 WARN TaskSetManager: Lost TID 4041 (task 0.0:0)
> >>> > 14/07/01 16:24:28 INFO DAGScheduler: Executor lost:
> >>> > 20140616-104524-1694607552-5050-26919-1 (epoch 3427)
> >>> > 14/07/01 16:24:28 INFO BlockManagerMasterActor: Trying to remove
> >>> executor
> >>> > 20140616-104524-1694607552-5050-26919-1 from BlockManagerMaster.
> >>> > 14/07/01 16:24:28 INFO BlockManagerMaster: Removed
> >>> > 20140616-104524-1694607552-5050-26919-1 successfully in
> removeExecutor
> >>> > 14/07/01 16:24:28 INFO TaskSetManager: Re-queueing tasks for
> >>> > 20140616-143932-169

Re: Pass parameters to RDD functions

2014-07-03 Thread Aaron Davidson
Either Serializable works; Scala's Serializable extends Java's (it was
originally intended as a common interface for people who didn't want to run
Scala on a JVM).

Accessing a class field requires the whole class to be serialized along with
the closure. If you declared "val n" inside a method's scope instead, though,
the class itself wouldn't need to be shipped. E.g.:

import org.apache.spark.rdd.RDD

class TextToWordVector(csvData: RDD[Array[String]]) {
  def computeX() = {
    val n = 1  // local val: the closure below captures only n, not the enclosing class
    csvData.map { stringArr => stringArr(n) }.collect()
  }
  lazy val x = computeX()
}

Note that if the class itself doesn't contain many (large) fields, though, it
may not actually be an issue to transfer it around.



On Thu, Jul 3, 2014 at 5:21 AM, Ulanov, Alexander 
wrote:

> Thanks, this works both with Scala and Java Serializable. Which one should
> I use?
>
> Related question: I would like only the particular val to be used instead
> of the whole class, what should I do?
> As far as I understand, the whole class is serialized and transferred
> between nodes (am I right?)
>
> Alexander
>
> -Original Message-
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Thursday, July 03, 2014 3:31 PM
> To: dev@spark.apache.org
> Subject: Re: Pass parameters to RDD functions
>
> Declare this class with "extends Serializable", meaning
> java.io.Serializable?
>
> On Thu, Jul 3, 2014 at 12:24 PM, Ulanov, Alexander <
> alexander.ula...@hp.com> wrote:
> > Hi,
> >
> > I wonder how I can pass parameters to RDD functions with closures. If I
> do it in a following way, Spark crashes with NotSerializableException:
> >
> > class TextToWordVector(csvData:RDD[Array[String]]) {
> >
> >   val n = 1
> >   lazy val x = csvData.map{ stringArr => stringArr(n)}.collect() }
> >
> > Exception:
> > Job aborted due to stage failure: Task not serializable:
> > java.io.NotSerializableException:
> > org.apache.spark.mllib.util.TextToWordVector
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task
> not serializable: java.io.NotSerializableException:
> org.apache.spark.mllib.util.TextToWordVector
> > at
> > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAG
> > Scheduler$$failJobAndIndependentStages(DAGScheduler.scala:1038)
> >
> >
> > This message proposes a workaround, but it didn't work for me:
> > http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAA
> > _qdLrxXzwXd5=6SXLOgSmTTorpOADHjnOXn=tMrOLEJM=f...@mail.gmail.com%3E
> >
> > What is the best practice?
> >
> > Best regards, Alexander
>


Re: task always lost

2014-07-01 Thread Aaron Davidson
Can you post the logs from any of the dying executors?


On Tue, Jul 1, 2014 at 1:25 AM, qingyang li 
wrote:

> i am using mesos0.19 and spark0.9.0 ,  the mesos cluster is started, when I
> using spark-shell to submit one job, the tasks always lost.  here is the
> log:
> --
> 14/07/01 16:24:27 INFO DAGScheduler: Host gained which was in lost list
> earlier: bigdata005
> 14/07/01 16:24:27 INFO TaskSetManager: Starting task 0.0:1 as TID 4042 on
> executor 20140616-143932-1694607552-5050-4080-2: bigdata005 (PROCESS_LOCAL)
> 14/07/01 16:24:27 INFO TaskSetManager: Serialized task 0.0:1 as 1570 bytes
> in 0 ms
> 14/07/01 16:24:28 INFO TaskSetManager: Re-queueing tasks for
> 20140616-104524-1694607552-5050-26919-1 from TaskSet 0.0
> 14/07/01 16:24:28 WARN TaskSetManager: Lost TID 4041 (task 0.0:0)
> 14/07/01 16:24:28 INFO DAGScheduler: Executor lost:
> 20140616-104524-1694607552-5050-26919-1 (epoch 3427)
> 14/07/01 16:24:28 INFO BlockManagerMasterActor: Trying to remove executor
> 20140616-104524-1694607552-5050-26919-1 from BlockManagerMaster.
> 14/07/01 16:24:28 INFO BlockManagerMaster: Removed
> 20140616-104524-1694607552-5050-26919-1 successfully in removeExecutor
> 14/07/01 16:24:28 INFO TaskSetManager: Re-queueing tasks for
> 20140616-143932-1694607552-5050-4080-2 from TaskSet 0.0
> 14/07/01 16:24:28 WARN TaskSetManager: Lost TID 4042 (task 0.0:1)
> 14/07/01 16:24:28 INFO DAGScheduler: Executor lost:
> 20140616-143932-1694607552-5050-4080-2 (epoch 3428)
> 14/07/01 16:24:28 INFO BlockManagerMasterActor: Trying to remove executor
> 20140616-143932-1694607552-5050-4080-2 from BlockManagerMaster.
> 14/07/01 16:24:28 INFO BlockManagerMaster: Removed
> 20140616-143932-1694607552-5050-4080-2 successfully in removeExecutor
> 14/07/01 16:24:28 INFO DAGScheduler: Host gained which was in lost list
> earlier: bigdata005
> 14/07/01 16:24:28 INFO DAGScheduler: Host gained which was in lost list
> earlier: bigdata001
> 14/07/01 16:24:28 INFO TaskSetManager: Starting task 0.0:1 as TID 4043 on
> executor 20140616-143932-1694607552-5050-4080-2: bigdata005 (PROCESS_LOCAL)
> 14/07/01 16:24:28 INFO TaskSetManager: Serialized task 0.0:1 as 1570 bytes
> in 0 ms
> 14/07/01 16:24:28 INFO TaskSetManager: Starting task 0.0:0 as TID 4044 on
> executor 20140616-104524-1694607552-5050-26919-1: bigdata001
> (PROCESS_LOCAL)
> 14/07/01 16:24:28 INFO TaskSetManager: Serialized task 0.0:0 as 1570 bytes
> in 0 ms
>
>
> it seems someone else has also encountered this problem:
>
> http://mail-archives.apache.org/mod_mbox/incubator-mesos-dev/201305.mbox/%3c201305161047069952...@nfs.iscas.ac.cn%3E
>


Re: Eliminate copy while sending data : any Akka experts here ?

2014-06-30 Thread Aaron Davidson
I don't know of any way to avoid Akka doing a copy, but I would like to
mention that it's on the priority list to piggy-back only the map statuses
relevant to a particular map task on the task itself, thus reducing the
total amount of data sent over the wire by a factor of N for N physical
machines in your cluster. Ideally we would also avoid Akka entirely when
sending the tasks, as these can get somewhat large and Akka doesn't work
well with large messages.

Do note that your solution of using broadcast to send the map tasks is very
similar to how the executor returns the result of a task when it's too big
for akka. We were thinking of refactoring this too, as using the block
manager has much higher latency than a direct TCP send.


On Mon, Jun 30, 2014 at 12:13 PM, Mridul Muralidharan 
wrote:

> Our current hack is to use Broadcast variables when serialized
> statuses are above some (configurable) size : and have the workers
> directly pull them from master.
> This is a workaround : so would be great if there was a
> better/principled solution.
>
> Please note that the responses are going to different workers
> requesting for the output statuses for shuffle (after map) - so not
> sure if back pressure buffers, etc would help.
>
>
> Regards,
> Mridul
>
>
> On Mon, Jun 30, 2014 at 11:07 PM, Mridul Muralidharan 
> wrote:
> > Hi,
> >
> >   While sending map output tracker result, the same serialized byte
> > array is sent multiple times - but the akka implementation copies it
> > to a private byte array within ByteString for each send.
> > Caching a ByteString instead of Array[Byte] did not help, since akka
> > does not support special casing ByteString : serializes the
> > ByteString, and copies the result out to an array before creating
> > ByteString out of it (in Array[Byte] serializing is thankfully simply
> > returning same array - so one copy only).
> >
> >
> > Given the need to send immutable data large number of times, is there
> > any way to do it in akka without copying internally in akka ?
> >
> >
> > To see how expensive it is, for 200 nodes withi large number of
> > mappers and reducers, the status becomes something like 30 mb for us -
> > and pulling this about 200 to 300 times results in OOM due to the
> > large number of copies sent out.
> >
> >
> > Thanks,
> > Mridul
>


Re: Why does spark REPL not embed scala REPL?

2014-05-30 Thread Aaron Davidson
There's some discussion here as well on just using the Scala REPL for 2.11:
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-on-Scala-2-11-td6506.html#a6523

Matei's response mentions the features we needed to change from the Scala
REPL (class-based wrappers and where to output the generated classes),
which were added as options to the 2.11 REPL, so we may be able to trim
down a bunch when 2.11 becomes standard.
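
A hedged sketch of embedding the stock 2.11 REPL with those two options (the flag spellings are assumptions to verify against your Scala version):

import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.IMain

// Class-based wrappers plus a fixed output directory for generated classes.
val settings = new Settings
settings.processArguments(
  List("-Yrepl-class-based", "-Yrepl-outdir", "/tmp/repl-classes"), processAll = true)
settings.usejavacp.value = true
val interpreter = new IMain(settings)
interpreter.interpret("val x = 1 + 1")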


On Fri, May 30, 2014 at 4:16 AM, Kan Zhang  wrote:

> One reason is that the standard Scala REPL uses object-based wrappers, whose
> static initializers will be run on remote worker nodes and may fail due
> to differences between driver and worker nodes. See discussion here
> https://groups.google.com/d/msg/scala-internals/h27CFLoJXjE/JoobM6NiUMQJ
>
>
> On Fri, May 30, 2014 at 1:12 AM, Aniket 
> wrote:
>
> > My apologies in advance if this is a dev mailing list topic. I am working
> > on
> > a small project to provide web interface to spark REPL. The interface
> will
> > allow people to use spark REPL and perform exploratory analysis on the
> > data.
> > I already have a play application running that provides web interface to
> > standard scala REPL and I am just looking to extend it to optionally
> > include
> > support for spark REPL. My initial idea was to include spark dependencies
> > in
> > the project, create a new instance of SparkContext and bind it to a
> > variable
> > (lets say 'sc') using imain.bind("sc", sparkContext). While theoretically
> > this may work, I am trying to understand why spark REPL takes a different
> > path by creating it's own SparkILoop, SparkIMain, etc. Can anyone help me
> > understand why there was a need to provide custom versions of IMain,
> ILoop,
> > etc instead of embedding the standard scala REPL and binding SparkContext
> > instance?
> >
> > Here is my analysis so far:
> > 1. ExecutorClassLoader - I understand this is need to load classes from
> > HDFS. Perhaps this could have been plugged into the standard scala REPL
> > using settings.embeddedDefaults(classLoaderInstance). Also, it's not
> clear
> > what ConstructorCleaner does.
> >
> > 2. SparkCommandLine & SparkRunnerSettings - Allow for providing an extra
> -i
> > file argument to the REPL. The standard sourcepath wouldn't have
> sufficed?
> >
> > 3. SparkExprTyper - The only difference between standard ExprTyper and
> > SparkExprTyper is that repldbg is replaced with logDebug. Not sure if
> this
> > was intentional/needed.
> >
> > 4. SparkILoop - Has a few deviations from standard ILoop class but this
> > could have been managed by extending or wrapping ILoop class or using
> > settings. Not sure what triggered the need to copy the source code and
> make
> > edits.
> >
> > 5. SparkILoopInit - Changes the welcome message and binds spark context
> in
> > the interpreter. Welcome message could have been changed by extending
> > ILoopInit.
> >
> > 6. SparkIMain - Contains quite a few changes around class
> > loading/logging/etc but I found it very hard to figure out if extension
> of
> > IMain was an option and what exactly didn't work/will not work with
> IMain.
> >
> > Rest of the classes seem very similar to their standard counterparts. I
> > have
> > a feeling the spark REPL can be refactored to embed standard scala REPL.
> I
> > know refactoring would not help Spark project as such but would help
> people
> > embed the spark REPL much in the same way it's done with standard scala
> > REPL. Thoughts?
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/Why-does-spark-REPL-not-embed-scala-REPL-tp6871.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
> >
>


Re: [VOTE] Release Apache Spark 1.0.0 (RC10)

2014-05-22 Thread Aaron Davidson
In Spark 0.9.0 and 0.9.1, we stopped using the FileSystem cache correctly,
and we just recently resumed using it in 1.0 (and in 0.9.2) when this issue
was fixed: https://issues.apache.org/jira/browse/SPARK-1676

Prior to this fix, each Spark task created and cached its own FileSystems
due to a bug in how the FS cache handles UGIs. The big problem that arose
was that these FileSystems were never closed, so they just kept piling up.
There were two solutions we considered, with the following effects: (1)
Share the FS cache among all tasks and (2) Each task effectively gets its
own FS cache, and closes all of its FSes after the task completes.

We chose solution (1) for 3 reasons:
 - It does not rely on the behavior of a bug in HDFS.
 - It is the most performant option.
 - It is most consistent with the semantics of the (albeit broken) FS cache.

Since this behavior was changed in 1.0, it could be considered a
regression. We should consider the exact behavior we want out of the FS
cache. For Spark's purposes, it seems fine to cache FileSystems across
tasks, as Spark does not close FileSystems. The issue that comes up is that
user code which uses FileSystem.get() but then closes the FileSystem can
screw up Spark processes which were using that FileSystem. The workaround
for users would be to use FileSystem.newInstance() if they want full
control over the lifecycle of their FileSystems.
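
A minimal sketch of that workaround (the URI and path are illustrative):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// newInstance() bypasses the shared cache, so closing this FileSystem cannot
// invalidate a cached instance that Spark may still be using.
val conf = new Configuration()
val fs = FileSystem.newInstance(new URI("hdfs://namenode:8020"), conf)
try {
  val exists = fs.exists(new Path("/user/example/data"))
  println(s"path exists: $exists")
} finally {
  fs.close() // safe: this instance is private to the application code
}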


On Thu, May 22, 2014 at 12:06 PM, Colin McCabe wrote:

> The FileSystem cache is something that has caused a lot of pain over the
> years.  Unfortunately we (in Hadoop core) can't change the way it works now
> because there are too many users depending on the current behavior.
>
> Basically, the idea is that when you request a FileSystem with certain
> options with FileSystem#get, you might get a reference to an FS object that
> already exists, from our FS cache singleton.  Unfortunately, this
> also means that someone else can change the working directory on you or
> close the FS underneath you.  The FS is basically shared mutable state, and
> you don't know whom you're sharing with.
>
> It might be better for Spark to call FileSystem#newInstance, which bypasses
> the FileSystem cache and always creates a new object.  If Spark can hang on
> to the FS for a while, it can get the benefits of caching without the
> downsides.  In HDFS, multiple FS instances can also share things like the
> socket cache between them.
>
> best,
> Colin
>
>
> On Thu, May 22, 2014 at 10:06 AM, Marcelo Vanzin  >wrote:
>
> > Hi Kevin,
> >
> > On Thu, May 22, 2014 at 9:49 AM, Kevin Markey 
> > wrote:
> > > The FS closed exception only effects the cleanup of the staging
> > directory,
> > > not the final success or failure.  I've not yet tested the effect of
> > > changing my application's initialization, use, or closing of
> FileSystem.
> >
> > Without going and reading more of the Spark code, if your app is
> > explicitly close()'ing the FileSystem instance, it may be causing the
> > exception. If Spark is caching the FileSystem instance, your app is
> > probably closing that same instance (which it got from the HDFS
> > library's internal cache).
> >
> > It would be nice if you could test that theory; it might be worth
> > knowing that's the case so that we can tell people not to do that.
> >
> > --
> > Marcelo
> >
>


Re: (test)

2014-05-16 Thread Aaron Davidson
No. Only 3 of the responses.


On Fri, May 16, 2014 at 10:38 AM, Nishkam Ravi  wrote:

> Yes.
>
>
> On Fri, May 16, 2014 at 8:40 AM, DB Tsai  wrote:
>
> > Yes.
> > On May 16, 2014 8:39 AM, "Andrew Or"  wrote:
> >
> > > Apache has been having some problems lately. Do you guys see this
> > message?
> > >
> >
>


Re: [VOTE] Release Apache Spark 1.0.0 (rc7)

2014-05-16 Thread Aaron Davidson
It was, but due to the apache infra issues, some may not have received the
email yet...

On Fri, May 16, 2014 at 10:48 AM, Henry Saputra wrote:

> Hi Patrick,
>
> Just want to make sure that VOTE for rc6 also cancelled?
>
>
> Thanks,
>
> Henry
>
> On Thu, May 15, 2014 at 1:15 AM, Patrick Wendell 
> wrote:
> > I'll start the voting with a +1.
> >
> > On Thu, May 15, 2014 at 1:14 AM, Patrick Wendell 
> wrote:
> >> Please vote on releasing the following candidate as Apache Spark
> version 1.0.0!
> >>
> >> This patch has minor documentation changes and fixes on top of rc6.
> >>
> >> The tag to be voted on is v1.0.0-rc7 (commit 9212b3e):
> >>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=9212b3e5bb5545ccfce242da8d89108e6fb1c464
> >>
> >> The release files, including signatures, digests, etc. can be found at:
> >> http://people.apache.org/~pwendell/spark-1.0.0-rc7/
> >>
> >> Release artifacts are signed with the following key:
> >> https://people.apache.org/keys/committer/pwendell.asc
> >>
> >> The staging repository for this release can be found at:
> >> https://repository.apache.org/content/repositories/orgapachespark-1015
> >>
> >> The documentation corresponding to this release can be found at:
> >> http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/
> >>
> >> Please vote on releasing this package as Apache Spark 1.0.0!
> >>
> >> The vote is open until Sunday, May 18, at 09:12 UTC and passes if a
> >> majority of at least 3 +1 PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Spark 1.0.0
> >> [ ] -1 Do not release this package because ...
> >>
> >> To learn more about Apache Spark, please see
> >> http://spark.apache.org/
> >>
> >> == API Changes ==
> >> We welcome users to compile Spark applications against 1.0. There are
> >> a few API changes in this release. Here are links to the associated
> >> upgrade guides - user facing changes have been kept as small as
> >> possible.
> >>
> >> changes to ML vector specification:
> >>
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10
> >>
> >> changes to the Java API:
> >>
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
> >>
> >> changes to the streaming API:
> >>
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
> >>
> >> changes to the GraphX API:
> >>
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
> >>
> >> coGroup and related functions now return Iterable[T] instead of Seq[T]
> >> ==> Call toSeq on the result to restore the old behavior
> >>
> >> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> >> ==> Call toSeq on the result to restore old behavior
>


Re: Tests failed after assembling the latest code from github

2014-04-14 Thread Aaron Davidson
By all means, it would be greatly appreciated!


On Mon, Apr 14, 2014 at 10:34 PM, Ye Xianjin  wrote:

> Hi, I think I have found the cause of the tests failing.
>
> I have two disks on my laptop. The spark project dir is on an HDD disk,
> while the tempdir created by google.io.Files.createTempDir is under
> /var/folders/5q/, which is on the system disk, an SSD.
> The ExecutorClassLoaderSuite test uses the
> org.apache.spark.TestUtils.createCompiledClass method.
> The createCompiledClass method first generates the compiled class in the
> pwd (spark/repl), then uses renameTo to move
> the file. The renameTo call fails because the dest file is on a
> different filesystem than the source file.
>
> I modified TestUtils.scala to first copy the file to the dest and then
> delete the original file. The tests go smoothly.
> Should I file a JIRA about this problem? Then I can send a PR on GitHub.
>
> --
> Ye Xianjin
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>
>
> On Tuesday, April 15, 2014 at 3:43 AM, Ye Xianjin wrote:
>
> > well. This is very strange.
> > I looked into ExecutorClassLoaderSuite.scala and ReplSuite.scala and
> made small changes to ExecutorClassLoaderSuite.scala (mostly output some
> internal variables). After that, when running repl test, I noticed the
> ReplSuite
> > was tested first and the test result is ok. But the
> ExecutorClassLoaderSuite test was weird.
> > Here is the output:
> > [info] ExecutorClassLoaderSuite:
> > [error] Uncaught exception when running
> org.apache.spark.repl.ExecutorClassLoaderSuite: java.lang.OutOfMemoryError:
> PermGen space
> > [error] Uncaught exception when running
> org.apache.spark.repl.ExecutorClassLoaderSuite: java.lang.OutOfMemoryError:
> PermGen space
> > Internal error when running tests: java.lang.OutOfMemoryError: PermGen
> space
> > Exception in thread "Thread-3" java.io.EOFException
> > at
> java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2577)
> > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
> > at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1685)
> > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1323)
> > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:349)
> > at sbt.React.react(ForkTests.scala:116)
> > at
> sbt.ForkTests$$anonfun$mainTestTask$1$Acceptor$2$.run(ForkTests.scala:75)
> > at java.lang.Thread.run(Thread.java:695)
> >
> >
> > I reverted my changes. The test result is the same.
> >
> > I touched the ReplSuite.scala file (using the touch command); the test order
> > reversed back to what it was at the very beginning, and the output is also the
> > same (the result in my first post).
> >
> >
> > --
> > Ye Xianjin
> > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> >
> >
> > On Tuesday, April 15, 2014 at 3:14 AM, Aaron Davidson wrote:
> >
> > > This may have something to do with running the tests on a Mac, as
> there is
> > > a lot of File/URI/URL stuff going on in that test which may just have
> > > happened to work if run on a Linux system (like Jenkins). Note that
> this
> > > suite was added relatively recently:
> > > https://github.com/apache/spark/pull/217
> > >
> > >
> > > On Mon, Apr 14, 2014 at 12:04 PM, Ye Xianjin <advance...@gmail.com> wrote:
> > >
> > > > Thank you for your reply.
> > > >
> > > > After building the assembly jar, the repl test still failed. The error
> > > > output is the same as I posted before.
> > > >
> > > > --
> > > > Ye Xianjin
> > > > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > > >
> > > >
> > > > On Tuesday, April 15, 2014 at 1:39 AM, Michael Armbrust wrote:
> > > >
> > > > > I believe you may need an assembly jar to run the ReplSuite.
> "sbt/sbt
> > > > > assembly/assembly".
> > > > >
> > > > > Michael
> > > > >
> > > > >
> > > > > On Mon, Apr 14, 2014 at 3:14 AM, Ye Xianjin <advance...@gmail.com> wrote:
> > > > >
> > > > > > Hi, everyone:
> > > > > > I am new to Spark development. I downloaded Spark's latest code from
> > > > > > github. After runnin

Re: Tests failed after assembling the latest code from github

2014-04-14 Thread Aaron Davidson
This may have something to do with running the tests on a Mac, as there is
a lot of File/URI/URL stuff going on in that test which may just have
happened to work if run on a Linux system (like Jenkins). Note that this
suite was added relatively recently:
https://github.com/apache/spark/pull/217


On Mon, Apr 14, 2014 at 12:04 PM, Ye Xianjin  wrote:

> Thank you for your reply.
>
> After building the assembly jar, the repl test still failed. The error
> output is the same as I posted before.
>
> --
> Ye Xianjin
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>
>
> On Tuesday, April 15, 2014 at 1:39 AM, Michael Armbrust wrote:
>
> > I believe you may need an assembly jar to run the ReplSuite. "sbt/sbt
> > assembly/assembly".
> >
> > Michael
> >
> >
> > On Mon, Apr 14, 2014 at 3:14 AM, Ye Xianjin <advance...@gmail.com> wrote:
> >
> > > Hi, everyone:
> > > I am new to Spark development. I downloaded Spark's latest code from
> > > github. After running sbt/sbt assembly,
> > > I began running sbt/sbt test in the Spark source code dir, but it failed
> > > while running the repl module test.
> > >
> > > Here are some output details.
> > >
> > > command:
> > > sbt/sbt "test-only org.apache.spark.repl.*"
> > > output:
> > >
> > > [info] Loading project definition from
> > > /Volumes/MacintoshHD/github/spark/project/project
> > > [info] Loading project definition from
> > > /Volumes/MacintoshHD/github/spark/project
> > > [info] Set current project to root (in build
> > > file:/Volumes/MacintoshHD/github/spark/)
> > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > [info] No tests to run for graphx/test:testOnly
> > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > [info] No tests to run for bagel/test:testOnly
> > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > [info] No tests to run for streaming/test:testOnly
> > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > [info] No tests to run for mllib/test:testOnly
> > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > [info] No tests to run for catalyst/test:testOnly
> > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > [info] No tests to run for core/test:testOnly
> > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > [info] No tests to run for assembly/test:testOnly
> > > [info] Passed: Total 0, Failed 0, Errors 0, Passed 0
> > > [info] No tests to run for sql/test:testOnly
> > > [info] ExecutorClassLoaderSuite:
> > > 2014-04-14 16:59:31.247 java[8393:1003] Unable to load realm info from
> > > SCDynamicStore
> > > [info] - child first *** FAILED *** (440 milliseconds)
> > > [info] java.lang.ClassNotFoundException: ReplFakeClass2
> > > [info] at java.lang.ClassLoader.findClass(ClassLoader.java:364)
> > > [info] at
> > >
> org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
> > > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> > > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> > > [info] at
> > >
> org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
> > > [info] at
> > >
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$findClass$1.apply(ExecutorClassLoader.scala:57)
> > > [info] at
> > >
> org.apache.spark.repl.ExecutorClassLoader$$anonfun$findClass$1.apply(ExecutorClassLoader.scala:57)
> > > [info] at scala.Option.getOrElse(Option.scala:120)
> > > [info] at
> > >
> org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:57)
> > > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> > > [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> > > [info] at
> > >
> org.apache.spark.repl.ExecutorClassLoaderSuite$$anonfun$1.apply$mcV$sp(ExecutorClassLoaderSuite.scala:47)
> > > [info] at
> > >
> org.apache.spark.repl.ExecutorClassLoaderSuite$$anonfun$1.apply(ExecutorClassLoaderSuite.scala:44)
> > > [info] at
> > >
> org.apache.spark.repl.ExecutorClassLoaderSuite$$anonfun$1.apply(ExecutorClassLoaderSuite.scala:44)
> > > [info] at org.scalatest.FunSuite$$anon$1.apply(FunSuite.scala:1265)
> > > [info] at org.scalatest.Suite$class.withFixture(Suite.scala:1974)
> > > [info] at
> > >
> org.apache.spark.repl.ExecutorClassLoaderSuite.withFixture(ExecutorClassLoaderSuite.scala:30)
> > > [info] at
> > > org.scalatest.FunSuite$class.invokeWithFixture$1(FunSuite.scala:1262)
> > > [info] at
> > > org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
> > > [info] at
> > > org.scalatest.FunSuite$$anonfun$runTest$1.apply(FunSuite.scala:1271)
> > > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:198)
> > > [info] at org.scalatest.FunSuite$class.runTest(FunSuite.scala:1271)
> > > [info] at
> > >
> org.apache.spark.repl.ExecutorClassLoaderSuite.runTest(ExecutorClassLoaderSuite.scala:30)
> > > [info] at
> > > org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.scala:1304)
> > > [info] at
> > > org.scalatest.FunSuite$$anonfun$runTests$1.apply(FunSuite.

Re: Contributing to Spark

2014-04-08 Thread Aaron Davidson
Matei's link seems to point to a specific starter project as part of the
starter list, but here is the list itself:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)


On Mon, Apr 7, 2014 at 10:22 PM, Matei Zaharia wrote:

> I'd suggest looking for the issues labeled "Starter" on JIRA. You can find
> them here:
> https://issues.apache.org/jira/browse/SPARK-1438?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)
>
> Matei
>
> On Apr 7, 2014, at 9:45 PM, Mukesh G  wrote:
>
> > Hi Sujeet,
> >
> >    Thanks. I went through the website and it looks great. Is there a list of
> > items that I can choose from for contribution?
> >
> > Thanks
> >
> > Mukesh
> >
> >
> > On Mon, Apr 7, 2014 at 10:14 PM, Sujeet Varakhedi
> > wrote:
> >
> >> This is a good place to start:
> >> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
> >>
> >> Sujeet
> >>
> >>
> >> On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G  wrote:
> >>
> >>> Hi,
> >>>
> >>>   How do I contribute to Spark and its associated projects?
> >>>
> >>> Appreciate the help...
> >>>
> >>> Thanks
> >>>
> >>> Mukesh
> >>>
> >>
>
>


Re: spark config params conventions

2014-03-12 Thread Aaron Davidson
One solution for typesafe config is to use
"spark.speculation" = true

Typesafe Config will recognize the key as a string rather than a path, so the name
will actually come back as "\"spark.speculation\"". You need to handle this
case when passing the config options to Spark (by stripping the
quotes from the key).

Solving this in Spark itself is a little tricky because there are ~5 such
conflicts (spark.serializer, spark.speculation, spark.locality.wait,
spark.shuffle.spill, and spark.cleaner.ttl), some of which are used pretty
frequently. We could provide aliases for all of these in Spark, but
actually deprecating the old ones would affect many users, so we could only
do that if enough users would benefit from fully hierarchical config
options.
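
As a rough illustration of the quoted-key workaround (a sketch only, assuming the parsed config is then handed to Spark via system properties, as in the quoted message below):

    import com.typesafe.config.ConfigFactory
    import scala.collection.JavaConverters._

    // Quote the conflicting key so Typesafe Config treats it as a single string key.
    val conf = ConfigFactory.parseString("""
      "spark.speculation" = true
      spark.speculation.interval = 0.5
    """)

    conf.entrySet().asScala.foreach { entry =>
      // Quoted keys come back with the quotes included ("spark.speculation"),
      // so strip them before handing the key to Spark.
      val key = entry.getKey.replace("\"", "")
      System.setProperty(key, entry.getValue.unwrapped.toString)
    }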



On Wed, Mar 12, 2014 at 9:24 AM, Mark Hamstra wrote:

> That's the whole reason why some of the intended configuration changes
> were backed out just before the 0.9.0 release.  It's a well-known issue,
> even if a completely satisfactory solution isn't as well-known and is
> probably something we should do another iteration on.
>
>
> On Wed, Mar 12, 2014 at 9:10 AM, Koert Kuipers  wrote:
>
>> I am reading the Spark configuration params from another configuration
>> object (Typesafe Config) before setting them as system properties.
>>
>> I noticed Typesafe Config has trouble with settings like:
>> spark.speculation=true
>> spark.speculation.interval=0.5
>>
>> The issue seems to be that if spark.speculation is a "container" that has
>> more values inside, then it cannot also be a value itself, I think. So this
>> would work fine:
>> spark.speculation.enabled=true
>> spark.speculation.interval=0.5
>>
>> Just a heads up. I would probably suggest we avoid this situation.
>>
>
>


Re: spark config params conventions

2014-03-12 Thread Aaron Davidson
Should we try to deprecate these types of configs for 1.0.0? We can start
by accepting both and giving a warning if you use the old one, and then
actually remove them in the next minor release. I think
"spark.speculation.enabled=true" is better than "spark.speculation=true",
and if we decide to use typesafe configs again ourselves, this change is
necessary.

We actually don't have to ever complete the deprecation - we can always
accept both spark.speculation and spark.speculation.enabled, and people
just have to use the latter if they want to use typesafe config.
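
A rough sketch of the "accept both and warn" idea (illustrative only; the lookup function and helper are hypothetical, not Spark's actual configuration API):

    // Prefer the new hierarchical key; fall back to the old one with a warning.
    def speculationEnabled(lookup: String => Option[String]): Boolean = {
      lookup("spark.speculation.enabled").orElse {
        val old = lookup("spark.speculation")
        if (old.isDefined) {
          System.err.println(
            "WARN: spark.speculation is deprecated; use spark.speculation.enabled instead")
        }
        old
      }.exists(_.toBoolean)
    }

    // e.g. speculationEnabled(key => Option(System.getProperty(key)))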


On Wed, Mar 12, 2014 at 9:24 AM, Mark Hamstra wrote:

> That's the whole reason why some of the intended configuration changes
> were backed out just before the 0.9.0 release.  It's a well-known issue,
> even if a completely satisfactory solution isn't as well-known and is
> probably something we should do another iteration on.
>
>
> On Wed, Mar 12, 2014 at 9:10 AM, Koert Kuipers  wrote:
>
>> I am reading the Spark configuration params from another configuration
>> object (Typesafe Config) before setting them as system properties.
>>
>> I noticed Typesafe Config has trouble with settings like:
>> spark.speculation=true
>> spark.speculation.interval=0.5
>>
>> The issue seems to be that if spark.speculation is a "container" that has
>> more values inside, then it cannot also be a value itself, I think. So this
>> would work fine:
>> spark.speculation.enabled=true
>> spark.speculation.interval=0.5
>>
>> Just a heads up. I would probably suggest we avoid this situation.
>>
>
>


Re: Github emails

2014-02-24 Thread Aaron Davidson
By the way, we still need to get our JIRAs migrated over to the Apache
system. Unrelated, just... saying.


On Mon, Feb 24, 2014 at 10:55 PM, Matei Zaharia wrote:

> This is probably a snafu because we had a GitHub hook that was sending
> messages to d...@spark.incubator.apache.org, and that list was recently
> moved (or is in the process of being moved?) to dev@spark.apache.org.
> Unfortunately there's nothing we can do to change it on our end, but this
> was originally set up by Daniel Gruno and Jake Farrel:
> https://issues.apache.org/jira/browse/INFRA-7276.
>
> Matei
>
> On Feb 24, 2014, at 10:46 PM, OmPrakash Muppirala 
> wrote:
>
> > On Feb 24, 2014 10:43 PM, "OmPrakash Muppirala" 
> > wrote:
> >
> >> Looks like the Apache Spark project is sending Github emails to the
> >> infrastructure-dev mailing list.  Can someone please fix this?
> >>
> >> Thanks,
> >> Om
> >>
>
>