Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-25 Thread pishen
Thank you for the suggestions. Actually, this project has already been on
spark-packages for 1-2 months.
Then I think what I need is some promotion :P

2015-08-25 23:51 GMT+08:00 saurfang [via Apache Spark Developers List] 
ml-node+s1001551n1380...@n3.nabble.com:

 This is very cool. I also have a sbt plugin that automates some aspects of
 spark-submit but for a slightly different goal:
 https://github.com/saurfang/sbt-spark-submit

 The hope there is to address the problem that one can have many Spark main
 functions in a single jar, and development often involves: change the
 code, run sbt assembly, scp the jar to the cluster, and run spark-submit with
 the fully qualified class name and additional application arguments.
 With my plugin, I'm able to capture all these steps in single, customizable
 sbt tasks that are easy to remember (and auto-complete in the sbt
 console), so you can have multiple sbt tasks corresponding to different main
 functions, sub-projects and/or default arguments, and make the
 build/deploy/submit cycle go straight through.
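 As a rough illustration of the idea (this is not sbt-spark-submit's actual
 API, just a hand-rolled sketch; the main class, master, and application
 arguments below are hypothetical, and sbt-assembly is assumed to be on the
 build):

     // build.sbt (illustrative only)
     import scala.sys.process._

     lazy val submitWordCount =
       taskKey[Unit]("Assemble the jar and spark-submit the WordCount main")

     submitWordCount := {
       val jar = assembly.value                 // fat jar from sbt-assembly
       val cmd = Seq(
         "spark-submit",
         "--class", "com.example.WordCount",    // hypothetical main class
         "--master", "yarn-cluster",
         jar.getAbsolutePath,
         "hdfs:///input", "hdfs:///output")     // hypothetical app arguments
       require(cmd.! == 0, "spark-submit failed")
     }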


 Currently this works great for YARN because YARN takes care of the jar
 upload and master URL discovery. I have long wanted to make my plugin work
 with spark-ec2 so I can upload the jar and infer the master URL
 programmatically.

 Thanks for sharing, and like Akhil said, it'd be nice to have it on
 spark-packages for discovery.


Re: Spark (1.2.0) submit fails with exception saying log directory already exists

2015-08-25 Thread Marcelo Vanzin
This probably means your app is failing and the second attempt is
hitting that issue. You may fix the "directory already exists" error
by setting spark.eventLog.overwrite=true in your conf, but most probably
that will just expose the actual error in your app.
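For example, a minimal way to set that flag from application code (it can
equally be passed with --conf on spark-submit or in spark-defaults.conf):

    import org.apache.spark.{SparkConf, SparkContext}

    // Allow the event log for a re-attempted application to overwrite
    // the directory left behind by the failed first attempt.
    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.overwrite", "true")
    val sc = new SparkContext(conf)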

On Tue, Aug 25, 2015 at 9:37 AM, Varadhan, Jawahar
varad...@yahoo.com.invalid wrote:
 Here is the error


 yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason:
 User class threw exception: Log directory
 hdfs://Sandbox/user/spark/applicationHistory/application_1438113296105_0302
 already exists!)


 I am using Cloudera 5.3.2 with Spark 1.2.0


 Any help is appreciated.


 Thanks

 Jay






-- 
Marcelo




Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-25 Thread Nicholas Chammas
Final chance to fill out the survey!

http://goo.gl/forms/erct2s6KRR

I'm gonna close it to new responses tonight and send out a summary of the
results.

Nick

On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 I'm planning to close the survey to further responses early next week.

 If you haven't chimed in yet, the link to the survey is here:

 http://goo.gl/forms/erct2s6KRR

 We already have some great responses, which you can view. I'll share a
 summary after the survey is closed.

 Cheers!

 Nick


 On Mon, Aug 17, 2015 at 11:09 AM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Howdy folks!

 I’m interested in hearing about what people think of spark-ec2
 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the
 formal JIRA process. Your answers will all be anonymous and public.

 If the embedded form below doesn’t work for you, you can use this link to
 get the same survey:

 http://goo.gl/forms/erct2s6KRR

 Cheers!
 Nick
 ​




Spark (1.2.0) submit fails with exception saying log directory already exists

2015-08-25 Thread Varadhan, Jawahar
Here is the error
yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User 
class threw exception: Log directory 
hdfs://Sandbox/user/spark/applicationHistory/application_1438113296105_0302 
already exists!)
I am using Cloudera 5.3.2 with Spark 1.2.0
Any help is appreciated.
Thanks,
Jay



Re: Spark builds: allow user override of project version at buildtime

2015-08-25 Thread Marcelo Vanzin
On Tue, Aug 25, 2015 at 2:17 AM,  andrew.row...@thomsonreuters.com wrote:
 Then, if I wanted to do a build against a specific profile, I could also
 pass in a -Dspark.version=1.4.1-custom-string and have the output artifacts
 correctly named. The default behaviour should be the same. Child pom files
 would need to reference ${spark.version} in their parent section I think.

 Any objections to this?

Have you tried it? My understanding is that no project does that
because it doesn't work. To resolve properties you need to read the
parent pom(s), and if there's a variable reference there, well, you
can't do it. Chicken & egg.

-- 
Marcelo




[VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-25 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/


The tag to be voted on is v1.5.0-rc2:
https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release (published as 1.5.0-rc2) can be
found at:
https://repository.apache.org/content/repositories/orgapachespark-1141/

The staging repository for this release (published as 1.5.0) can be found
at:
https://repository.apache.org/content/repositories/orgapachespark-1140/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/


===
How can I help test this release?
===
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions.



What justifies a -1 vote for this release?

This vote is happening towards the end of the 1.5 QA period, so -1 votes
should only occur for significant regressions from 1.4. Bugs already
present in 1.4, minor regressions, or bugs related to new features will not
block this release.


===
What should happen to JIRA tickets still targeting 1.5.0?
===
1. It is OK for documentation patches to target 1.5.0 and still go into
branch-1.5, since documentation will be packaged separately from the
release.
2. New features for non-alpha-modules should target 1.6+.
3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
version.


==
Major changes to help you focus your testing
==

As of today, Spark 1.5 contains more than 1000 commits from 220+
contributors. I've curated a list of important changes for 1.5. For the
complete list, please refer to Apache JIRA changelog.

RDD/DataFrame/SQL APIs

- New UDAF interface
- DataFrame hints for broadcast join
- expr function for turning a SQL expression into DataFrame column
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision is reduced to 1us
- 100 new built-in expressions, including date/time, string, math
- memory and local disk only checkpointing
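A minimal sketch of the broadcast-join hint and the expr function mentioned in
the bullets above (the DataFrames used here are illustrative):

    import org.apache.spark.sql.functions.{broadcast, expr}

    // Hint the planner to broadcast the small dimension table in the join
    val joined = factsDF.join(broadcast(dimDF), "key")

    // Turn a SQL expression string into a DataFrame column
    val lengths = namesDF.select(expr("length(name) + 1").as("len_plus_one"))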

DataFrame/SQL Backend Execution

- Code generation on by default
- Improved join, aggregation, shuffle, sorting with cache friendly
algorithms and external algorithms
- Improved window function performance
- Better metrics instrumentation and reporting for DF/SQL execution plans

Data Sources, Hive, Hadoop, Mesos and Cluster Management

- Dynamic allocation support in all resource managers (Mesos, YARN,
Standalone)
- Improved Mesos support (framework authentication, roles, dynamic
allocation, constraints)
- Improved YARN support (dynamic allocation with preferred locations)
- Improved Hive support (metastore partition pruning, metastore
connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
- Support persisting data in Hive compatible format in metastore
- Support data partitioning for JSON data sources
- Parquet improvements (upgrade to 1.7, predicate pushdown, faster metadata
discovery and schema merging, support reading non-standard legacy Parquet
files generated by other libraries)
- Faster and more robust dynamic partition insert
- DataSourceRegister interface for external data sources to specify short
names

SparkR

- YARN cluster mode in R
- GLMs with R formula, binomial/Gaussian families, and elastic-net
regularization
- Improved error messages
- Aliases to make DataFrame functions more R-like

Streaming

- Backpressure for handling bursty input streams.
- Improved Python support for streaming sources (Kafka offsets, Kinesis,
MQTT, Flume)
- Improved Python streaming machine learning algorithms (K-Means, linear
regression, logistic regression)
- Native reliable Kinesis stream support
- Input metadata like Kafka offsets made visible in the batch details UI
- Better load balancing and scheduling of receivers across cluster
- Include streaming storage in web UI

Machine Learning and Advanced Analytics

- Feature transformers: CountVectorizer, Discrete Cosine transformation,
MinMaxScaler, NGram, PCA, RFormula, 

RE: Dataframe aggregation with Tungsten unsafe

2015-08-25 Thread Wang, Yanping
Hi, Reynold and others

I agree with your comments on mid-tenured objects and GC. In fact, dealing with
mid-tenured objects is the major challenge for all Java GC implementations.

I am wondering if anyone has played with the -XX:+PrintTenuringDistribution flag
and seen what exactly the age distribution looks like when your program runs?
My output with -XX:+PrintGCDetails looks like the below (Oracle JDK 8 update 60:
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)

Ages 1-5 are the young guys; 13, 14, and 15 are the old guys.
The guys in the middle will have to be copied multiple times before they become
dead; once they end up in old regions, it normally takes some major GC to clean
them up.

Desired survivor size 2583691264 bytes, new threshold 15 (max 15)
- age   1:   13474960 bytes,   13474960 total
- age   2:    2815592 bytes,   16290552 total
- age   3: 632784 bytes,   16923336 total
- age   4: 428432 bytes,   17351768 total
- age   5: 648696 bytes,   18000464 total
- age   6: 572328 bytes,   18572792 total
- age   7: 549216 bytes,   19122008 total
- age   8: 539544 bytes,   19661552 total
- age   9: 422256 bytes,   20083808 total
- age  10: 552928 bytes,   20636736 total
- age  11: 430464 bytes,   21067200 total
- age  12: 753320 bytes,   21820520 total
- age  13: 230864 bytes,   22051384 total
- age  14: 276288 bytes,   22327672 total
- age  15: 809272 bytes,   23136944 total

I’d love to see what others’ objects’ age distributions look like. Actually, once
we know the age distribution for some particular use cases, we can find ways
to avoid Full GC. Full GC is expensive because both CMS and G1 Full GC are
single-threaded. GC tuning nowadays becomes a task of just trying to avoid Full
GC completely.
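For anyone who wants to collect the same distribution from Spark executors, a
minimal sketch using the standard spark.executor.extraJavaOptions setting (the
application name here is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Ask each executor JVM to print GC details plus the tenuring distribution
    val conf = new SparkConf()
      .setAppName("gc-age-distribution")
      .set("spark.executor.extraJavaOptions",
           "-XX:+PrintGCDetails -XX:+PrintTenuringDistribution")
    val sc = new SparkContext(conf)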

Thanks
-yanping

From: Reynold Xin [mailto:r...@databricks.com]
Sent: Tuesday, August 25, 2015 6:05 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Dataframe aggregation with Tungsten unsafe

There is a lot of GC activity due to the non-code-gen path being sloppy about
garbage creation. This is not actually what happens, but just as an example:

rdd.map { i: Int => i + 1 }

Under the hood, this becomes a closure that boxes every input and every
output, creating two extra objects.
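For contrast, a rough sketch of the same increment through the two paths,
assuming sc and sqlContext are the usual shell contexts: the RDD closure boxes
each Int, while the DataFrame projection is code-generated and stays on
primitive values.

    // RDD path: each element is boxed on the way into and out of the closure
    val viaRdd = sc.parallelize(1 to 1000000).map { i: Int => i + 1 }

    // DataFrame path: the expression is code-generated and works on primitives
    val viaDf = sqlContext.range(0, 1000000).selectExpr("id + 1")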

The reality is more complicated than this -- but here's a simpler view of what 
happens with GC in these cases. You might've heard from other places that the 
JVM is very efficient about transient object allocations. That is true when you 
look at these allocations in isolation, but unfortunately not true when you 
look at them in aggregate.

First, due to the way the iterator interface is constructed, it is hard for the 
JIT compiler to on-stack allocate these objects. Then two things happen:

1. They pile up and cause more young gen GCs to happen.
2. After a few young gen GCs, some mid-tenured objects (e.g. an aggregation 
map) get copied into the old-gen, and eventually requires a full GC to free 
them. Full GCs are much more expensive than young gen GCs (usually involves 
copying all the data in the old gen).

So the more garbage that is created, the more frequently full GC happens.

The more long-lived objects there are in the old gen (e.g. cache), the more
expensive full GC is.



On Tue, Aug 25, 2015 at 5:19 PM, Ulanov, Alexander
alexander.ula...@hp.com wrote:
Thank you for the explanation. The size of the 100M data is ~1.4GB in memory
and each worker has 32GB of memory. There seems to be a lot of free memory
available. I wonder how Spark can hit GC with such a setup?

Reynold Xin r...@databricks.com

On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander
alexander.ula...@hp.com wrote:

It seems that there is a nice improvement with Tungsten enabled given that data
is persisted in memory 2x and 3x. However, the improvement is not that nice for
parquet, it is 1.5x. What’s interesting is that with Tungsten enabled, the
performance of in-memory data and parquet data aggregation is similar. Could
anyone comment on this? It seems counterintuitive to me.

Local performance was not as good as Reynold’s. I got around 1.5x, he had
5x. However, local mode is not interesting.


I think a large part of that is coming from the pressure created by JVM GC. 
Putting more data in-memory makes GC worse, unless GC is well tuned.





Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-25 Thread Tom Graves
Is there a jira to update the sql hive docs? (Spark SQL and DataFrames - Spark
1.5.0 Documentation, on people.apache.org)

It still says the default is 0.13.1, but the pom file builds with hive 1.2.1-spark.
Tom 


 On Monday, August 24, 2015 4:06 PM, Sandy Ryza sandy.r...@cloudera.com 
wrote:
   

 I see that there's a 1.5.0-rc2 tag in GitHub now.  Is that the official RC2
tag to start trying out?
-Sandy
On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen so...@cloudera.com wrote:

PS Shixiong Zhu is correct that this one has to be fixed:
https://issues.apache.org/jira/browse/SPARK-10168

For example you can see assemblies like this are nearly empty:
https://repository.apache.org/content/repositories/orgapachespark-1137/org/apache/spark/spark-streaming-flume-assembly_2.10/1.5.0-rc1/

Just a publishing glitch but worth a few more eyes on.

On Fri, Aug 21, 2015 at 5:28 PM, Sean Owen so...@cloudera.com wrote:
 Signatures, license, etc. look good. I'm getting some fairly
 consistent failures using Java 7 + Ubuntu 15 + -Pyarn -Phive
 -Phive-thriftserver -Phadoop-2.6 -- does anyone else see these? they
 are likely just test problems, but worth asking. Stack traces are at
 the end.

 There are currently 79 issues targeted for 1.5.0, of which 19 are
 bugs, of which 1 is a blocker. (1032 have been resolved for 1.5.0.)
 That's significantly better than at the last release. I presume a lot
 of what's still targeted is not critical and can now be
 untargeted/retargeted.

 It occurs to me that the flurry of planning that took place at the
 start of the 1.5 QA cycle a few weeks ago was quite helpful, and is
 the kind of thing that would be even more useful at the start of a
 release cycle. So would be great to do this for 1.6 in a few weeks.
 Indeed there are already 267 issues targeted for 1.6.0 -- a decent
 roadmap already.


 Test failures:

 Core

 - Unpersisting TorrentBroadcast on executors and driver in distributed
 mode *** FAILED ***
   java.util.concurrent.TimeoutException: Can't find 2 executors before
 1 milliseconds elapsed
   at 
org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561)
   at 
org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313)
   at 
org.apache.spark.broadcast.BroadcastSuite.org$apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287)
   at 
org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply$mcV$sp(BroadcastSuite.scala:165)
   at 
org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
   at 
org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...

 Streaming

 - stop slow receiver gracefully *** FAILED ***
   0 was not greater than 0 (StreamingContextSuite.scala:324)

 Kafka

 - offset recovery *** FAILED ***
   The code passed to eventually never returned normally. Attempted 191
 times over 10.043196973 seconds. Last failure message:
 strings.forall({
      ((elem: Any) => DirectKafkaStreamSuite.collectedData.contains(elem))
   }) was false. (DirectKafkaStreamSuite.scala:249)

 On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin r...@databricks.com wrote:
 Please vote on releasing the following candidate as Apache Spark version
 1.5.0!

 The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.5.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/


 The tag to be voted on is v1.5.0-rc1:
 https://github.com/apache/spark/tree/4c56ad772637615cc1f4f88d619fac6c372c8552

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1137/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-docs/


 ===
 == How can I help test this release? ==
 ===
 If you are a Spark user, you can help us test this release 

Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-25 Thread Akhil Das
You can add it to the Spark packages, I guess: http://spark-packages.org/

Thanks
Best Regards

On Fri, Aug 14, 2015 at 1:45 PM, pishen tsai pishe...@gmail.com wrote:

 Sorry for the previous line-breaking format; resending the mail again.

 I have written a sbt plugin called spark-deployer, which is able to deploy
 a standalone Spark cluster on AWS EC2 and submit jobs to it.
 https://github.com/pishen/spark-deployer

 Compared to current spark-ec2 script, this design may have several
 benefits (features):
 1. All the code is written in Scala.
 2. Just add one line in your project/plugins.sbt and you are ready to go (see
 the sketch after this list). (You don't have to download the python code and
 store it somewhere.)
 3. The whole development flow (write code for spark job, compile the code,
 launch the cluster, assembly and submit the job to master, terminate the
 cluster when the job is finished) can be done in sbt.
 4. Support parallel deployment of the worker machines by Scala's Future.
 5. Allow dynamically add or remove worker machines to/from the current
 cluster.
 6. All the configurations are stored in a typesafe config file. You don't
 need to store it elsewhere and map the settings into spark-ec2's command
 line arguments.
 7. The core library is separated from sbt plugin, hence it's possible to
 execute the deployment from an environment without sbt (only JVM is
 required).
 8. Support adjustable ec2 root disk size, custom security groups, custom
 ami (can run on default Amazon ami), custom spark tarball, and VPC. (Well,
 most of these are also supported in spark-ec2 in slightly different form,
 just mention it anyway.)
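 As a sketch of point 2, the one line is an addSbtPlugin entry in
 project/plugins.sbt (the group, artifact, and version below are placeholders;
 check the project's README for the real coordinates):

     // project/plugins.sbt (coordinates are illustrative)
     addSbtPlugin("net.pishen" % "spark-deployer-sbt" % "x.y.z")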

 Since this project is still in its early stage, it lacks some features of
 spark-ec2 such as self-installed HDFS (we use s3 directly), stoppable
 cluster, ganglia, and the copy script.
 However, it's already usable for our company and we are trying to move our
 production spark projects from spark-ec2 to spark-deployer.

 Any suggestions, testing help, or pull requests are highly appreciated.

 On top of that, I would like to contribute this project to Spark, maybe as
 another choice (suggestion link) alongside spark-ec2 on Spark's official
 documentation.
 Of course, before that, I have to make this project stable enough (strange
 errors just happen on the AWS API from time to time).
 I'm wondering if this kind of contribution is possible, and whether there is
 any rule to follow or anyone to contact?
 (Maybe the source code will not be merged into spark's main repository,
 since I've noticed that spark-ec2 is also planning to move out.)

 Regards,
 Pishen Tsai




Spark builds: allow user override of project version at buildtime

2015-08-25 Thread andrew.rowson
I've got an interesting challenge in building Spark. For various reasons we
do a few different builds of spark, typically with a few different profile
options (e.g. against different versions of Hadoop, some with/without Hive
etc.). We mirror the spark repo internally and have a buildserver that
builds and publishes different Spark versions to an artifactory server. The
problem is that the output of each build is published with the version that
is in the pom.xml file - a build of Spark @tags/v1.4.1 always comes out with
an artefact version of '1.4.1'. However, because we may have three different
Spark builds for 1.4.1, it'd be useful to be able to override this version
at build time, so that we can publish 1.4.1, 1.4.1-cdh5.3.3 and maybe
1.4.1-cdh5.3.3-hive as separate artifacts. 

My understanding of maven is that the /project/version value in the pom.xml
isn't overridable. At the moment, I've hacked around this by having a
pre-build task that rewrites the various pom files and adjust the version to
a string that's correct for that particular build. 

Would it be useful to instead populate the version from a maven property,
which could then be overridden on the CLI? Something like:

<project>
  <version>${spark.version}</version>
  <properties>
    <spark.version>1.4.1</spark.version>
  </properties>
</project>

Then, if I wanted to do a build against a specific profile, I could also
pass in a -Dspark.version=1.4.1-custom-string and have the output artifacts
correctly named. The default behaviour should be the same. Child pom files
would need to reference ${spark.version} in their parent section I think.

Any objections to this?

Andrew




Paring down / tagging tests (or some other way to avoid timeouts)?

2015-08-25 Thread Marcelo Vanzin
Hello y'all,

So I've been getting kinda annoyed with how many PR tests have been
timing out. I took one of the logs from one of my PRs and started to
do some crunching on the data from the output, and here's a list of
the 5 slowest suites:

307.14s HiveSparkSubmitSuite
382.641s VersionsSuite
398s CliSuite
410.52s HashJoinCompatibilitySuite
2508.61s HiveCompatibilitySuite

Looking at those, I'm not surprised at all that we see so many
timeouts. Is there any ongoing effort to trim down those tests
(especially HiveCompatibilitySuite) or somehow restrict when they're
run?

Almost 1 hour to run a single test suite that affects a rather
isolated part of the code base looks a little excessive to me.

-- 
Marcelo




Re: Paring down / tagging tests (or some other way to avoid timeouts)?

2015-08-25 Thread Michael Armbrust
I'd be okay skipping the HiveCompatibilitySuite for core-only changes.
They do often catch bugs in changes to catalyst or sql though.  Same for
HashJoinCompatibilitySuite/VersionsSuite.

HiveSparkSubmitSuite/CliSuite should probably stay, as they do test things
like addJar that have been broken by core in the past.

On Tue, Aug 25, 2015 at 1:40 PM, Patrick Wendell pwend...@gmail.com wrote:

 There is already code in place that restricts which tests run
 depending on which code is modified. However, changes inside of
 Spark's core currently require running all dependent tests. If you
 have some ideas about how to improve that heuristic, it would be
 great.

 - Patrick

 On Tue, Aug 25, 2015 at 1:33 PM, Marcelo Vanzin van...@cloudera.com
 wrote:
  Hello y'all,
 
  So I've been getting kinda annoyed with how many PR tests have been
  timing out. I took one of the logs from one of my PRs and started to
  do some crunching on the data from the output, and here's a list of
  the 5 slowest suites:
 
  307.14s HiveSparkSubmitSuite
  382.641s VersionsSuite
  398s CliSuite
  410.52s HashJoinCompatibilitySuite
  2508.61s HiveCompatibilitySuite
 
  Looking at those, I'm not surprised at all that we see so many
  timeouts. Is there any ongoing effort to trim down those tests
  (especially HiveCompatibilitySuite) or somehow restrict when they're
  run?
 
  Almost 1 hour to run a single test suite that affects a rather
  isolated part of the code base looks a little excessive to me.
 
  --
  Marcelo
 




Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-25 Thread Tom Graves
Anyone using HiveContext with secure Hive with Spark 1.5 and have it working?
We have a non-standard version of hive, but it was pulling our hive jars and it's
failing to authenticate. It could be something in our hive version, but I'm
wondering if spark isn't forwarding credentials properly.
Tom 


 On Tuesday, August 25, 2015 1:56 PM, Tom Graves 
tgraves...@yahoo.com.INVALID wrote:
   

 Is there a jira to update the sql hive docs? (Spark SQL and DataFrames - Spark
1.5.0 Documentation, on people.apache.org)

It still says the default is 0.13.1, but the pom file builds with hive 1.2.1-spark.
Tom 


 On Monday, August 24, 2015 4:06 PM, Sandy Ryza sandy.r...@cloudera.com 
wrote:
   

 I see that there's a 1.5.0-rc2 tag in GitHub now.  Is that the official RC2
tag to start trying out?
-Sandy
On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen so...@cloudera.com wrote:

PS Shixiong Zhu is correct that this one has to be fixed:
https://issues.apache.org/jira/browse/SPARK-10168

For example you can see assemblies like this are nearly empty:
https://repository.apache.org/content/repositories/orgapachespark-1137/org/apache/spark/spark-streaming-flume-assembly_2.10/1.5.0-rc1/

Just a publishing glitch but worth a few more eyes on.

On Fri, Aug 21, 2015 at 5:28 PM, Sean Owen so...@cloudera.com wrote:
 Signatures, license, etc. look good. I'm getting some fairly
 consistent failures using Java 7 + Ubuntu 15 + -Pyarn -Phive
 -Phive-thriftserver -Phadoop-2.6 -- does anyone else see these? they
 are likely just test problems, but worth asking. Stack traces are at
 the end.

 There are currently 79 issues targeted for 1.5.0, of which 19 are
 bugs, of which 1 is a blocker. (1032 have been resolved for 1.5.0.)
 That's significantly better than at the last release. I presume a lot
 of what's still targeted is not critical and can now be
 untargeted/retargeted.

 It occurs to me that the flurry of planning that took place at the
 start of the 1.5 QA cycle a few weeks ago was quite helpful, and is
 the kind of thing that would be even more useful at the start of a
 release cycle. So would be great to do this for 1.6 in a few weeks.
 Indeed there are already 267 issues targeted for 1.6.0 -- a decent
 roadmap already.


 Test failures:

 Core

 - Unpersisting TorrentBroadcast on executors and driver in distributed
 mode *** FAILED ***
   java.util.concurrent.TimeoutException: Can't find 2 executors before
 1 milliseconds elapsed
   at 
org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561)
   at 
org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313)
   at 
org.apache.spark.broadcast.BroadcastSuite.org$apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287)
   at 
org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply$mcV$sp(BroadcastSuite.scala:165)
   at 
org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
   at 
org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
   at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...

 Streaming

 - stop slow receiver gracefully *** FAILED ***
   0 was not greater than 0 (StreamingContextSuite.scala:324)

 Kafka

 - offset recovery *** FAILED ***
   The code passed to eventually never returned normally. Attempted 191
 times over 10.043196973 seconds. Last failure message:
 strings.forall({
      ((elem: Any) => DirectKafkaStreamSuite.collectedData.contains(elem))
   }) was false. (DirectKafkaStreamSuite.scala:249)

 On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin r...@databricks.com wrote:
 Please vote on releasing the following candidate as Apache Spark version
 1.5.0!

 The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a
 majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.5.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/


 The tag to be voted on is v1.5.0-rc1:
 https://github.com/apache/spark/tree/4c56ad772637615cc1f4f88d619fac6c372c8552

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-bin/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 

Re: Paring down / tagging tests (or some other way to avoid timeouts)?

2015-08-25 Thread Patrick Wendell
There is already code in place that restricts which tests run
depending on which code is modified. However, changes inside of
Spark's core currently require running all dependent tests. If you
have some ideas about how to improve that heuristic, it would be
great.

- Patrick

On Tue, Aug 25, 2015 at 1:33 PM, Marcelo Vanzin van...@cloudera.com wrote:
 Hello y'all,

 So I've been getting kinda annoyed with how many PR tests have been
 timing out. I took one of the logs from one of my PRs and started to
 do some crunching on the data from the output, and here's a list of
 the 5 slowest suites:

 307.14s HiveSparkSubmitSuite
 382.641s VersionsSuite
 398s CliSuite
 410.52s HashJoinCompatibilitySuite
 2508.61s HiveCompatibilitySuite

 Looking at those, I'm not surprised at all that we see so many
 timeouts. Is there any ongoing effort to trim down those tests
 (especially HiveCompatibilitySuite) or somehow restrict when they're
 run?

 Almost 1 hour to run a single test suite that affects a rather
 isolated part of the code base looks a little excessive to me.

 --
 Marcelo




Re: Dataframe aggregation with Tungsten unsafe

2015-08-25 Thread Ulanov, Alexander
Thank you for the explanation. The size of the 100M data is ~1.4GB in memory
and each worker has 32GB of memory. There seems to be a lot of free memory
available. I wonder how Spark can hit GC with such a setup?

Reynold Xin r...@databricks.com


On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander
alexander.ula...@hp.com wrote:

It seems that there is a nice improvement with Tungsten enabled given that data
is persisted in memory 2x and 3x. However, the improvement is not that nice for
parquet, it is 1.5x. What’s interesting is that with Tungsten enabled, the
performance of in-memory data and parquet data aggregation is similar. Could
anyone comment on this? It seems counterintuitive to me.

Local performance was not as good as Reynold’s. I got around 1.5x, he had
5x. However, local mode is not interesting.


I think a large part of that is coming from the pressure created by JVM GC. 
Putting more data in-memory makes GC worse, unless GC is well tuned.







Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-25 Thread Doug Balog
It works for me in cluster mode. 
I’m running on Hortonworks 2.2.4.12 in secure mode with Hive 0.14
I built with

./make-distribution --tgz -Phive -Phive-thriftserver -Phbase-provided -Pyarn
-Phadoop-2.6

Doug



 On Aug 25, 2015, at 4:56 PM, Tom Graves tgraves...@yahoo.com.INVALID wrote:
 
 Anyone using HiveContext with secure Hive with Spark 1.5 and have it working?
 
 We have a non-standard version of hive, but it was pulling our hive jars and it's
 failing to authenticate. It could be something in our hive version, but I'm
 wondering if spark isn't forwarding credentials properly.
 
 Tom
 
 
 
 On Tuesday, August 25, 2015 1:56 PM, Tom Graves 
 tgraves...@yahoo.com.INVALID wrote:
 
 
 Is there a jira to update the sql hive docs? (Spark SQL and DataFrames -
 Spark 1.5.0 Documentation, on people.apache.org)
 
 It still says the default is 0.13.1, but the pom file builds with hive 1.2.1-spark.
 
 Tom
 
 
 
 On Monday, August 24, 2015 4:06 PM, Sandy Ryza sandy.r...@cloudera.com 
 wrote:
 
 
 I see that there's a 1.5.0-rc2 tag in GitHub now.  Is that the official RC2
 tag to start trying out?
 
 -Sandy
 
 On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen so...@cloudera.com wrote:
 PS Shixiong Zhu is correct that this one has to be fixed:
 https://issues.apache.org/jira/browse/SPARK-10168
 
 For example you can see assemblies like this are nearly empty:
 https://repository.apache.org/content/repositories/orgapachespark-1137/org/apache/spark/spark-streaming-flume-assembly_2.10/1.5.0-rc1/
 
 Just a publishing glitch but worth a few more eyes on.
 
 On Fri, Aug 21, 2015 at 5:28 PM, Sean Owen so...@cloudera.com wrote:
  Signatures, license, etc. look good. I'm getting some fairly
  consistent failures using Java 7 + Ubuntu 15 + -Pyarn -Phive
  -Phive-thriftserver -Phadoop-2.6 -- does anyone else see these? they
  are likely just test problems, but worth asking. Stack traces are at
  the end.
 
  There are currently 79 issues targeted for 1.5.0, of which 19 are
  bugs, of which 1 is a blocker. (1032 have been resolved for 1.5.0.)
  That's significantly better than at the last release. I presume a lot
  of what's still targeted is not critical and can now be
  untargeted/retargeted.
 
  It occurs to me that the flurry of planning that took place at the
  start of the 1.5 QA cycle a few weeks ago was quite helpful, and is
  the kind of thing that would be even more useful at the start of a
  release cycle. So would be great to do this for 1.6 in a few weeks.
  Indeed there are already 267 issues targeted for 1.6.0 -- a decent
  roadmap already.
 
 
  Test failures:
 
  Core
 
  - Unpersisting TorrentBroadcast on executors and driver in distributed
  mode *** FAILED ***
java.util.concurrent.TimeoutException: Can't find 2 executors before
  1 milliseconds elapsed
at 
  org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561)
at 
  org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313)
at 
  org.apache.spark.broadcast.BroadcastSuite.org$apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287)
at 
  org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply$mcV$sp(BroadcastSuite.scala:165)
at 
  org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
at 
  org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
at 
  org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
...
 
  Streaming
 
  - stop slow receiver gracefully *** FAILED ***
0 was not greater than 0 (StreamingContextSuite.scala:324)
 
  Kafka
 
  - offset recovery *** FAILED ***
The code passed to eventually never returned normally. Attempted 191
  times over 10.043196973 seconds. Last failure message:
  strings.forall({
  ((elem: Any) => DirectKafkaStreamSuite.collectedData.contains(elem))
}) was false. (DirectKafkaStreamSuite.scala:249)
 
  On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin r...@databricks.com wrote:
  Please vote on releasing the following candidate as Apache Spark version
  1.5.0!
 
  The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a
  majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.5.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see http://spark.apache.org/
 
 
  The tag to be voted on is v1.5.0-rc1:
  

Re: Dataframe aggregation with Tungsten unsafe

2015-08-25 Thread Reynold Xin
On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:



 It seems that there is a nice improvement with Tungsten enabled given that
 data is persisted in memory 2x and 3x. However, the improvement is not that
 nice for parquet, it is 1.5x. What’s interesting is that with Tungsten enabled,
 the performance of in-memory data and parquet data aggregation is similar.
 Could anyone comment on this? It seems counterintuitive to me.



 Local performance was not as good as Reynold’s. I got around 1.5x, he
 had 5x. However, local mode is not interesting.




I think a large part of that is coming from the pressure created by JVM GC.
Putting more data in-memory makes GC worse, unless GC is well tuned.


Re: Paring down / tagging tests (or some other way to avoid timeouts)?

2015-08-25 Thread Marcelo Vanzin
I chatted with Patrick briefly offline. It would be interesting to
know whether the scripts have some way of saying "run a smaller
version of certain tests" (e.g. by setting a system property that the
tests look at to decide what to run). That way, if there are no
changes under sql/, we could still run a small part of
HiveCompatibilitySuite, just not all of it. The reasoning being that
if a core change breaks something in Hive, it will probably break many
tests, not a specific one.
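Purely as a sketch of that idea (the property name and the notion of a fixed
query slice are illustrative, not anything that exists today):

    object TestSelection {
      // Honor a system property that asks for a reduced run, e.g.
      // -Dspark.test.smoke=true passed to the test JVM (name is made up).
      val smokeOnly: Boolean = sys.props.get("spark.test.smoke").exists(_.toBoolean)

      // Run only a small, fixed slice of the queries when only core changed.
      def selectQueries(all: Seq[String]): Seq[String] =
        if (smokeOnly) all.take(50) else all
    }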

On Tue, Aug 25, 2015 at 1:48 PM, Michael Armbrust
mich...@databricks.com wrote:
 I'd be okay skipping the HiveCompatibilitySuite for core-only changes.  They
 do often catch bugs in changes to catalyst or sql though.  Same for
 HashJoinCompatibilitySuite/VersionsSuite.

 HiveSparkSubmitSuite/CliSuite should probably stay, as they do test things
 like addJar that have been broken by core in the past.

 On Tue, Aug 25, 2015 at 1:40 PM, Patrick Wendell pwend...@gmail.com wrote:

 There is already code in place that restricts which tests run
 depending on which code is modified. However, changes inside of
 Spark's core currently require running all dependent tests. If you
 have some ideas about how to improve that heuristic, it would be
 great.

 - Patrick

 On Tue, Aug 25, 2015 at 1:33 PM, Marcelo Vanzin van...@cloudera.com
 wrote:
  Hello y'all,
 
  So I've been getting kinda annoyed with how many PR tests have been
  timing out. I took one of the logs from one of my PRs and started to
  do some crunching on the data from the output, and here's a list of
  the 5 slowest suites:
 
  307.14s HiveSparkSubmitSuite
  382.641s VersionsSuite
  398s CliSuite
  410.52s HashJoinCompatibilitySuite
  2508.61s HiveCompatibilitySuite
 
  Looking at those, I'm not surprised at all that we see so many
  timeouts. Is there any ongoing effort to trim down those tests
  (especially HiveCompatibilitySuite) or somehow restrict when they're
  run?
 
  Almost 1 hour to run a single test suite that affects a rather
  isolated part of the code base looks a little excessive to me.
 
  --
  Marcelo
 





-- 
Marcelo




Re: Spark builds: allow user override of project version at buildtime

2015-08-25 Thread Michael Armbrust
This isn't really answering the question, but for what it is worth, I
manage several different branches of Spark and publish custom named
versions regularly to an internal repository, and this is *much* easier
with SBT than with maven.  You can actually link the Spark SBT build into
an external SBT build and write commands that cross publish as needed.

For your case something as simple as: build/sbt 'set version in Global :=
"1.4.1-custom-string"' publish might do the trick.

On Tue, Aug 25, 2015 at 10:09 AM, Marcelo Vanzin van...@cloudera.com
wrote:

 On Tue, Aug 25, 2015 at 2:17 AM,  andrew.row...@thomsonreuters.com
 wrote:
  Then, if I wanted to do a build against a specific profile, I could also
  pass in a -Dspark.version=1.4.1-custom-string and have the output
 artifacts
  correctly named. The default behaviour should be the same. Child pom
 files
  would need to reference ${spark.version} in their parent section I think.
 
  Any objections to this?

 Have you tried it? My understanding is that no project does that
 because it doesn't work. To resolve properties you need to read the
 parent pom(s), and if there's a variable reference there, well, you
  can't do it. Chicken & egg.

 --
 Marcelo
