Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-25 Thread pishen
Thank you for the suggestions. Actually, this project has already been on spark-packages for 1~2 months, so I think what I need is some promotion :P 2015-08-25 23:51 GMT+08:00 saurfang [via Apache Spark Developers List] ml-node+s1001551n1380...@n3.nabble.com: This is very cool. I also have a sbt

Re: Spark (1.2.0) submit fails with exception saying log directory already exists

2015-08-25 Thread Marcelo Vanzin
This probably means your app is failing and the second attempt is hitting that issue. You may fix the directory already exists error by setting spark.eventLog.overwrite=true in your conf, but most probably that will just expose the actual error in your app. On Tue, Aug 25, 2015 at 9:37 AM,
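
A minimal sketch of the setting mentioned above, assuming the job is launched through spark-submit and the property is passed with its --conf flag. The helper name, jar name, and main class are hypothetical placeholders, not taken from the thread.

```python
def build_submit_command(app_jar, main_class, overwrite_event_log=True):
    """Return a spark-submit argument list (hypothetical helper for illustration)."""
    cmd = ["spark-submit", "--class", main_class]
    if overwrite_event_log:
        # Property named in the reply above; lets a retried attempt reuse the log directory.
        cmd += ["--conf", "spark.eventLog.overwrite=true"]
    cmd.append(app_jar)
    return cmd


# Placeholder jar and class names, for illustration only.
print(" ".join(build_submit_command("my-app.jar", "com.example.Main")))
```

As the reply notes, this most likely only removes the secondary error, so the first attempt's logs are still the place to look for the real failure.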

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-25 Thread Nicholas Chammas
Final chance to fill out the survey! http://goo.gl/forms/erct2s6KRR I'm gonna close it to new responses tonight and send out a summary of the results. Nick On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: I'm planning to close the survey to further responses

Spark (1.2.0) submit fails with exception saying log directory already exists

2015-08-25 Thread Varadhan, Jawahar
Here is the error: yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Log directory hdfs://Sandbox/user/spark/applicationHistory/application_1438113296105_0302 already exists!) I am using Cloudera 5.3.2 with Spark 1.2.0. Any help is appreciated.

Re: Spark builds: allow user override of project version at buildtime

2015-08-25 Thread Marcelo Vanzin
On Tue, Aug 25, 2015 at 2:17 AM, andrew.row...@thomsonreuters.com wrote: Then, if I wanted to do a build against a specific profile, I could also pass in a -Dspark.version=1.4.1-custom-string and have the output artifacts correctly named. The default behaviour should be the same. Child pom

[VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-25 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.5.0 [ ] -1 Do not release this package because ...

RE: Dataframe aggregation with Tungsten unsafe

2015-08-25 Thread Wang, Yanping
Hi Reynold and others, I agree with your comments on mid-tenured objects and GC. In fact, dealing with mid-tenured objects is the major challenge for all Java GC implementations. I am wondering if anyone has played with the -XX:+PrintTenuringDistribution flag to see how exactly ages distribution

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-25 Thread Tom Graves
Is there a jira to update the sql hive docs? Spark SQL and DataFrames - Spark 1.5.0 Documentation

Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-25 Thread Akhil Das
You can add it to Spark Packages, I guess: http://spark-packages.org/ Thanks Best Regards On Fri, Aug 14, 2015 at 1:45 PM, pishen tsai pishe...@gmail.com wrote: Sorry for the previous line-breaking format; trying to resend the mail. I have written a sbt plugin called spark-deployer, which

Spark builds: allow user override of project version at buildtime

2015-08-25 Thread andrew.rowson
I've got an interesting challenge in building Spark. For various reasons we do a few different builds of spark, typically with a few different profile options (e.g. against different versions of Hadoop, some with/without Hive etc.). We mirror the spark repo internally and have a buildserver that

Paring down / tagging tests (or some other way to avoid timeouts)?

2015-08-25 Thread Marcelo Vanzin
Hello y'all, So I've been getting kinda annoyed with how many PR tests have been timing out. I took one of the logs from one of my PRs and started to do some crunching on the data from the output, and here's a list of the 5 slowest suites: 307.14s HiveSparkSubmitSuite 382.641s VersionsSuite 398s

Re: Paring down / tagging tests (or some other way to avoid timeouts)?

2015-08-25 Thread Michael Armbrust
I'd be okay skipping the HiveCompatibilitySuite for core-only changes. They do often catch bugs in changes to catalyst or sql though. Same for HashJoinCompatibilitySuite/VersionsSuite. HiveSparkSubmitSuite/CliSuite should probably stay, as they do test things like addJar that have been broken by

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-25 Thread Tom Graves
Is anyone using HiveContext with secure Hive on Spark 1.5 and has it working? We have a non-standard version of Hive but were pulling our Hive jars, and it's failing to authenticate. It could be something in our Hive version, but I'm wondering if Spark isn't forwarding credentials properly. Tom

Re: Paring down / tagging tests (or some other way to avoid timeouts)?

2015-08-25 Thread Patrick Wendell
There is already code in place that restricts which tests run depending on which code is modified. However, changes inside of Spark's core currently require running all dependent tests. If you have some ideas about how to improve that heuristic, it would be great. - Patrick On Tue, Aug 25, 2015

Re: Dataframe aggregation with Tungsten unsafe

2015-08-25 Thread Ulanov, Alexander
Thank you for the explanation. The size of the 100M data is ~1.4GB in memory and each worker has 32GB of memory. There seems to be a lot of free memory available. I wonder how Spark can hit GC with such a setup? Reynold Xin r...@databricks.com On Fri, Aug 21, 2015 at

Re: [VOTE] Release Apache Spark 1.5.0 (RC1)

2015-08-25 Thread Doug Balog
It works for me in cluster mode. I’m running on Hortonworks 2.2.4.12 in secure mode with Hive 0.14. I built with ./make-distribution --tgz -Phive -Phive-thriftserver -Phbase-provided -Pyarn -Phadoop-2.6 Doug On Aug 25, 2015, at 4:56 PM, Tom Graves tgraves...@yahoo.com.INVALID wrote:

Re: Dataframe aggregation with Tungsten unsafe

2015-08-25 Thread Reynold Xin
On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: It seems that there is a nice improvement with Tungsten enabled given that data is persisted in memory 2x and 3x. However, the improvement is not that nice for parquet, it is 1.5x. What’s interesting, with

Re: Paring down / tagging tests (or some other way to avoid timeouts)?

2015-08-25 Thread Marcelo Vanzin
I chatted with Patrick briefly offline. It would be interesting to know whether the scripts have some way of saying run a smaller version of certain tests (e.g. by setting a system property that the tests look at to decide what to run). That way, if there are no changes under sql/, we could still
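
The idea of a switch that tests consult before deciding what to run can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's unittest and an environment variable as a stand-in for the JVM system property discussed above, and the variable name RUN_SLOW_TESTS and both test classes are hypothetical.

```python
import os
import unittest


def slow_tests_enabled(env=os.environ):
    """Return True unless the (hypothetical) RUN_SLOW_TESTS switch is set to 'false'."""
    return env.get("RUN_SLOW_TESTS", "true").lower() != "false"


class QuickChecks(unittest.TestCase):
    """Cheap tests that always run."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)


@unittest.skipUnless(slow_tests_enabled(), "slow suites disabled via RUN_SLOW_TESTS")
class SlowCompatibilityChecks(unittest.TestCase):
    """Stand-in for an expensive suite that a build script may switch off."""

    def test_expensive(self):
        self.assertEqual(sum(range(1000)), 499500)
```

A build script could then export RUN_SLOW_TESTS=false when a change touches no files under the relevant module, and run the file with python -m unittest; the default keeps every suite enabled so nothing is skipped by accident.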

Re: Spark builds: allow user override of project version at buildtime

2015-08-25 Thread Michael Armbrust
This isn't really answering the question, but for what it is worth, I manage several different branches of Spark and publish custom named versions regularly to an internal repository, and this is *much* easier with SBT than with maven. You can actually link the Spark SBT build into an external