Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2
Thank you for the suggestions. Actually, this project has already been on spark-packages for one or two months, so I think what I need now is some promotion :P

2015-08-25 23:51 GMT+08:00 saurfang [via Apache Spark Developers List] ml-node+s1001551n1380...@n3.nabble.com: This is very cool. I also have an sbt plugin that automates some aspects of spark-submit, but with a slightly different goal: https://github.com/saurfang/sbt-spark-submit The hope there is to address the problem that one can have many Spark main functions in a single jar, and development often involves: change the code, sbt assembly, scp the jar to the cluster, run spark-submit with a fully qualified class name and additional application arguments. With my plugin, I'm able to capture all these steps in single, customizable sbt tasks that are easy to remember (and auto-complete in the sbt console), so you can have multiple sbt tasks corresponding to different main functions, sub-projects, and/or default arguments, and make the build/deploy/submit cycle straight through. Currently this works great for YARN, because YARN takes care of the jar upload and master URL discovery. I have long wanted to make my plugin work with spark-ec2 so I can upload the jar and infer the master URL programmatically. Thanks for sharing, and like Akhil said, it'd be nice to have it on spark-packages for discovery.
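The manual build/deploy/submit cycle that sbt-spark-submit collapses into one sbt task might look like the following sketch. Every path, host, and class name here is a placeholder, not taken from the thread:

```shell
# Illustrative manual cycle; all names below are placeholders.
sbt assembly                                    # build the fat jar locally
scp target/scala-2.10/myapp-assembly.jar \
    user@gateway:/tmp/myapp.jar                 # copy it to the cluster
ssh user@gateway \
    spark-submit --master yarn \
      --class com.example.MyMain \              # fully qualified main class
      /tmp/myapp.jar arg1 arg2                  # application arguments
```

With the plugin, each (main class, default arguments) combination becomes a single named sbt task instead.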
Re: Spark (1.2.0) submit fails with exception saying log directory already exists
This probably means your app is failing and the second attempt is hitting that issue. You may fix the directory already exists error by setting spark.eventLog.overwrite=true in your conf, but most probably that will just expose the actual error in your app. On Tue, Aug 25, 2015 at 9:37 AM, Varadhan, Jawahar varad...@yahoo.com.invalid wrote: Here is the error yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Log directory hdfs://Sandbox/user/spark/applicationHistory/application_1438113296105_0302 already exists!) I am using cloudera 5.3.2 with Spark 1.2.0 Any help is appreciated. Thanks Jay -- Marcelo - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
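The suggested workaround would be passed like this (the class and jar names are placeholders; note this only masks the underlying application failure on retry):

```shell
# Let the event log writer overwrite an existing log directory when the
# application is re-attempted. This can also go in conf/spark-defaults.conf
# as: spark.eventLog.overwrite true
spark-submit \
  --conf spark.eventLog.overwrite=true \
  --class com.example.MyApp \
  my-app.jar
```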
Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?
Final chance to fill out the survey! http://goo.gl/forms/erct2s6KRR I'm gonna close it to new responses tonight and send out a summary of the results. Nick On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: I'm planning to close the survey to further responses early next week. If you haven't chimed in yet, the link to the survey is here: http://goo.gl/forms/erct2s6KRR We already have some great responses, which you can view. I'll share a summary after the survey is closed. Cheers! Nick On Mon, Aug 17, 2015 at 11:09 AM Nicholas Chammas nicholas.cham...@gmail.com wrote: Howdy folks! I’m interested in hearing about what people think of spark-ec2 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the formal JIRA process. Your answers will all be anonymous and public. If the embedded form below doesn’t work for you, you can use this link to get the same survey: http://goo.gl/forms/erct2s6KRR Cheers! Nick
Spark (1.2.0) submit fails with exception saying log directory already exists
Here is the error: yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Log directory hdfs://Sandbox/user/spark/applicationHistory/application_1438113296105_0302 already exists!) I am using Cloudera 5.3.2 with Spark 1.2.0. Any help is appreciated. Thanks, Jay
Re: Spark builds: allow user override of project version at buildtime
On Tue, Aug 25, 2015 at 2:17 AM, andrew.row...@thomsonreuters.com wrote: Then, if I wanted to do a build against a specific profile, I could also pass in a -Dspark.version=1.4.1-custom-string and have the output artifacts correctly named. The default behaviour should be the same. Child pom files would need to reference ${spark.version} in their parent section, I think. Any objections to this? Have you tried it? My understanding is that no project does that because it doesn't work. To resolve properties you need to read the parent pom(s), and if there's a variable reference there, well, you can't do it. Chicken and egg. -- Marcelo
[VOTE] Release Apache Spark 1.5.0 (RC2)
Please vote on releasing the following candidate as Apache Spark version 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v1.5.0-rc2:
https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release (published as 1.5.0-rc2) can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1141/

The staging repository for this release (published as 1.5.0) can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1140/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/

=== How can I help test this release? ===
If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.

=== What justifies a -1 vote for this release? ===
This vote is happening towards the end of the 1.5 QA period, so -1 votes should only occur for significant regressions from 1.4. Bugs already present in 1.4, minor regressions, or bugs related to new features will not block this release.

=== What should happen to JIRA tickets still targeting 1.5.0? ===
1. It is OK for documentation patches to target 1.5.0 and still go into branch-1.5, since documentation will be packaged separately from the release.
2. New features for non-alpha modules should target 1.6+.
3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target version.

=== Major changes to help you focus your testing ===
As of today, Spark 1.5 contains more than 1000 commits from 220+ contributors. I've curated a list of important changes for 1.5. For the complete list, please refer to the Apache JIRA changelog.

RDD/DataFrame/SQL APIs
- New UDAF interface
- DataFrame hints for broadcast join
- expr function for turning a SQL expression into a DataFrame column
- Improved support for NaN values
- StructType now supports ordering
- TimestampType precision is reduced to 1us
- 100 new built-in expressions, including date/time, string, math
- memory and local disk only checkpointing

DataFrame/SQL Backend Execution
- Code generation on by default
- Improved join, aggregation, shuffle, sorting with cache friendly algorithms and external algorithms
- Improved window function performance
- Better metrics instrumentation and reporting for DF/SQL execution plans

Data Sources, Hive, Hadoop, Mesos and Cluster Management
- Dynamic allocation support in all resource managers (Mesos, YARN, Standalone)
- Improved Mesos support (framework authentication, roles, dynamic allocation, constraints)
- Improved YARN support (dynamic allocation with preferred locations)
- Improved Hive support (metastore partition pruning, metastore connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
- Support persisting data in Hive compatible format in metastore
- Support data partitioning for JSON data sources
- Parquet improvements (upgrade to 1.7, predicate pushdown, faster metadata discovery and schema merging, support reading non-standard legacy Parquet files generated by other libraries)
- Faster and more robust dynamic partition insert
- DataSourceRegister interface for external data sources to specify short names

SparkR
- YARN cluster mode in R
- GLMs with R formula, binomial/Gaussian families, and elastic-net regularization
- Improved error messages
- Aliases to make DataFrame functions more R-like

Streaming
- Backpressure for handling bursty input streams
- Improved Python support for streaming sources (Kafka offsets, Kinesis, MQTT, Flume)
- Improved Python streaming machine learning algorithms (K-Means, linear regression, logistic regression)
- Native reliable Kinesis stream support
- Input metadata like Kafka offsets made visible in the batch details UI
- Better load balancing and scheduling of receivers across cluster
- Include streaming storage in web UI

Machine Learning and Advanced Analytics
- Feature transformers: CountVectorizer, Discrete Cosine transformation, MinMaxScaler, NGram, PCA, RFormula,
RE: Dataframe aggregation with Tungsten unsafe
Hi Reynold and others, I agree with your comments on mid-tenured objects and GC. In fact, dealing with mid-tenured objects is the major challenge for all Java GC implementations. I am wondering if anyone has played with the -XX:+PrintTenuringDistribution flag and seen what the age distribution looks like when your program runs. My output with -XX:+PrintGCDetails looks like the one below (Oracle JDK 8 update 60, http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html). Ages 1-5 are the young objects; 13, 14, 15 are the old ones. The objects in the middle will be copied multiple times before they die, and cleaning them out of the old regions normally requires a major GC.

Desired survivor size 2583691264 bytes, new threshold 15 (max 15)
- age  1: 13474960 bytes, 13474960 total
- age  2:  2815592 bytes, 16290552 total
- age  3:   632784 bytes, 16923336 total
- age  4:   428432 bytes, 17351768 total
- age  5:   648696 bytes, 18000464 total
- age  6:   572328 bytes, 18572792 total
- age  7:   549216 bytes, 19122008 total
- age  8:   539544 bytes, 19661552 total
- age  9:   422256 bytes, 20083808 total
- age 10:   552928 bytes, 20636736 total
- age 11:   430464 bytes, 21067200 total
- age 12:   753320 bytes, 21820520 total
- age 13:   230864 bytes, 22051384 total
- age 14:   276288 bytes, 22327672 total
- age 15:   809272 bytes, 23136944 total

I'd love to see what other people's object age distributions look like. Once we know the age distribution for some particular use cases, we can find ways to avoid full GC. Full GC is expensive because both the CMS and G1 full GCs are single threaded; GC tuning nowadays has become a task of just trying to avoid full GC completely. Thanks -yanping

From: Reynold Xin [mailto:r...@databricks.com] Sent: Tuesday, August 25, 2015 6:05 PM To: Ulanov, Alexander Cc: dev@spark.apache.org Subject: Re: Dataframe aggregation with Tungsten unsafe

There is a lot of GC activity due to the non-code-gen path being sloppy about garbage creation.
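For anyone wanting to collect the same histogram from a Spark job, one way (a sketch, not from the original mail; the app class and jar are placeholders) is to add the flags to the executor JVM options and read the executor GC logs:

```shell
# Illustrative: enable GC and tenuring-distribution logging on executors.
# Output like the age table above lands in each executor's stderr/GC log.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintTenuringDistribution" \
  --class com.example.MyApp \
  my-app.jar
```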
This is not actually what happens, but just as an example: rdd.map { i: Int => i + 1 } Under the hood this becomes a closure that boxes on every input and every output, creating two extra objects. The reality is more complicated than this -- but here's a simpler view of what happens with GC in these cases. You might've heard from other places that the JVM is very efficient about transient object allocations. That is true when you look at these allocations in isolation, but unfortunately not true when you look at them in aggregate. First, due to the way the iterator interface is constructed, it is hard for the JIT compiler to on-stack allocate these objects. Then two things happen: 1. They pile up and cause more young gen GCs to happen. 2. After a few young gen GCs, some mid-tenured objects (e.g. an aggregation map) get copied into the old gen, and eventually require a full GC to free them. Full GCs are much more expensive than young gen GCs (they usually involve copying all the data in the old gen). So the more garbage that is created, the more frequently full GC happens. The more long-lived objects in the old gen (e.g. cache), the more expensive full GC is.

On Tue, Aug 25, 2015 at 5:19 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Thank you for the explanation. The size of the 100M data is ~1.4GB in memory and each worker has 32GB of memory. That seems to be a lot of free memory available. I wonder how Spark can hit GC with such a setup? Reynold Xin r...@databricks.com On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander alexander.ula...@hp.com wrote: It seems that there is a nice improvement with Tungsten enabled given that data is persisted in memory 2x and 3x. However, the improvement is not that nice for parquet; it is 1.5x.
What's interesting is that, with Tungsten enabled, the performance of in-memory data and parquet data aggregation is similar. Could anyone comment on this? It seems counterintuitive to me. Local performance was not as good as Reynold's: I got around 1.5x, he had 5x. However, local mode is not that interesting. I think a large part of that is coming from the pressure created by JVM GC. Putting more data in memory makes GC worse, unless GC is well tuned.
Re: [VOTE] Release Apache Spark 1.5.0 (RC1)
Is there a jira to update the SQL Hive docs (Spark SQL and DataFrames - Spark 1.5.0 Documentation)? It still says the default is 0.13.1, but the pom file builds with hive 1.2.1-spark. Tom

On Monday, August 24, 2015 4:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote: I see that there's a 1.5.0-rc2 tag in github now. Is that the official RC2 tag to start trying out? -Sandy

On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen so...@cloudera.com wrote: PS Shixiong Zhu is correct that this one has to be fixed: https://issues.apache.org/jira/browse/SPARK-10168 For example you can see assemblies like this are nearly empty: https://repository.apache.org/content/repositories/orgapachespark-1137/org/apache/spark/spark-streaming-flume-assembly_2.10/1.5.0-rc1/ Just a publishing glitch, but worth a few more eyes on it.

On Fri, Aug 21, 2015 at 5:28 PM, Sean Owen so...@cloudera.com wrote: Signatures, license, etc. look good. I'm getting some fairly consistent failures using Java 7 + Ubuntu 15 + -Pyarn -Phive -Phive-thriftserver -Phadoop-2.6 -- does anyone else see these? They are likely just test problems, but worth asking. Stack traces are at the end. There are currently 79 issues targeted for 1.5.0, of which 19 are bugs, of which 1 is a blocker. (1032 have been resolved for 1.5.0.) That's significantly better than at the last release. I presume a lot of what's still targeted is not critical and can now be untargeted/retargeted. It occurs to me that the flurry of planning that took place at the start of the 1.5 QA cycle a few weeks ago was quite helpful, and is the kind of thing that would be even more useful at the start of a release cycle. So it would be great to do this for 1.6 in a few weeks.
Indeed there are already 267 issues targeted for 1.6.0 -- a decent roadmap already.

Test failures:

Core
- Unpersisting TorrentBroadcast on executors and driver in distributed mode *** FAILED ***
  java.util.concurrent.TimeoutException: Can't find 2 executors before 1 milliseconds elapsed
  at org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561)
  at org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313)
  at org.apache.spark.broadcast.BroadcastSuite.org$apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287)
  at org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply$mcV$sp(BroadcastSuite.scala:165)
  at org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
  at org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
  at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...

Streaming
- stop slow receiver gracefully *** FAILED ***
  0 was not greater than 0 (StreamingContextSuite.scala:324)

Kafka
- offset recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 191 times over 10.043196973 seconds. Last failure message: strings.forall({ ((elem: Any) => DirectKafkaStreamSuite.collectedData.contains(elem)) }) was false. (DirectKafkaStreamSuite.scala:249)

On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin r...@databricks.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.5.0! The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v1.5.0-rc1:
https://github.com/apache/spark/tree/4c56ad772637615cc1f4f88d619fac6c372c8552

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1137/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc1-docs/

== How can I help test this release? ==
If you are a Spark user, you can help us test this release
Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2
You can add it to Spark Packages, I guess: http://spark-packages.org/ Thanks Best Regards

On Fri, Aug 14, 2015 at 1:45 PM, pishen tsai pishe...@gmail.com wrote: Sorry for the previous line-breaking format; let me resend the mail. I have written an sbt plugin called spark-deployer, which is able to deploy a standalone spark cluster on AWS EC2 and submit jobs to it. https://github.com/pishen/spark-deployer Compared to the current spark-ec2 script, this design may have several benefits (features):
1. All the code is written in Scala.
2. Just add one line to your project/plugins.sbt and you are ready to go. (You don't have to download the python code and store it someplace.)
3. The whole development flow (write code for the spark job, compile the code, launch the cluster, assemble and submit the job to the master, terminate the cluster when the job is finished) can be done in sbt.
4. Supports parallel deployment of the worker machines via Scala's Futures.
5. Allows dynamically adding or removing worker machines to/from the current cluster.
6. All the configurations are stored in a Typesafe config file. You don't need to store them elsewhere and map the settings onto spark-ec2's command-line arguments.
7. The core library is separated from the sbt plugin, hence it's possible to execute the deployment from an environment without sbt (only a JVM is required).
8. Supports an adjustable EC2 root disk size, custom security groups, custom AMIs (it can run on the default Amazon AMI), custom spark tarballs, and VPC. (Well, most of these are also supported in spark-ec2 in slightly different forms; just mentioning them anyway.)
Since this project is still in its early stage, it lacks some features of spark-ec2 such as self-installed HDFS (we use S3 directly), stoppable clusters, ganglia, and the copy script. However, it's already usable for our company and we are trying to move our production spark projects from spark-ec2 to spark-deployer. Any suggestions, testing help, or pull requests are highly appreciated.
On top of that, I would like to contribute this project to Spark, maybe as another choice (a suggested link) alongside spark-ec2 in Spark's official documentation. Of course, before that, I have to make this project stable enough (strange errors just happen with the AWS API from time to time). I'm wondering if this kind of contribution is possible, and whether there is any rule to follow or anyone to contact. (Maybe the source code will not be merged into spark's main repository, since I've noticed that spark-ec2 is also planning to move out.) Regards, Pishen Tsai
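As a sketch of the workflow described above -- the plugin coordinates and task names here are placeholders, not spark-deployer's actual API; see the project's README for the real ones:

```shell
# Hypothetical session; artifact coordinates and task names are illustrative.
echo 'addSbtPlugin("net.pishen" % "spark-deployer-sbt" % "x.y.z")' >> project/plugins.sbt
sbt createCluster     # launch a standalone master + workers on EC2
sbt submitJob         # assemble the jar and submit it to the launched master
sbt destroyCluster    # terminate the machines when the job is done
```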
Spark builds: allow user override of project version at buildtime
I've got an interesting challenge in building Spark. For various reasons we do a few different builds of spark, typically with a few different profile options (e.g. against different versions of Hadoop, some with/without Hive, etc.). We mirror the spark repo internally and have a build server that builds and publishes different Spark versions to an artifactory server. The problem is that the output of each build is published with the version that is in the pom.xml file - a build of Spark @tags/v1.4.1 always comes out with an artifact version of '1.4.1'. However, because we may have three different Spark builds for 1.4.1, it'd be useful to be able to override this version at build time, so that we can publish 1.4.1, 1.4.1-cdh5.3.3 and maybe 1.4.1-cdh5.3.3-hive as separate artifacts. My understanding of maven is that the /project/version value in the pom.xml isn't overridable. At the moment, I've hacked around this by having a pre-build task that rewrites the various pom files and adjusts the version to a string that's correct for that particular build. Would it be useful to instead populate the version from a maven property, which could then be overridable on the CLI? Something like:

<project>
  <version>${spark.version}</version>
  <properties>
    <spark.version>1.4.1</spark.version>
  </properties>
</project>

Then, if I wanted to do a build against a specific profile, I could also pass in a -Dspark.version=1.4.1-custom-string and have the output artifacts correctly named. The default behaviour should be the same. Child pom files would need to reference ${spark.version} in their parent section, I think. Any objections to this? Andrew
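The proposed invocation would then be something like the following (hypothetical; as the reply in this thread points out, stock Maven cannot resolve a parent version from a property, so this sketch assumes the pom change above actually worked):

```shell
# Hypothetical build-time override of the published artifact version.
mvn -Dspark.version=1.4.1-cdh5.3.3 -Phadoop-2.6 -Phive -DskipTests clean install

# Default build keeps the version from the <properties> block:
mvn -DskipTests clean install
```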
Paring down / tagging tests (or some other way to avoid timeouts)?
Hello y'all, So I've been getting kinda annoyed with how many PR tests have been timing out. I took one of the logs from one of my PRs and started to do some crunching on the data from the output, and here's a list of the 5 slowest suites:

307.14s   HiveSparkSubmitSuite
382.641s  VersionsSuite
398s      CliSuite
410.52s   HashJoinCompatibilitySuite
2508.61s  HiveCompatibilitySuite

Looking at those, I'm not surprised at all that we see so many timeouts. Is there any ongoing effort to trim down those tests (especially HiveCompatibilitySuite) or somehow restrict when they're run? Almost 1 hour to run a single test suite that affects a rather isolated part of the code base looks a little excessive to me. -- Marcelo
Re: Paring down / tagging tests (or some other way to avoid timeouts)?
I'd be okay skipping the HiveCompatibilitySuite for core-only changes. They do often catch bugs in changes to catalyst or sql, though. Same for HashJoinCompatibilitySuite/VersionsSuite. HiveSparkSubmitSuite/CliSuite should probably stay, as they do test things like addJar that have been broken by core in the past.

On Tue, Aug 25, 2015 at 1:40 PM, Patrick Wendell pwend...@gmail.com wrote: There is already code in place that restricts which tests run depending on which code is modified. However, changes inside of Spark's core currently require running all dependent tests. If you have some ideas about how to improve that heuristic, it would be great. - Patrick
Re: [VOTE] Release Apache Spark 1.5.0 (RC1)
Anyone using HiveContext with secure Hive with Spark 1.5 and have it working? We have a non-standard version of hive, but it was pulling our hive jars and failing to authenticate. It could be something in our hive version, but I'm wondering if spark isn't forwarding credentials properly. Tom

On Tuesday, August 25, 2015 1:56 PM, Tom Graves tgraves...@yahoo.com.INVALID wrote: Is there a jira to update the sql hive docs? It still says the default is 0.13.1 but the pom file builds with hive 1.2.1-spark. Tom
Re: Paring down / tagging tests (or some other way to avoid timeouts)?
There is already code in place that restricts which tests run depending on which code is modified. However, changes inside of Spark's core currently require running all dependent tests. If you have some ideas about how to improve that heuristic, it would be great. - Patrick
Re: Dataframe aggregation with Tungsten unsafe
Thank you for the explanation. The size of the 100M data is ~1.4GB in memory and each worker has 32GB of memory, so there seems to be a lot of free memory available. I wonder how Spark can hit GC problems with such a setup?

From: Reynold Xin r...@databricks.com

On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander alexander.ula...@hp.com wrote:
It seems that there is a nice improvement with Tungsten enabled, given that data is persisted in memory: 2x and 3x. However, the improvement is not that nice for Parquet: 1.5x. What's interesting, with Tungsten enabled the performance of in-memory data and Parquet data aggregation is similar. Could anyone comment on this? It seems counterintuitive to me. Local performance was not as good as Reynold had: I have around 1.5x, he had 5x. However, local mode is not interesting.

I think a large part of that is coming from the pressure created by JVM GC. Putting more data in memory makes GC worse, unless GC is well tuned.
Re: [VOTE] Release Apache Spark 1.5.0 (RC1)
It works for me in cluster mode. I'm running on Hortonworks 2.2.4.12 in secure mode with Hive 0.14. I built with:

./make-distribution.sh --tgz -Phive -Phive-thriftserver -Phbase-provided -Pyarn -Phadoop-2.6

Doug

On Aug 25, 2015, at 4:56 PM, Tom Graves tgraves...@yahoo.com.INVALID wrote:
Anyone using HiveContext with secure Hive with Spark 1.5 and have it working? We have a non-standard version of Hive, but it was pulling our Hive jars and it's failing to authenticate. It could be something in our Hive version, but I'm wondering if Spark isn't forwarding credentials properly.

Tom

On Tuesday, August 25, 2015 1:56 PM, Tom Graves tgraves...@yahoo.com.INVALID wrote:
Is there a jira to update the SQL/Hive docs (Spark SQL and DataFrames - Spark 1.5.0 Documentation)? It still says the default is 0.13.1, but the pom file builds with hive 1.2.1-spark.

Tom

On Monday, August 24, 2015 4:06 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
I see that there's a 1.5.0-rc2 tag in github now. Is that the official RC2 tag to start trying out?

-Sandy

On Mon, Aug 24, 2015 at 7:23 AM, Sean Owen so...@cloudera.com wrote:
PS Shixiong Zhu is correct that this one has to be fixed: https://issues.apache.org/jira/browse/SPARK-10168
For example, you can see assemblies like this are nearly empty: https://repository.apache.org/content/repositories/orgapachespark-1137/org/apache/spark/spark-streaming-flume-assembly_2.10/1.5.0-rc1/
Just a publishing glitch, but worth a few more eyes on.

On Fri, Aug 21, 2015 at 5:28 PM, Sean Owen so...@cloudera.com wrote:
Signatures, license, etc. look good. I'm getting some fairly consistent failures using Java 7 + Ubuntu 15 + -Pyarn -Phive -Phive-thriftserver -Phadoop-2.6 -- does anyone else see these?
They are likely just test problems, but worth asking. Stack traces are at the end.

There are currently 79 issues targeted for 1.5.0, of which 19 are bugs, of which 1 is a blocker. (1032 have been resolved for 1.5.0.) That's significantly better than at the last release. I presume a lot of what's still targeted is not critical and can now be untargeted/retargeted.

It occurs to me that the flurry of planning that took place at the start of the 1.5 QA cycle a few weeks ago was quite helpful, and is the kind of thing that would be even more useful at the start of a release cycle. So it would be great to do this for 1.6 in a few weeks. Indeed, there are already 267 issues targeted for 1.6.0 -- a decent roadmap already.

Test failures:

Core - Unpersisting TorrentBroadcast on executors and driver in distributed mode *** FAILED ***
  java.util.concurrent.TimeoutException: Can't find 2 executors before 1 milliseconds elapsed
  at org.apache.spark.ui.jobs.JobProgressListener.waitUntilExecutorsUp(JobProgressListener.scala:561)
  at org.apache.spark.broadcast.BroadcastSuite.testUnpersistBroadcast(BroadcastSuite.scala:313)
  at org.apache.spark.broadcast.BroadcastSuite.org$apache$spark$broadcast$BroadcastSuite$$testUnpersistTorrentBroadcast(BroadcastSuite.scala:287)
  at org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply$mcV$sp(BroadcastSuite.scala:165)
  at org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
  at org.apache.spark.broadcast.BroadcastSuite$$anonfun$16.apply(BroadcastSuite.scala:165)
  at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
Streaming - stop slow receiver gracefully *** FAILED ***
  0 was not greater than 0 (StreamingContextSuite.scala:324)

Kafka - offset recovery *** FAILED ***
  The code passed to eventually never returned normally. Attempted 191 times over 10.043196973 seconds. Last failure message: strings.forall({ ((elem: Any) => DirectKafkaStreamSuite.collectedData.contains(elem)) }) was false. (DirectKafkaStreamSuite.scala:249)

On Fri, Aug 21, 2015 at 5:37 AM, Reynold Xin r...@databricks.com wrote:
Please vote on releasing the following candidate as Apache Spark version 1.5.0! The vote is open until Monday, Aug 17, 2015 at 20:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.5.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v1.5.0-rc1:
Re: Dataframe aggregation with Tungsten unsafe
On Fri, Aug 21, 2015 at 11:07 AM, Ulanov, Alexander alexander.ula...@hp.com wrote:
It seems that there is a nice improvement with Tungsten enabled, given that data is persisted in memory: 2x and 3x. However, the improvement is not that nice for Parquet: 1.5x. What's interesting, with Tungsten enabled the performance of in-memory data and Parquet data aggregation is similar. Could anyone comment on this? It seems counterintuitive to me. Local performance was not as good as Reynold had: I have around 1.5x, he had 5x. However, local mode is not interesting.

I think a large part of that is coming from the pressure created by JVM GC. Putting more data in memory makes GC worse, unless GC is well tuned.
Re: Paring down / tagging tests (or some other way to avoid timeouts)?
I chatted with Patrick briefly offline. It would be interesting to know whether the scripts have some way of saying "run a smaller version of certain tests" (e.g. by setting a system property that the tests look at to decide what to run). That way, if there are no changes under sql/, we could still run a small part of HiveCompatibilitySuite, just not all of it. The reasoning is that if a core change breaks something in Hive, it will probably break many tests, not one specific test.

On Tue, Aug 25, 2015 at 1:48 PM, Michael Armbrust mich...@databricks.com wrote:
I'd be okay skipping the HiveCompatibilitySuite for core-only changes. They do often catch bugs in changes to Catalyst or SQL, though. Same for HashJoinCompatibilitySuite/VersionsSuite. HiveSparkSubmitSuite/CliSuite should probably stay, as they do test things like addJar that have been broken by core in the past.

On Tue, Aug 25, 2015 at 1:40 PM, Patrick Wendell pwend...@gmail.com wrote:
There is already code in place that restricts which tests run depending on which code is modified. However, changes inside Spark's core currently require running all dependent tests. If you have some ideas about how to improve that heuristic, it would be great.

- Patrick

On Tue, Aug 25, 2015 at 1:33 PM, Marcelo Vanzin van...@cloudera.com wrote:
Hello y'all,

So I've been getting kinda annoyed with how many PR tests have been timing out. I took the log from one of my PRs and did some crunching on the data in the output; here's a list of the 5 slowest suites:

307.14s  HiveSparkSubmitSuite
382.641s VersionsSuite
398s     CliSuite
410.52s  HashJoinCompatibilitySuite
2508.61s HiveCompatibilitySuite

Looking at those, I'm not surprised at all that we see so many timeouts. Is there any ongoing effort to trim down those tests (especially HiveCompatibilitySuite) or somehow restrict when they're run?
Almost 1 hour to run a single test suite that affects a rather isolated part of the code base looks a little excessive to me.

-- Marcelo
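The path-based restriction Patrick mentions could, in spirit, look like the sketch below. The path patterns and module names are assumptions for illustration, not the actual dev/run-tests heuristic, and a real implementation would take the changed paths from `git diff --name-only`:

```shell
# Map a changed file path to the test modules that should run for it.
modules_for() {
  case "$1" in
    sql/*|*/hive/*) echo "sql hive" ;;  # SQL changes exercise the Hive suites
    core/*)         echo "core" ;;      # core-only change: could skip HiveCompatibilitySuite
    *)              echo "all" ;;       # unknown area: run everything to be safe
  esac
}

modules_for "core/src/main/scala/org/apache/spark/rdd/RDD.scala"
```

The open question in the thread is whether the `core/*` branch is safe, since core changes have broken Hive-side tests like addJar before.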
Re: Spark builds: allow user override of project version at buildtime
This isn't really answering the question, but for what it is worth, I manage several different branches of Spark and publish custom-named versions regularly to an internal repository, and this is *much* easier with SBT than with Maven. You can actually link the Spark SBT build into an external SBT build and write commands that cross-publish as needed. For your case, something as simple as

build/sbt 'set version in Global := "1.4.1-custom-string"' publish

might do the trick.

On Tue, Aug 25, 2015 at 10:09 AM, Marcelo Vanzin van...@cloudera.com wrote:
On Tue, Aug 25, 2015 at 2:17 AM, andrew.row...@thomsonreuters.com wrote:
Then, if I wanted to do a build against a specific profile, I could also pass in a -Dspark.version=1.4.1-custom-string and have the output artifacts correctly named. The default behaviour should be the same. Child pom files would need to reference ${spark.version} in their parent section, I think. Any objections to this?

Have you tried it? My understanding is that no project does that because it doesn't work. To resolve properties you need to read the parent pom(s), and if there's a variable reference there, well, you can't do it. Chicken and egg.

-- Marcelo
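The quoting in that sbt invocation is easy to get wrong, so here is a sketch that only assembles and prints the command rather than running it (running it requires a Spark checkout; the version string is a placeholder):

```shell
# Assemble the cross-publish command with an overridden version.
# Single quotes protect the sbt "set" expression from the shell;
# escaped double quotes survive inside it for the Scala string literal.
SPARK_VERSION="1.4.1-custom-string"
SBT_CMD="build/sbt 'set version in Global := \"$SPARK_VERSION\"' publish"
echo "$SBT_CMD"   # inspect first; run from a Spark checkout with: eval "$SBT_CMD"
```

Printing before executing makes it easy to confirm the quoting reached sbt intact.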