Re: Announcing the official Spark Job Server repo
Andy, doesn't Marathon handle fault tolerance amongst its apps? I.e., if you say that N instances of an app are running and one shuts off, then it spins up another one, no? The tricky thing was that I was planning to use Akka Cluster to coordinate, but Mesos itself can be used to coordinate as well, which is an overlap, and I didn't want to make job server HA reliant only on Mesos... Anyways, we can discuss offline if needed.

On Thu, Mar 20, 2014 at 1:35 AM, andy petrella andy.petre...@gmail.com wrote:

Heya, That's cool you've already hacked something for this in the scripts! I have a related question: how would it work, actually? I mean, to have this Job Server fault tolerant using Marathon, I would guess that it will need to be itself a Mesos framework, and able to publish its resource needs. And also, for that, the Job Server has to be aware of the resources needed by the Spark drivers that it will run, which is not as easy to guess, unless it is provided by the job itself? I haven't checked the Job Server deeply enough, so it might already be the case (or I'm expressing something completely dumb ^^). For sure, we'll try to share it when we reach the point of deploying using Marathon (planned for April). greetz, and again, nice work Evan! Ndi

On Wed, Mar 19, 2014 at 7:27 AM, Evan Chan e...@ooyala.com wrote:

Andy, Yeah, we've thought of deploying this on Marathon ourselves, but we're not sure how much Mesos we're going to use yet. (Indeed, if you look at bin/server_start.sh, I think I set up the PORT environment var specifically for Marathon.) This is also why we have deploy scripts which package everything into a .tar.gz, again for Mesos deployment. If you do try this, please let us know. :) -Evan

On Tue, Mar 18, 2014 at 3:57 PM, andy petrella andy.petre...@gmail.com wrote:

tad! That's awesome. A quick question: does anyone have insights regarding having such Job Servers deployed using Marathon on Mesos? I'm thinking about an architecture where Marathon would deploy and keep the Job Servers running, as part of the whole set of apps deployed on it, according to the resources needed (à la Jenkins). Any idea is welcome. Back to the news, Evan + Ooyala team: great job again. andy

On Tue, Mar 18, 2014 at 11:39 PM, Henry Saputra henry.sapu...@gmail.com wrote:

W00t! Thanks for releasing this, Evan. - Henry

On Tue, Mar 18, 2014 at 1:51 PM, Evan Chan e...@ooyala.com wrote:

Dear Spark developers, Ooyala is happy to announce that we have pushed our official, Spark 0.9.0 / Scala 2.10-compatible, job server as a GitHub repo: https://github.com/ooyala/spark-jobserver Complete with unit tests, deploy scripts, and examples. The original PR (#222) on incubator-spark is now closed. Please have a look; pull requests are very welcome.

-- Evan Chan, Staff Engineer, e...@ooyala.com
Re: Spark 0.9.1 release
I also have a really minor fix for SPARK-1057 (upgrading fastutil), could that also make it in? -Evan

On Sun, Mar 23, 2014 at 11:01 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

Sorry this request is coming in a bit late, but would it be possible to backport SPARK-979 [1] to branch-0.9? This is the patch for randomizing executor offers and I would like to use this in a release sooner rather than later. Thanks, Shivaram [1] https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e#diff-8ef3258646b0e6a4793d6ad99848eacd

On Thu, Mar 20, 2014 at 10:18 PM, Bhaskar Dutta bhas...@gmail.com wrote:

Thank you! We plan to test out 0.9.1 on YARN once it is out. Regards, Bhaskar

On Fri, Mar 21, 2014 at 12:42 AM, Tom Graves tgraves...@yahoo.com wrote:

I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YARN and [SPARK-1051] On Yarn, executors don't doAs as submitting user in. The pyspark one I would consider more of an enhancement so might not be appropriate for a point release.

On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com wrote:

It will be great if SPARK-1101 (Umbrella for hardening Spark on YARN, https://spark-project.atlassian.net/browse/SPARK-1101) can get into 0.9.1. Thanks, Bhaskar

On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

Hello everyone, Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly: https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed). Please let me know if there are fixes that were not backported but you would like to see them in 0.9.1. Thanks! TD

-- Evan Chan, Staff Engineer, e...@ooyala.com
Re: new Catalyst/SQL component merged into master
Hi Michael, Congrats, this is really neat! What thoughts do you have regarding adding indexing support and predicate pushdown to this SQL framework? Right now we have custom bitmap indexing to speed up queries, so we're really curious about the architectural direction. -Evan

On Fri, Mar 21, 2014 at 11:09 AM, Michael Armbrust mich...@databricks.com wrote:

It would be great if there are any examples or use cases to look at?

There are examples in the Spark documentation. Patrick posted an updated copy here so people can see them before 1.0 is released: http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html

Does this feature have different use cases than Shark, or is it cleaner since the Hive dependency is gone?

Depending on how you use this, there is still a dependency on Hive (by default this is not the case; see the above documentation for more details). However, the dependency is on a stock version of Hive instead of one modified by the AMPLab. Furthermore, Spark SQL has its own optimizer, instead of relying on the Hive optimizer. Long term, this is going to give us a lot more flexibility to optimize queries specifically for the Spark execution engine. We are actively porting over the best parts of Shark (specifically the in-memory columnar representation). Shark still has some features that are missing in Spark SQL, including SharkServer (and years of testing). Once Spark SQL graduates from Alpha status, it'll likely become the new backend for Shark.

-- Evan Chan, Staff Engineer, e...@ooyala.com
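For readers who want a concrete picture of the API being discussed, here is a minimal sketch along the lines of the examples in the linked guide. It targets the pre-1.0 alpha as documented there (SQLContext, registerAsTable, sql), so exact method names may differ from what finally shipped.

```scala
// Minimal sketch based on the linked sql-programming-guide preview (Spark SQL alpha).
// Names such as registerAsTable follow the pre-1.0 docs and may differ in later releases.
import org.apache.spark.{SparkConf, SparkContext}

case class Person(name: String, age: Int)

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("sql-sketch"))
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._ // implicit conversion from an RDD of case classes to a SchemaRDD

    // An ordinary RDD of case classes becomes a queryable table.
    val people = sc.parallelize(Seq(Person("Ann", 15), Person("Bob", 34)))
    people.registerAsTable("people")

    // The query is planned by Catalyst's own optimizer; no Hive metastore is involved.
    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.collect().foreach(println)

    sc.stop()
  }
}
```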
Re: Spark 0.9.1 release
@Tathagata, the PR is here: https://github.com/apache/spark/pull/215

On Mon, Mar 24, 2014 at 12:02 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

@Shivaram, That is a useful patch, but I am a bit afraid to merge it in. Randomizing the executor offers has performance implications, especially for Spark Streaming. The non-randomized ordering of allocating machines to tasks was subtly helping to speed up certain window-based shuffle operations. For example, corresponding shuffle partitions in multiple shuffles using the same partitioner were likely to be co-located; that is, shuffle partition 0 was likely to be on the same machine for multiple shuffles. While this is not a reliable mechanism to rely on, randomization may lead to performance degradation. So I am afraid to merge this one without understanding the consequences. @Evan, I have already cut a release! You can submit the PR and we can merge it into branch-0.9. If we have to cut another release, then we can include it.

[A short illustrative sketch of the same-partitioner co-location described above follows after this message.]

On Sun, Mar 23, 2014 at 11:42 PM, Evan Chan e...@ooyala.com wrote:

I also have a really minor fix for SPARK-1057 (upgrading fastutil), could that also make it in? -Evan

On Sun, Mar 23, 2014 at 11:01 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

Sorry this request is coming in a bit late, but would it be possible to backport SPARK-979 [1] to branch-0.9? This is the patch for randomizing executor offers and I would like to use this in a release sooner rather than later. Thanks, Shivaram [1] https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e#diff-8ef3258646b0e6a4793d6ad99848eacd

On Thu, Mar 20, 2014 at 10:18 PM, Bhaskar Dutta bhas...@gmail.com wrote:

Thank you! We plan to test out 0.9.1 on YARN once it is out. Regards, Bhaskar

On Fri, Mar 21, 2014 at 12:42 AM, Tom Graves tgraves...@yahoo.com wrote:

I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YARN and [SPARK-1051] On Yarn, executors don't doAs as submitting user in. The pyspark one I would consider more of an enhancement so might not be appropriate for a point release.

On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com wrote:

It will be great if SPARK-1101 (Umbrella for hardening Spark on YARN, https://spark-project.atlassian.net/browse/SPARK-1101) can get into 0.9.1. Thanks, Bhaskar

On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

Hello everyone, Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly: https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed). Please let me know if there are fixes that were not backported but you would like to see them in 0.9.1. Thanks! TD

-- Evan Chan, Staff Engineer, e...@ooyala.com
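To make the "same partitioner across multiple shuffles" situation concrete, the following is a minimal sketch (not code from the thread): two shuffles built with one HashPartitioner produce corresponding partitions, so partition i of each holds the same key range; whether those corresponding partitions land on the same executor is what the offer ordering influences.

```scala
// Illustration only: two shuffles sharing a partitioner. Partition i of countsA and
// countsB covers the same keys, so scheduling them on the same machine lets the later
// join avoid moving data; randomized executor offers make that accidental co-location
// less likely, which is the performance concern raised above.
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object CoLocationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("colocation"))
    val part = new HashPartitioner(4)

    val countsA = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1)).reduceByKey(part, _ + _)
    val countsB = sc.parallelize(Seq("a" -> 2, "c" -> 3)).reduceByKey(part, _ + _)

    // Both sides already use `part`, so the join itself adds no extra shuffle; the
    // remaining cost is whether corresponding partitions sit on the same executor.
    countsA.join(countsB).collect().foreach(println)
    sc.stop()
  }
}
```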
Re: spark jobserver
Suhas, here is the update, which I posted to SPARK-818: An update: we have put up the final job server here: https://github.com/ooyala/spark-jobserver The plan is to have a spark-contrib repo/github account and this would be one of the first projects. See SPARK-1283 for the ticket to track spark-contrib. On Sat, Mar 22, 2014 at 6:15 PM, Suhas Satish suhas.sat...@gmail.com wrote: Any plans of integrating SPARK-818 into spark trunk ? The pull request is open. It offers spark as a service with spark jobserver running as a separate process. Thanks, Suhas. -- -- Evan Chan Staff Engineer e...@ooyala.com |
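For anyone new to the repo, a job submitted to the job server is roughly shaped like the word-count example in the repo's README; the sketch below is reproduced from memory, so the trait and package names (spark.jobserver.SparkJob, SparkJobValid, SparkJobInvalid) should be checked against the repo itself.

```scala
// Approximate shape of a job-server job (based on the WordCountExample in the repo's
// README, from memory -- check https://github.com/ooyala/spark-jobserver for the exact
// trait and package names). The server keeps the SparkContext alive between requests
// and hands it to runJob for each submitted job.
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

import scala.util.Try

object WordCountExample extends SparkJob {
  // Reject the request up front if the expected config parameter is missing.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    Try(config.getString("input.string"))
      .map(_ => SparkJobValid)
      .getOrElse(SparkJobInvalid("No input.string config param"))

  // The returned value is what the server sends back to the caller.
  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(config.getString("input.string").split(" ").toSeq).countByValue()
}
```

If memory serves, jobs packaged like this are uploaded as a jar and then triggered over the server's REST API, with results returned as JSON; see the repo's README and examples for the exact endpoints.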
Re: Spark 0.9.1 release
Hey Evan and TD, Spark's dependency graph in a maintenance release seems potentially harmful, especially upgrading a minor version (not just a patch version) like this. This could affect other downstream users: for instance, someone could pick up 0.9.1 without knowing that their fastutil dependency gets bumped, and then hit some new problem in fastutil 6.5. - Patrick

On Mon, Mar 24, 2014 at 12:02 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

@Shivaram, That is a useful patch, but I am a bit afraid to merge it in. Randomizing the executor offers has performance implications, especially for Spark Streaming. The non-randomized ordering of allocating machines to tasks was subtly helping to speed up certain window-based shuffle operations. For example, corresponding shuffle partitions in multiple shuffles using the same partitioner were likely to be co-located; that is, shuffle partition 0 was likely to be on the same machine for multiple shuffles. While this is not a reliable mechanism to rely on, randomization may lead to performance degradation. So I am afraid to merge this one without understanding the consequences. @Evan, I have already cut a release! You can submit the PR and we can merge it into branch-0.9. If we have to cut another release, then we can include it.

On Sun, Mar 23, 2014 at 11:42 PM, Evan Chan e...@ooyala.com wrote:

I also have a really minor fix for SPARK-1057 (upgrading fastutil), could that also make it in? -Evan

On Sun, Mar 23, 2014 at 11:01 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

Sorry this request is coming in a bit late, but would it be possible to backport SPARK-979 [1] to branch-0.9? This is the patch for randomizing executor offers and I would like to use this in a release sooner rather than later. Thanks, Shivaram [1] https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e#diff-8ef3258646b0e6a4793d6ad99848eacd

On Thu, Mar 20, 2014 at 10:18 PM, Bhaskar Dutta bhas...@gmail.com wrote:

Thank you! We plan to test out 0.9.1 on YARN once it is out. Regards, Bhaskar

On Fri, Mar 21, 2014 at 12:42 AM, Tom Graves tgraves...@yahoo.com wrote:

I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YARN and [SPARK-1051] On Yarn, executors don't doAs as submitting user in. The pyspark one I would consider more of an enhancement so might not be appropriate for a point release.

On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com wrote:

It will be great if SPARK-1101 (Umbrella for hardening Spark on YARN, https://spark-project.atlassian.net/browse/SPARK-1101) can get into 0.9.1. Thanks, Bhaskar

On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

Hello everyone, Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly: https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed). Please let me know if there are fixes that were not backported but you would like to see them in 0.9.1. Thanks! TD

-- Evan Chan, Staff Engineer, e...@ooyala.com
Re: Spark 0.9.1 release
Spark's dependency graph in a maintenance

*Modifying* Spark's dependency graph...
Re: Spark 0.9.1 release
Patrick, that is a good point.

On Mon, Mar 24, 2014 at 12:14 AM, Patrick Wendell pwend...@gmail.com wrote:

Spark's dependency graph in a maintenance

*Modifying* Spark's dependency graph...
Re: Announcing the official Spark Job Server repo
Thx for answering! See inline for my thoughts (or misunderstanding? ^^)

Andy, doesn't Marathon handle fault tolerance amongst its apps? I.e., if you say that N instances of an app are running and one shuts off, then it spins up another one, no?

Yes indeed, but my question is how to know how many instances we need. You know, it's purely dependent on the amount of resources consumed by the drivers, so it fluctuates over time. My current thinking is that the JobServer could ask Mesos for resources depending on the aggregate resource needs of its currently managed job list (so the jobs themselves should be able to provide that info). Then (perhaps) Marathon could be (hot-)tuned to maintain N+M or N-M instances depending on the load... But maybe I'm crossing into auto-scaling territory :-/

The tricky thing was that I was planning to use Akka Cluster to coordinate, but Mesos itself can be used to coordinate as well, which is an overlap, and I didn't want to make job server HA reliant only on Mesos...

You mean using Akka Cluster to dispatch jobs on the managed (Job Server) nodes? That's actually interesting as well, but I guess it would duplicate some of the work that Mesos or YARN are doing (that is, resource management), right?

Anyways we can discuss offline if needed.

Definitely, let's stop polluting the list!!! C ya, andy

On Thu, Mar 20, 2014 at 1:35 AM, andy petrella andy.petre...@gmail.com wrote:

Heya, That's cool you've already hacked something for this in the scripts! I have a related question: how would it work, actually? I mean, to have this Job Server fault tolerant using Marathon, I would guess that it will need to be itself a Mesos framework, and able to publish its resource needs. And also, for that, the Job Server has to be aware of the resources needed by the Spark drivers that it will run, which is not as easy to guess, unless it is provided by the job itself? I haven't checked the Job Server deeply enough, so it might already be the case (or I'm expressing something completely dumb ^^). For sure, we'll try to share it when we reach the point of deploying using Marathon (planned for April). greetz, and again, nice work Evan! Ndi

On Wed, Mar 19, 2014 at 7:27 AM, Evan Chan e...@ooyala.com wrote:

Andy, Yeah, we've thought of deploying this on Marathon ourselves, but we're not sure how much Mesos we're going to use yet. (Indeed, if you look at bin/server_start.sh, I think I set up the PORT environment var specifically for Marathon.) This is also why we have deploy scripts which package everything into a .tar.gz, again for Mesos deployment. If you do try this, please let us know. :) -Evan

On Tue, Mar 18, 2014 at 3:57 PM, andy petrella andy.petre...@gmail.com wrote:

tad! That's awesome. A quick question: does anyone have insights regarding having such Job Servers deployed using Marathon on Mesos? I'm thinking about an architecture where Marathon would deploy and keep the Job Servers running, as part of the whole set of apps deployed on it, according to the resources needed (à la Jenkins). Any idea is welcome. Back to the news, Evan + Ooyala team: great job again. andy

On Tue, Mar 18, 2014 at 11:39 PM, Henry Saputra henry.sapu...@gmail.com wrote:

W00t! Thanks for releasing this, Evan. - Henry

On Tue, Mar 18, 2014 at 1:51 PM, Evan Chan e...@ooyala.com wrote:

Dear Spark developers, Ooyala is happy to announce that we have pushed our official, Spark 0.9.0 / Scala 2.10-compatible, job server as a GitHub repo: https://github.com/ooyala/spark-jobserver Complete with unit tests, deploy scripts, and examples. The original PR (#222) on incubator-spark is now closed. Please have a look; pull requests are very welcome.

-- Evan Chan, Staff Engineer, e...@ooyala.com
Re: spark jobserver
Thanks a lot for this update Evan, really appreciate the effort.

On Monday, March 24, 2014, Evan Chan e...@ooyala.com wrote:

Suhas, here is the update, which I posted to SPARK-818: An update: we have put up the final job server here: https://github.com/ooyala/spark-jobserver The plan is to have a spark-contrib repo/github account, and this would be one of the first projects. See SPARK-1283 for the ticket to track spark-contrib.

On Sat, Mar 22, 2014 at 6:15 PM, Suhas Satish suhas.sat...@gmail.com wrote:

Any plans of integrating SPARK-818 into spark trunk? The pull request is open. It offers spark as a service with spark jobserver running as a separate process. Thanks, Suhas.

-- Evan Chan, Staff Engineer, e...@ooyala.com

-- Cheers, Suhas.
Re: Spark 0.9.1 release
Patrick, yes, that is indeed a risk.

On Mon, Mar 24, 2014 at 12:30 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

Patrick, that is a good point.

On Mon, Mar 24, 2014 at 12:14 AM, Patrick Wendell pwend...@gmail.com wrote:

Spark's dependency graph in a maintenance

*Modifying* Spark's dependency graph...

-- Evan Chan, Staff Engineer, e...@ooyala.com
Re: spark jobserver
Suhas, You're welcome. We are planning to speak about the job server at the Spark Summit, by the way. -Evan

On Mon, Mar 24, 2014 at 9:38 AM, Suhas Satish suhas.sat...@gmail.com wrote:

Thanks a lot for this update Evan, really appreciate the effort.

On Monday, March 24, 2014, Evan Chan e...@ooyala.com wrote:

Suhas, here is the update, which I posted to SPARK-818: An update: we have put up the final job server here: https://github.com/ooyala/spark-jobserver The plan is to have a spark-contrib repo/github account, and this would be one of the first projects. See SPARK-1283 for the ticket to track spark-contrib.

On Sat, Mar 22, 2014 at 6:15 PM, Suhas Satish suhas.sat...@gmail.com wrote:

Any plans of integrating SPARK-818 into spark trunk? The pull request is open. It offers spark as a service with spark jobserver running as a separate process. Thanks, Suhas.

-- Cheers, Suhas.

-- Evan Chan, Staff Engineer, e...@ooyala.com
Re: new Catalyst/SQL component merged into master
How does it compare against Shark, and what is the future of Shark with this new module in place?

On Sun, Mar 23, 2014 at 11:49 PM, Evan Chan e...@ooyala.com wrote:

Hi Michael, Congrats, this is really neat! What thoughts do you have regarding adding indexing support and predicate pushdown to this SQL framework? Right now we have custom bitmap indexing to speed up queries, so we're really curious about the architectural direction. -Evan

On Fri, Mar 21, 2014 at 11:09 AM, Michael Armbrust mich...@databricks.com wrote:

It would be great if there are any examples or use cases to look at?

There are examples in the Spark documentation. Patrick posted an updated copy here so people can see them before 1.0 is released: http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html

Does this feature have different use cases than Shark, or is it cleaner since the Hive dependency is gone?

Depending on how you use this, there is still a dependency on Hive (by default this is not the case; see the above documentation for more details). However, the dependency is on a stock version of Hive instead of one modified by the AMPLab. Furthermore, Spark SQL has its own optimizer, instead of relying on the Hive optimizer. Long term, this is going to give us a lot more flexibility to optimize queries specifically for the Spark execution engine. We are actively porting over the best parts of Shark (specifically the in-memory columnar representation). Shark still has some features that are missing in Spark SQL, including SharkServer (and years of testing). Once Spark SQL graduates from Alpha status, it'll likely become the new backend for Shark.

-- Evan Chan, Staff Engineer, e...@ooyala.com
Re: new Catalyst/SQL component merged into master
Hi Evan, Index support is definitely something we would like to add, and it is possible that adding support for your custom indexing solution would not be too difficult. We already push predicates into Hive table scan operators when the predicates are over partition keys. You can see an example of how we collect filters and decide which can be pushed into the scan in the HiveTableScan query planning strategy: https://github.com/marmbrus/spark/blob/0ae86cfcba56b700d8e7bd869379f0c663b21c1e/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L56 I'd like to know more about your indexing solution. Is this something publicly available?

One concern here is that the query planning code is not considered a public API and so is likely to change quite a bit as we improve the optimizer. It's not currently something that we plan to expose for external components to modify. Michael

On Sun, Mar 23, 2014 at 11:49 PM, Evan Chan e...@ooyala.com wrote:

Hi Michael, Congrats, this is really neat! What thoughts do you have regarding adding indexing support and predicate pushdown to this SQL framework? Right now we have custom bitmap indexing to speed up queries, so we're really curious about the architectural direction. -Evan

On Fri, Mar 21, 2014 at 11:09 AM, Michael Armbrust mich...@databricks.com wrote:

It would be great if there are any examples or use cases to look at?

There are examples in the Spark documentation. Patrick posted an updated copy here so people can see them before 1.0 is released: http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html

Does this feature have different use cases than Shark, or is it cleaner since the Hive dependency is gone?

Depending on how you use this, there is still a dependency on Hive (by default this is not the case; see the above documentation for more details). However, the dependency is on a stock version of Hive instead of one modified by the AMPLab. Furthermore, Spark SQL has its own optimizer, instead of relying on the Hive optimizer. Long term, this is going to give us a lot more flexibility to optimize queries specifically for the Spark execution engine. We are actively porting over the best parts of Shark (specifically the in-memory columnar representation). Shark still has some features that are missing in Spark SQL, including SharkServer (and years of testing). Once Spark SQL graduates from Alpha status, it'll likely become the new backend for Shark.

-- Evan Chan, Staff Engineer, e...@ooyala.com
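As a toy model of the split Michael describes (this is not the Catalyst API, just an illustration): predicates that reference only partition keys can be evaluated while choosing which partitions to scan, while everything else stays in a Filter operator above the scan.

```scala
// Toy sketch, not Catalyst code: separate the filters that touch only partition keys
// (pushable into the table scan) from those that reference data columns (evaluated
// after the scan in a Filter operator).
case class Predicate(references: Set[String], sql: String)

object PushdownSketch {
  def split(filters: Seq[Predicate], partitionKeys: Set[String]): (Seq[Predicate], Seq[Predicate]) =
    filters.partition(_.references.subsetOf(partitionKeys))

  def main(args: Array[String]): Unit = {
    val filters = Seq(
      Predicate(Set("dt"), "dt = '2014-03-24'"), // partition key only -> pushed into the scan
      Predicate(Set("user_id"), "user_id = 42")  // data column -> evaluated after the scan
    )
    val (pushed, residual) = split(filters, partitionKeys = Set("dt"))
    println(s"pushed into scan: ${pushed.map(_.sql)}")
    println(s"residual filter:  ${residual.map(_.sql)}")
  }
}
```

An index-aware strategy would follow the same pattern: route the predicates the index can answer (here, hypothetically, Ooyala's bitmap index lookups) into the scan, and leave the rest to a Filter node.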
Re: Spark 0.9.1 release
1051 is essential! I'm not sure about the others, but anything that adds stability to Spark/YARN would be helpful. Kevin Markey

On 03/20/2014 01:12 PM, Tom Graves wrote:

I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YARN and [SPARK-1051] On Yarn, executors don't doAs as submitting user in. The pyspark one I would consider more of an enhancement so might not be appropriate for a point release.

On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com wrote:

It will be great if SPARK-1101 (Umbrella for hardening Spark on YARN, https://spark-project.atlassian.net/browse/SPARK-1101) can get into 0.9.1. Thanks, Bhaskar

On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

Hello everyone, Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly: https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed). Please let me know if there are fixes that were not backported but you would like to see them in 0.9.1. Thanks! TD
Re: Spark 0.9.1 release
1051 has been pulled in! Search for 1051 in https://git-wip-us.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-0.9 TD

On Mon, Mar 24, 2014 at 4:26 PM, Kevin Markey kevin.mar...@oracle.com wrote:

1051 is essential! I'm not sure about the others, but anything that adds stability to Spark/YARN would be helpful. Kevin Markey

On 03/20/2014 01:12 PM, Tom Graves wrote:

I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YARN and [SPARK-1051] On Yarn, executors don't doAs as submitting user in. The pyspark one I would consider more of an enhancement so might not be appropriate for a point release.

On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com wrote:

It will be great if SPARK-1101 (Umbrella for hardening Spark on YARN, https://spark-project.atlassian.net/browse/SPARK-1101) can get into 0.9.1. Thanks, Bhaskar

On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das tathagata.das1...@gmail.com wrote:

Hello everyone, Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly: https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed). Please let me know if there are fixes that were not backported but you would like to see them in 0.9.1. Thanks! TD
Re: Spark 0.9.1 release
Is there any way that [SPARK-782] (Shade ASM) can be included? I see that it is not currently backported to 0.9. But there is no single issue that has caused us more grief as we integrate spark-core with other project dependencies. There are way too many libraries out there, in addition to Spark 0.9 and earlier, that are not well-behaved (the ASM FAQ recommends shading), including some Hive and Hadoop libraries and a number of servlet libraries. We can't control those, but if Spark were well-behaved in this regard, it would help. Even for a maintenance release, and even if 1.0 is only 6 weeks away! (For those not following 782: according to the JIRA comments, the SBT build shades it, but it is the Maven build that ends up in Maven Central.) Thanks, Kevin Markey

On 03/19/2014 06:07 PM, Tathagata Das wrote:

Hello everyone, Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly: https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed). Please let me know if there are fixes that were not backported but you would like to see them in 0.9.1. Thanks! TD
Re: Spark 0.9.1 release
Hello Kevin, A fix for SPARK-782 would definitely simplify building against Spark. However, it's possible that a fix for this issue in 0.9.1 would break the builds (that reference Spark) of existing 0.9 users, either due to a change in the ASM version or by being incompatible with their current workarounds for this issue. That is not a good idea for a maintenance release, especially when 1.0 is not too far away. Can you (and others) elaborate more on the current workarounds that you have for this issue? It's best to understand all the implications of this fix. Note that in branch 0.9 it is not fixed, neither in SBT nor in Maven. TD

On Mon, Mar 24, 2014 at 4:38 PM, Kevin Markey kevin.mar...@oracle.com wrote:

Is there any way that [SPARK-782] (Shade ASM) can be included? I see that it is not currently backported to 0.9. But there is no single issue that has caused us more grief as we integrate spark-core with other project dependencies. There are way too many libraries out there, in addition to Spark 0.9 and earlier, that are not well-behaved (the ASM FAQ recommends shading), including some Hive and Hadoop libraries and a number of servlet libraries. We can't control those, but if Spark were well-behaved in this regard, it would help. Even for a maintenance release, and even if 1.0 is only 6 weeks away! (For those not following 782: according to the JIRA comments, the SBT build shades it, but it is the Maven build that ends up in Maven Central.) Thanks, Kevin Markey

On 03/19/2014 06:07 PM, Tathagata Das wrote:

Hello everyone, Since the release of Spark 0.9, we have received a number of important bug fixes and we would like to make a bug-fix release of Spark 0.9.1. We are going to cut a release candidate soon and we would love it if people test it out. We have backported several bug fixes into the 0.9 branch and updated JIRA accordingly: https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed). Please let me know if there are fixes that were not backported but you would like to see them in 0.9.1. Thanks! TD