Re: Announcing the official Spark Job Server repo

2014-03-24 Thread Evan Chan
Andy, doesn't Marathon handle fault tolerance amongst its apps?  I.e., if
you say that N instances of an app are running and one shuts off,
then it spins up another one, no?

The tricky thing was that I was planning to use Akka Cluster to
coordinate, but Mesos itself can be used to coordinate as well, which
is an overlap, and I didn't want to make job server HA
reliant only on Mesos... Anyways, we can discuss offline if needed.
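(For concreteness, a hedged sketch of the Marathon side of this: an app definition that asks Marathon to keep N job-server instances alive. Field names follow Marathon's app API, but the id, command, and sizing below are illustrative assumptions, not Ooyala's actual deployment.)

```python
import json

# Hedged sketch of a Marathon app definition for the job server.
# The id, cmd, and resource numbers are illustrative, not a tested config.
app = {
    "id": "spark-jobserver",
    "cmd": "./server_start.sh",  # Marathon injects $PORT into the task's env
    "instances": 3,              # Marathon restarts any instance that dies
    "cpus": 1.0,
    "mem": 1024,
}
marathon_payload = json.dumps(app)
```

Posting a payload like this to Marathon's apps endpoint is what gives you the "keep N running" behavior discussed above.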

On Thu, Mar 20, 2014 at 1:35 AM, andy petrella andy.petre...@gmail.com wrote:
 Heya,
 That's cool you've already hacked something for this in the scripts!

 I have a related question, how would it work actually. I mean, to have this
 Job Server fault tolerant using Marathon, I would guess that it will need
 to be itself a Mesos framework, and able to publish its resources needs.
 And also, for that, the Job Server has to be aware of the resources needed
 by the Spark drivers that it will run, which is not as easy to guess,
 unless it is provided by the job itself?

 I didn't check the Job Server deeply enough, so it might already be the case
 (or I'm expressing something completely dumb ^^).

 For sure, we'll try to share it when we'll reach this point to deploy using
 marathon (should be planned for April)

 greetz and again, Nice Work Evan!

 Ndi

 On Wed, Mar 19, 2014 at 7:27 AM, Evan Chan e...@ooyala.com wrote:

 Andy,

 Yeah, we've thought of deploying this on Marathon ourselves, but we're
 not sure how much Mesos we're going to use yet.   (Indeed if you look
 at bin/server_start.sh, I think I set up the PORT environment var
 specifically for Marathon.)This is also why we have deploy scripts
 which package into .tar.gz, again for Mesos deployment.

 If you do try this, please let us know.  :)

 -Evan


 On Tue, Mar 18, 2014 at 3:57 PM, andy petrella andy.petre...@gmail.com
 wrote:
  tad! That's awesome.
 
  A quick question, does someone have insights regarding having such
  JobServers deployed using Marathon on Mesos?
 
  I'm thinking about an arch where Marathon would deploy and keep the Job
  Servers running along with part of the whole set of apps deployed on it
  regarding the resources needed (à la Jenkins).
 
  Any idea is welcome.
 
  Back to the news, Evan + Ooyala team: Great Job again.
 
  andy
 
  On Tue, Mar 18, 2014 at 11:39 PM, Henry Saputra henry.sapu...@gmail.com
 wrote:
 
  W00t!
 
  Thanks for releasing this, Evan.
 
  - Henry
 
  On Tue, Mar 18, 2014 at 1:51 PM, Evan Chan e...@ooyala.com wrote:
   Dear Spark developers,
  
   Ooyala is happy to announce that we have pushed our official, Spark
   0.9.0 / Scala 2.10-compatible, job server as a github repo:
  
   https://github.com/ooyala/spark-jobserver
  
   Complete with unit tests, deploy scripts, and examples.
  
   The original PR (#222) on incubator-spark is now closed.
  
   Please have a look; pull requests are very welcome.
   --
   --
   Evan Chan
   Staff Engineer
   e...@ooyala.com  |
 



 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Spark 0.9.1 release

2014-03-24 Thread Evan Chan
I also have a really minor fix for SPARK-1057  (upgrading fastutil),
could that also make it in?

-Evan


On Sun, Mar 23, 2014 at 11:01 PM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
 Sorry this request is coming in a bit late, but would it be possible to
 backport SPARK-979[1] to branch-0.9 ? This is the patch for randomizing
 executor offers and I would like to use this in a release sooner rather
 than later.

 Thanks
 Shivaram

 [1]
 https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e#diff-8ef3258646b0e6a4793d6ad99848eacd


 On Thu, Mar 20, 2014 at 10:18 PM, Bhaskar Dutta bhas...@gmail.com wrote:

 Thank You! We plan to test out 0.9.1 on YARN once it is out.

 Regards,
 Bhaskar

 On Fri, Mar 21, 2014 at 12:42 AM, Tom Graves tgraves...@yahoo.com wrote:

  I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running
  on YARN - JIRA and  [SPARK-1051] On Yarn, executors don't doAs as
  submitting user - JIRA in.  The pyspark one I would consider more of an
  enhancement so might not be appropriate for a point release.
 
 
   [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YA...
  org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set at
 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
  at org.apache.spark.schedule...
 
 
   [SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA
  This means that they can't write/read from files that the yarn user
  doesn't have permissions to but the submitting user does.
 
 
 
 
 
  On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com
  wrote:
 
  It will be great if
  SPARK-1101https://spark-project.atlassian.net/browse/SPARK-1101:
  Umbrella
  for hardening Spark on YARN can get into 0.9.1.
 
  Thanks,
  Bhaskar
 
 
  On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
  tathagata.das1...@gmail.comwrote:
 
Hello everyone,
  
   Since the release of Spark 0.9, we have received a number of important
  bug
   fixes and we would like to make a bug-fix release of Spark 0.9.1. We
 are
   going to cut a release candidate soon and we would love it if people
 test
   it out. We have backported several bug fixes into the 0.9 and updated
  JIRA
   accordingly
  
 
 https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
   .
   Please let me know if there are fixes that were not backported but you
   would like to see them in 0.9.1.
  
   Thanks!
  
   TD
  
 




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: new Catalyst/SQL component merged into master

2014-03-24 Thread Evan Chan
Hi Michael,

Congrats, this is really neat!

What thoughts do you have regarding adding indexing support and
predicate pushdown to this SQL framework? Right now we have custom
bitmap indexing to speed up queries, so we're really curious about
the architectural direction.

-Evan


On Fri, Mar 21, 2014 at 11:09 AM, Michael Armbrust
mich...@databricks.com wrote:

 It will be great if there are any examples or usecases to look at ?

 There are examples in the Spark documentation.  Patrick posted an updated
 copy here so people can see them before 1.0 is released:
 http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html

 Does this feature have different use cases than Shark, or is it cleaner
 now that the Hive dependency is gone?

 Depending on how you use this, there is still a dependency on Hive (By
 default this is not the case.  See the above documentation for more
 details).  However, the dependency is on a stock version of Hive instead of
 one modified by the AMPLab.  Furthermore, Spark SQL has its own optimizer,
 instead of relying on the Hive optimizer.  Long term, this is going to give
 us a lot more flexibility to optimize queries specifically for the Spark
 execution engine.  We are actively porting over the best parts of shark
 (specifically the in-memory columnar representation).

 Shark still has some features that are missing in Spark SQL, including
 SharkServer (and years of testing).  Once SparkSQL graduates from Alpha
 status, it'll likely become the new backend for Shark.
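(For readers wondering what "its own optimizer" looks like in practice: Catalyst applies rewrite rules over a tree of logical operators. Below is a toy, non-Spark illustration of one such rule, pushing a filter beneath a projection; these are NOT Spark SQL's real classes.)

```python
from dataclasses import dataclass
from typing import Any, List

# Toy logical operators, in the spirit of (but not identical to) Catalyst.
@dataclass
class Scan:
    table: str

@dataclass
class Project:
    columns: List[str]
    child: Any

@dataclass
class Filter:
    columns: List[str]  # columns the predicate references
    predicate: str
    child: Any

def push_filter_below_project(plan):
    """Rule: Filter(Project(x)) -> Project(Filter(x)) when the filter only
    touches columns the projection keeps."""
    if (isinstance(plan, Filter) and isinstance(plan.child, Project)
            and set(plan.columns) <= set(plan.child.columns)):
        proj = plan.child
        return Project(proj.columns,
                       Filter(plan.columns, plan.predicate, proj.child))
    return plan

plan = Filter(["age"], "age > 21", Project(["name", "age"], Scan("people")))
optimized = push_filter_below_project(plan)
```

A real optimizer applies many such rules repeatedly until the plan stops changing; this sketch just shows the shape of a single rewrite.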



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Spark 0.9.1 release

2014-03-24 Thread Evan Chan
@Tathagata,  the PR is here:
https://github.com/apache/spark/pull/215

On Mon, Mar 24, 2014 at 12:02 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 @Shivaram, That is a useful patch, but I am a bit afraid to merge it in.
 Randomizing the executor offers has performance implications, especially for
 Spark Streaming. The non-randomized ordering of allocating machines to tasks
 was subtly helping to speed up certain window-based shuffle operations.  For
 example, corresponding shuffle partitions in multiple shuffles using the
 same partitioner were likely to be co-located; that is, shuffle partition 0
 was likely to be on the same machine for multiple shuffles. While this is
 not a mechanism to rely on, randomization may lead to
 performance degradation. So I am afraid to merge this one without
 understanding the consequences.

 @Evan, I have already cut a release! You can submit the PR and we can merge
 it into branch-0.9. If we have to cut another release, then we can include it.
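(A toy illustration of the co-location effect described above, under the simplifying assumption that executor offers arrive in a fixed order: deterministic offer ordering maps the same partition index to the same host across shuffles, while randomized offers generally do not.)

```python
import random

hosts = ["host-a", "host-b", "host-c"]

def assign(num_partitions, offers):
    # Round-robin shuffle partitions over executor offers, in offer order.
    return {p: offers[p % len(offers)] for p in range(num_partitions)}

# Deterministic offer ordering: partition i lands on the same host in
# every shuffle, so corresponding partitions stay co-located.
s1 = assign(6, hosts)
s2 = assign(6, hosts)

# Randomized offer ordering: co-location is no longer guaranteed.
rng = random.Random()
r1 = assign(6, rng.sample(hosts, len(hosts)))
r2 = assign(6, rng.sample(hosts, len(hosts)))
```

This is a deliberately naive model of offer handling, but it captures why randomizing offers can break the accidental co-location that window-based shuffles were benefiting from.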



 On Sun, Mar 23, 2014 at 11:42 PM, Evan Chan e...@ooyala.com wrote:

 I also have a really minor fix for SPARK-1057  (upgrading fastutil),
 could that also make it in?

 -Evan


 On Sun, Mar 23, 2014 at 11:01 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  Sorry this request is coming in a bit late, but would it be possible to
  backport SPARK-979[1] to branch-0.9 ? This is the patch for randomizing
  executor offers and I would like to use this in a release sooner rather
  than later.
 
  Thanks
  Shivaram
 
  [1]
 
 https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e#diff-8ef3258646b0e6a4793d6ad99848eacd
 
 
  On Thu, Mar 20, 2014 at 10:18 PM, Bhaskar Dutta bhas...@gmail.com
 wrote:
 
  Thank You! We plan to test out 0.9.1 on YARN once it is out.
 
  Regards,
  Bhaskar
 
  On Fri, Mar 21, 2014 at 12:42 AM, Tom Graves tgraves...@yahoo.com
 wrote:
 
   I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when
 running
   on YARN - JIRA and  [SPARK-1051] On Yarn, executors don't doAs as
   submitting user - JIRA in.  The pyspark one I would consider more of
 an
   enhancement so might not be appropriate for a point release.
  
  
[SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on
 YA...
   org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set at
  
 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
   at org.apache.spark.schedule...
  
  
[SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA
   This means that they can't write/read from files that the yarn user
   doesn't have permissions to but the submitting user does.
  
  
  
  
  
   On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com
 
   wrote:
  
   It will be great if
   SPARK-1101https://spark-project.atlassian.net/browse/SPARK-1101:
   Umbrella
   for hardening Spark on YARN can get into 0.9.1.
  
   Thanks,
   Bhaskar
  
  
   On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
   tathagata.das1...@gmail.comwrote:
  
 Hello everyone,
   
Since the release of Spark 0.9, we have received a number of
 important
   bug
fixes and we would like to make a bug-fix release of Spark 0.9.1. We
  are
going to cut a release candidate soon and we would love it if people
  test
it out. We have backported several bug fixes into the 0.9 and
 updated
   JIRA
accordingly
   
  
 
 https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
.
Please let me know if there are fixes that were not backported but
 you
would like to see them in 0.9.1.
   
Thanks!
   
TD
   
  
 



 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: spark jobserver

2014-03-24 Thread Evan Chan
Suhas, here is the update, which I posted to SPARK-818:

An update: we have put up the final job server here:
https://github.com/ooyala/spark-jobserver

The plan is to have a spark-contrib repo/github account and this would
be one of the first projects.

See SPARK-1283 for the ticket to track spark-contrib.

On Sat, Mar 22, 2014 at 6:15 PM, Suhas Satish suhas.sat...@gmail.com wrote:
 Any plans of integrating SPARK-818 into spark trunk ? The pull request is
 open.
 It offers spark as a service with spark jobserver running as a separate
 process.


 Thanks,
 Suhas.



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: Spark 0.9.1 release

2014-03-24 Thread Patrick Wendell
Hey Evan and TD,

Spark's dependency graph in a maintenance release seems potentially
harmful, especially upgrading a minor version (not just a patch
version) like this. This could affect other downstream users. For
instance, someone could upgrade without knowing that their fastutil
dependency gets bumped, and then hit some new problem in fastutil 6.5.

- Patrick

On Mon, Mar 24, 2014 at 12:02 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 @Shivaram, That is a useful patch, but I am a bit afraid to merge it in.
 Randomizing the executor offers has performance implications, especially for
 Spark Streaming. The non-randomized ordering of allocating machines to tasks
 was subtly helping to speed up certain window-based shuffle operations.  For
 example, corresponding shuffle partitions in multiple shuffles using the
 same partitioner were likely to be co-located; that is, shuffle partition 0
 was likely to be on the same machine for multiple shuffles. While this is
 not a mechanism to rely on, randomization may lead to
 performance degradation. So I am afraid to merge this one without
 understanding the consequences.

 @Evan, I have already cut a release! You can submit the PR and we can merge
 it into branch-0.9. If we have to cut another release, then we can include it.



 On Sun, Mar 23, 2014 at 11:42 PM, Evan Chan e...@ooyala.com wrote:

 I also have a really minor fix for SPARK-1057  (upgrading fastutil),
 could that also make it in?

 -Evan


 On Sun, Mar 23, 2014 at 11:01 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  Sorry this request is coming in a bit late, but would it be possible to
  backport SPARK-979[1] to branch-0.9 ? This is the patch for randomizing
  executor offers and I would like to use this in a release sooner rather
  than later.
 
  Thanks
  Shivaram
 
  [1]
 
 https://github.com/apache/spark/commit/556c56689bbc32c6cec0d07b57bd3ec73ceb243e#diff-8ef3258646b0e6a4793d6ad99848eacd
 
 
  On Thu, Mar 20, 2014 at 10:18 PM, Bhaskar Dutta bhas...@gmail.com
 wrote:
 
  Thank You! We plan to test out 0.9.1 on YARN once it is out.
 
  Regards,
  Bhaskar
 
  On Fri, Mar 21, 2014 at 12:42 AM, Tom Graves tgraves...@yahoo.com
 wrote:
 
   I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when
 running
   on YARN - JIRA and  [SPARK-1051] On Yarn, executors don't doAs as
   submitting user - JIRA in.  The pyspark one I would consider more of
 an
   enhancement so might not be appropriate for a point release.
  
  
[SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on
 YA...
   org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set at
  
 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
   at org.apache.spark.schedule...
  
  
[SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA
   This means that they can't write/read from files that the yarn user
   doesn't have permissions to but the submitting user does.
  
  
  
  
  
   On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com
 
   wrote:
  
   It will be great if
   SPARK-1101https://spark-project.atlassian.net/browse/SPARK-1101:
   Umbrella
   for hardening Spark on YARN can get into 0.9.1.
  
   Thanks,
   Bhaskar
  
  
   On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
   tathagata.das1...@gmail.comwrote:
  
 Hello everyone,
   
Since the release of Spark 0.9, we have received a number of
 important
   bug
fixes and we would like to make a bug-fix release of Spark 0.9.1. We
  are
going to cut a release candidate soon and we would love it if people
  test
it out. We have backported several bug fixes into the 0.9 and
 updated
   JIRA
accordingly
   
  
 
 https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
.
Please let me know if there are fixes that were not backported but
 you
would like to see them in 0.9.1.
   
Thanks!
   
TD
   
  
 



 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |



Re: Spark 0.9.1 release

2014-03-24 Thread Patrick Wendell
 Spark's dependency graph in a maintenance
*Modifying* Spark's dependency graph...


Re: Spark 0.9.1 release

2014-03-24 Thread Tathagata Das
Patrick, that is a good point.


On Mon, Mar 24, 2014 at 12:14 AM, Patrick Wendell pwend...@gmail.comwrote:

  Spark's dependency graph in a maintenance
 *Modifying* Spark's dependency graph...



Re: Announcing the official Spark Job Server repo

2014-03-24 Thread andy petrella
Thx for answering!
see inline for my thoughts (or misunderstanding ? ^^)

Andy, doesn't Marathon handle fault tolerance amongst its apps?  ie if
 you say that N instances of an app are running, and one shuts off,
 then it spins up another one no?

Yes indeed, but my question is: how do we know how many instances we need?
You know, it's purely dependent on the amount of resources consumed by the
drivers, so it fluctuates over time.
In my current thinking, the JobServer could ask Mesos for resources
depending on the amount of resources of its currently managed job list (so
the jobs themselves should be able to deliver such info). Then (perhaps)
Marathon can be (hot-)tuned to maintain N+M or N-M instances depending on
the load... But maybe I'm crossing some boundaries, the ones with
auto-scaling :-/
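(One way to make this concrete, as a hedged sketch: derive the desired Marathon instance count from the aggregate resources the currently managed jobs report, plus a fixed margin M of spare instances. The function name, per-instance sizing, and margin here are made-up parameters for illustration, not anything the job server actually implements.)

```python
import math

def desired_instances(job_mem_mb, mem_per_instance_mb=1024, margin=1):
    """Naive scaling heuristic: enough job-server instances to cover the
    memory the managed jobs report needing, plus `margin` spares."""
    needed = math.ceil(sum(job_mem_mb) / mem_per_instance_mb)
    return max(1, needed + margin)

# Three managed jobs reporting 512, 768, and 2048 MB of driver memory.
n = desired_instances([512, 768, 2048])
```

A controller could periodically recompute this from the managed job list and PUT the new `instances` count to Marathon, which is essentially the "hot-tuned" auto-scaling loop described above.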



 The tricky thing was that I was planning to use Akka Cluster to
 coordinate, but Mesos itself can be used to coordinate as well, which
 is an overlap/ but I didn't want ot make job server HA just
 reliant only on Mesos...

You mean using Akka Cluster to dispatch jobs on the managed (Job Server)
nodes? That's actually quite interesting as well, but I guess it would
require duplicating some of the work that Mesos or YARN are doing (that
is, resource management), right?


 Anyways we can discuss offline if needed.

Definitely, let's stop polluting the list !!!


C ya
andy


 On Thu, Mar 20, 2014 at 1:35 AM, andy petrella andy.petre...@gmail.com
 wrote:
  Heya,
  That's cool you've already hacked something for this in the scripts!
 
  I have a related question, how would it work actually. I mean, to have
 this
  Job Server fault tolerant using Marathon, I would guess that it will need
  to be itself a Mesos framework, and able to publish its resources needs.
  And also, for that, the Job Server has to be aware of the resources
 needed
  by the Spark drivers that it will run, which is not as easy to guess,
  unless it is provided by the job itself?
 
  I didn't check the Job Server deeply enough, so it might already be the
 case
  (or I'm expressing something completely dumb ^^).
 
  For sure, we'll try to share it when we'll reach this point to deploy
 using
  marathon (should be planned for April)
 
  greetz and again, Nice Work Evan!
 
  Ndi
 
  On Wed, Mar 19, 2014 at 7:27 AM, Evan Chan e...@ooyala.com wrote:
 
  Andy,
 
  Yeah, we've thought of deploying this on Marathon ourselves, but we're
  not sure how much Mesos we're going to use yet.   (Indeed if you look
  at bin/server_start.sh, I think I set up the PORT environment var
  specifically for Marathon.)This is also why we have deploy scripts
  which package into .tar.gz, again for Mesos deployment.
 
  If you do try this, please let us know.  :)
 
  -Evan
 
 
  On Tue, Mar 18, 2014 at 3:57 PM, andy petrella andy.petre...@gmail.com
 
  wrote:
   tad! That's awesome.
  
   A quick question, does someone have insights regarding having such
   JobServers deployed using Marathon on Mesos?
  
   I'm thinking about an arch where Marathon would deploy and keep the
 Job
   Servers running along with part of the whole set of apps deployed on
 it
   regarding the resources needed (à la Jenkins).
  
   Any idea is welcome.
  
   Back to the news, Evan + Ooyala team: Great Job again.
  
   andy
  
   On Tue, Mar 18, 2014 at 11:39 PM, Henry Saputra 
 henry.sapu...@gmail.com
  wrote:
  
   W00t!
  
   Thanks for releasing this, Evan.
  
   - Henry
  
   On Tue, Mar 18, 2014 at 1:51 PM, Evan Chan e...@ooyala.com wrote:
Dear Spark developers,
   
Ooyala is happy to announce that we have pushed our official, Spark
0.9.0 / Scala 2.10-compatible, job server as a github repo:
   
https://github.com/ooyala/spark-jobserver
   
Complete with unit tests, deploy scripts, and examples.
   
The original PR (#222) on incubator-spark is now closed.
   
Please have a look; pull requests are very welcome.
--
--
Evan Chan
Staff Engineer
e...@ooyala.com  |
  
 
 
 
  --
  --
  Evan Chan
  Staff Engineer
  e...@ooyala.com  |
 



 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |



Re: spark jobserver

2014-03-24 Thread Suhas Satish
Thanks a lot for this update, Evan. Really appreciate the effort.

On Monday, March 24, 2014, Evan Chan e...@ooyala.com wrote:

 Suhas, here is the update, which I posted to SPARK-818:

 An update: we have put up the final job server here:
 https://github.com/ooyala/spark-jobserver

 The plan is to have a spark-contrib repo/github account and this would
 be one of the first projects.

 See SPARK-1283 for the ticket to track spark-contrib.

 On Sat, Mar 22, 2014 at 6:15 PM, Suhas Satish 
 suhas.sat...@gmail.comjavascript:;
 wrote:
  Any plans of integrating SPARK-818 into spark trunk ? The pull request is
  open.
  It offers spark as a service with spark jobserver running as a separate
  process.
 
 
  Thanks,
  Suhas.



 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com javascript:;  |



-- 
Cheers,
Suhas.


Re: Spark 0.9.1 release

2014-03-24 Thread Evan Chan
Patrick, yes, that is indeed a risk.

On Mon, Mar 24, 2014 at 12:30 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
 Patrick, that is a good point.


 On Mon, Mar 24, 2014 at 12:14 AM, Patrick Wendell pwend...@gmail.comwrote:

  Spark's dependency graph in a maintenance
 *Modifying* Spark's dependency graph...




-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: spark jobserver

2014-03-24 Thread Evan Chan
Suhas,

You're welcome.  We are planning to speak about the job server at the
Spark Summit by the way.

-Evan


On Mon, Mar 24, 2014 at 9:38 AM, Suhas Satish suhas.sat...@gmail.com wrote:
 Thanks a lot for this update Evan , really appreciate the effort.

 On Monday, March 24, 2014, Evan Chan e...@ooyala.com wrote:

 Suhas, here is the update, which I posted to SPARK-818:

 An update: we have put up the final job server here:
 https://github.com/ooyala/spark-jobserver

 The plan is to have a spark-contrib repo/github account and this would
 be one of the first projects.

 See SPARK-1283 for the ticket to track spark-contrib.

 On Sat, Mar 22, 2014 at 6:15 PM, Suhas Satish 
 suhas.sat...@gmail.comjavascript:;
 wrote:
  Any plans of integrating SPARK-818 into spark trunk ? The pull request is
  open.
  It offers spark as a service with spark jobserver running as a separate
  process.
 
 
  Thanks,
  Suhas.



 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com javascript:;  |



 --
 Cheers,
 Suhas.



-- 
--
Evan Chan
Staff Engineer
e...@ooyala.com  |


Re: new Catalyst/SQL component merged into master

2014-03-24 Thread Usman Ghani
How does it compare against Shark, and what is the future of Shark with
this new module in place?


On Sun, Mar 23, 2014 at 11:49 PM, Evan Chan e...@ooyala.com wrote:

 Hi Michael,

 Congrats, this is really neat!

 What thoughts do you have regarding adding indexing support and
 predicate pushdown to this SQL framework? Right now we have custom
 bitmap indexing to speed up queries, so we're really curious about
 the architectural direction.

 -Evan


 On Fri, Mar 21, 2014 at 11:09 AM, Michael Armbrust
 mich...@databricks.com wrote:
 
  It will be great if there are any examples or usecases to look at ?
 
  There are examples in the Spark documentation.  Patrick posted and
 updated
  copy here so people can see them before 1.0 is released:
 
 http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html
 
  Does this feature has different usecases than shark or more cleaner as
  hive dependency is gone?
 
  Depending on how you use this, there is still a dependency on Hive (By
  default this is not the case.  See the above documentation for more
  details).  However, the dependency is on a stock version of Hive instead
 of
  one modified by the AMPLab.  Furthermore, Spark SQL has its own
 optimizer,
  instead of relying on the Hive optimizer.  Long term, this is going to
 give
  us a lot more flexibility to optimize queries specifically for the Spark
  execution engine.  We are actively porting over the best parts of shark
  (specifically the in-memory columnar representation).
 
  Shark still has some features that are missing in Spark SQL, including
  SharkServer (and years of testing).  Once SparkSQL graduates from Alpha
  status, it'll likely become the new backend for Shark.



 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |



Re: new Catalyst/SQL component merged into master

2014-03-24 Thread Michael Armbrust
Hi Evan,

Index support is definitely something we would like to add, and it is
possible that adding support for your custom indexing solution would not be
too difficult.

We already push predicates into hive table scan operators when the
predicates are over partition keys.  You can see an example of how we
 collect filters and decide which can be pushed into the scan using the
 HiveTableScan query planning strategy:
 https://github.com/marmbrus/spark/blob/0ae86cfcba56b700d8e7bd869379f0c663b21c1e/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L56
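(For readers outside the code, a toy sketch of the decision being made there, not Spark SQL's actual classes: predicates that reference only partition keys can be pushed into the scan, while the rest must remain in a Filter above it. Column and predicate names below are illustrative.)

```python
# Hypothetical sketch of partition-predicate pushdown classification.
# Each predicate is paired with the list of columns it references.
def split_pushdown(predicates, partition_keys):
    pushed, residual = [], []
    for cols, pred in predicates:
        if set(cols) <= set(partition_keys):
            pushed.append(pred)    # evaluable from partition metadata alone
        else:
            residual.append(pred)  # needs row data; stays in a Filter node
    return pushed, residual

pushed, residual = split_pushdown(
    [(["ds"], "ds = '2014-03-24'"), (["user_id"], "user_id > 100")],
    partition_keys=["ds"],
)
```

Pushing the partition-key predicate into the scan means whole partitions are skipped before any rows are read, which is where most of the win comes from.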

I'd like to know more about your indexing solution.  Is this something
publicly available?  One concern here is that the query planning code is
not considered a public API and so is likely to change quite a bit as we
 improve the optimizer.  It's not currently something that we plan to expose
for external components to modify.

Michael


On Sun, Mar 23, 2014 at 11:49 PM, Evan Chan e...@ooyala.com wrote:

 Hi Michael,

 Congrats, this is really neat!

 What thoughts do you have regarding adding indexing support and
 predicate pushdown to this SQL framework? Right now we have custom
 bitmap indexing to speed up queries, so we're really curious about
 the architectural direction.

 -Evan


 On Fri, Mar 21, 2014 at 11:09 AM, Michael Armbrust
 mich...@databricks.com wrote:
 
  It will be great if there are any examples or usecases to look at ?
 
  There are examples in the Spark documentation.  Patrick posted and
 updated
  copy here so people can see them before 1.0 is released:
 
 http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html
 
  Does this feature has different usecases than shark or more cleaner as
  hive dependency is gone?
 
  Depending on how you use this, there is still a dependency on Hive (By
  default this is not the case.  See the above documentation for more
  details).  However, the dependency is on a stock version of Hive instead
 of
  one modified by the AMPLab.  Furthermore, Spark SQL has its own
 optimizer,
  instead of relying on the Hive optimizer.  Long term, this is going to
 give
  us a lot more flexibility to optimize queries specifically for the Spark
  execution engine.  We are actively porting over the best parts of shark
  (specifically the in-memory columnar representation).
 
  Shark still has some features that are missing in Spark SQL, including
  SharkServer (and years of testing).  Once SparkSQL graduates from Alpha
  status, it'll likely become the new backend for Shark.



 --
 --
 Evan Chan
 Staff Engineer
 e...@ooyala.com  |



Re: Spark 0.9.1 release

2014-03-24 Thread Kevin Markey

1051 is essential!
I'm not sure about the others, but anything that adds stability to 
Spark/Yarn would  be helpful.

Kevin Markey


On 03/20/2014 01:12 PM, Tom Graves wrote:

I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on 
YARN - JIRA and  [SPARK-1051] On Yarn, executors don't doAs as submitting user 
- JIRA in.  The pyspark one I would consider more of an enhancement so might 
not be appropriate for a point release.

  
  [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YA...

org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
 at org.apache.spark.schedule...
  
  
  [SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA

This means that they can't write/read from files that the yarn user doesn't 
have permissions to but the submitting user does.
  
  




On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com wrote:
  
It will be great if

SPARK-1101https://spark-project.atlassian.net/browse/SPARK-1101:
Umbrella
for hardening Spark on YARN can get into 0.9.1.

Thanks,
Bhaskar


On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
tathagata.das1...@gmail.comwrote:


   Hello everyone,

Since the release of Spark 0.9, we have received a number of important bug
fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
going to cut a release candidate soon and we would love it if people test
it out. We have backported several bug fixes into the 0.9 and updated JIRA
accordingly
https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)

.

Please let me know if there are fixes that were not backported but you
would like to see them in 0.9.1.

Thanks!

TD





Re: Spark 0.9.1 release

2014-03-24 Thread Tathagata Das
1051 has been pulled in!

search 1051 in
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=shortlog;h=refs/heads/branch-0.9

TD

On Mon, Mar 24, 2014 at 4:26 PM, Kevin Markey kevin.mar...@oracle.com wrote:
 1051 is essential!
 I'm not sure about the others, but anything that adds stability to
 Spark/Yarn would  be helpful.
 Kevin Markey



 On 03/20/2014 01:12 PM, Tom Graves wrote:

 I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running
 on YARN - JIRA and  [SPARK-1051] On Yarn, executors don't doAs as submitting
 user - JIRA in.  The pyspark one I would consider more of an enhancement so
 might not be appropriate for a point release.

 [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on
 YA...
 org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set at
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:49)
 at org.apache.spark.schedule...
   [SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA
 This means that they can't write/read from files that the yarn user
 doesn't have permissions to but the submitting user does.



 On Thursday, March 20, 2014 1:35 PM, Bhaskar Dutta bhas...@gmail.com
 wrote:
   It will be great if
 SPARK-1101https://spark-project.atlassian.net/browse/SPARK-1101:
 Umbrella
 for hardening Spark on YARN can get into 0.9.1.

 Thanks,
 Bhaskar


 On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das
 tathagata.das1...@gmail.comwrote:

Hello everyone,

 Since the release of Spark 0.9, we have received a number of important
 bug
 fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
 going to cut a release candidate soon and we would love it if people test
 it out. We have backported several bug fixes into the 0.9 branch and updated
 JIRA accordingly:
 https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)

 Please let me know if there are fixes that were not backported but that you
 would like to see in 0.9.1.

 Thanks!

 TD




Re: Spark 0.9.1 release

2014-03-24 Thread Kevin Markey
Is there any way that [SPARK-782] (Shade ASM) can be included?  I see 
that it is not currently backported to 0.9, but no single issue has 
caused us more grief as we integrate spark-core with other project 
dependencies.  There are far too many libraries out there, in Spark 0.9 
and earlier as well as elsewhere, that are not well-behaved (the ASM FAQ 
recommends shading), including some Hive and Hadoop libraries and a 
number of servlet libraries.  We can't control those, but if Spark were 
well-behaved in this regard, it would help.  Even for a maintenance 
release, and even if 1.0 is only 6 weeks away!


(For those not following 782, according to Jira comments, the SBT build 
shades it, but it is the Maven build that ends up in Maven Central.)
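
For readers following along: the fix being discussed amounts to relocating
ASM's packages at build time so that Spark's copy of ASM cannot collide with
another ASM version on a downstream classpath. A minimal sketch of what that
could look like with the maven-shade-plugin; the relocated package prefix
here is illustrative, not the actual Spark build configuration:

```xml
<!-- Hypothetical pom.xml fragment: rewrite ASM class/package names inside
     the shaded jar so downstream projects can use any ASM version freely. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.objectweb.asm</pattern>
            <!-- Illustrative target package; the real build may choose another. -->
            <shadedPattern>org.spark-project.asm</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```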


Thanks
Kevin Markey



On 03/19/2014 06:07 PM, Tathagata Das wrote:

  Hello everyone,

Since the release of Spark 0.9, we have received a number of important bug
fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
going to cut a release candidate soon and we would love it if people test
it out. We have backported several bug fixes into the 0.9 branch and updated JIRA
accordingly:
https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)
Please let me know if there are fixes that were not backported but you
would like to see them in 0.9.1.

Thanks!

TD





Re: Spark 0.9.1 release

2014-03-24 Thread Tathagata Das
Hello Kevin,

A fix for SPARK-782 would definitely simplify building against Spark.
However, it's possible that a fix for this issue in 0.9.1 would break
the builds of existing 0.9 users that reference Spark, either due to
a change in the ASM version, or because the fix is incompatible with
their current workarounds for this issue. That is not a good idea for a
maintenance release, especially when 1.0 is not too far away.

Can you (and others) elaborate on the workarounds that you currently
have for this issue? It's best to understand all the implications
of this fix.

Note that in branch 0.9, it is not fixed, neither in SBT nor in Maven.
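
(For context: the workarounds in question typically pin or exclude the
conflicting ASM artifact in the downstream build so that only one ASM
version is resolved. A sketch of one such workaround, with hypothetical
dependency coordinates, assuming a Maven build:)

```xml
<!-- Hypothetical downstream pom.xml fragment: exclude the ASM jar that
     another dependency drags in, leaving a single ASM on the classpath. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.2.0</version>
  <exclusions>
    <exclusion>
      <groupId>asm</groupId>
      <artifactId>asm</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```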

TD

On Mon, Mar 24, 2014 at 4:38 PM, Kevin Markey kevin.mar...@oracle.com wrote:
 Is there any way that [SPARK-782] (Shade ASM) can be included?  I see that
 it is not currently backported to 0.9.  But there is no single issue that
 has caused us more grief as we integrate spark-core with other project
 dependencies.  There are way too many libraries out there in addition to
 Spark 0.9 and before that are not well-behaved (ASM FAQ recommends shading),
 including some Hive and Hadoop libraries and a number of servlet libraries.
 We can't control those, but if Spark were well behaved in this regard, it
 would help.  Even for a maintenance release, and even if 1.0 is only 6 weeks
 away!

 (For those not following 782, according to Jira comments, the SBT build
 shades it, but it is the Maven build that ends up in Maven Central.)

 Thanks
 Kevin Markey




 On 03/19/2014 06:07 PM, Tathagata Das wrote:

   Hello everyone,

 Since the release of Spark 0.9, we have received a number of important bug
 fixes and we would like to make a bug-fix release of Spark 0.9.1. We are
 going to cut a release candidate soon and we would love it if people test
 it out. We have backported several bug fixes into the 0.9 branch and updated JIRA
 accordingly:
 https://spark-project.atlassian.net/browse/SPARK-1275?jql=project%20in%20(SPARK%2C%20BLINKDB%2C%20MLI%2C%20MLLIB%2C%20SHARK%2C%20STREAMING%2C%20GRAPH%2C%20TACHYON)%20AND%20fixVersion%20%3D%200.9.1%20AND%20status%20in%20(Resolved%2C%20Closed)

 Please let me know if there are fixes that were not backported but you
 would like to see them in 0.9.1.

 Thanks!

 TD