Re: [VOTE] Release Apache Spark 1.3.1

2015-04-05 Thread Xiangrui Meng
+1 Verified some MLlib bug fixes on OS X. -Xiangrui

On Sun, Apr 5, 2015 at 1:24 AM, Sean Owen so...@cloudera.com wrote:
 Signatures and hashes are good.
 LICENSE, NOTICE still check out.
 Compiles for a Hadoop 2.6 + YARN + Hive profile.

 I still see the UISeleniumSuite test failure observed in 1.3.0, which
 is minor and already fixed. I don't know why I didn't back-port it:
 https://issues.apache.org/jira/browse/SPARK-6205

 If we roll another, let's get this easy fix in, but it is only an
 issue with tests.


 On JIRA, I checked open issues with Fix Version = 1.3.0 or 1.3.1 and
 all look legitimate (e.g. reopened or in progress).


 There is 1 open Blocker for 1.3.1 per Andrew:
 https://issues.apache.org/jira/browse/SPARK-6673 spark-shell.cmd can't
 start even when spark was built in Windows

 I believe this can be resolved quickly but as a matter of hygiene
 should be fixed or demoted before release.


 FYI there are 16 Critical issues marked for 1.3.0 / 1.3.1; worth
 examining before release to see how critical they are:

 SPARK-6701,Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python
 application,,Open,4/3/15
 SPARK-6484,Ganglia metrics xml reporter doesn't escape
 correctly,Josh Rosen,Open,3/24/15
 SPARK-6270,Standalone Master hangs when streaming job completes,,Open,3/11/15
 SPARK-6209,ExecutorClassLoader can leak connections after failing to
 load classes from the REPL class server,Josh Rosen,In Progress,4/2/15
 SPARK-5113,Audit and document use of hostnames and IP addresses in
 Spark,,Open,3/24/15
 SPARK-5098,Number of running tasks become negative after tasks
 lost,,Open,1/14/15
 SPARK-4925,Publish Spark SQL hive-thriftserver maven artifact,Patrick
 Wendell,Reopened,3/23/15
 SPARK-4922,Support dynamic allocation for coarse-grained Mesos,,Open,3/31/15
 SPARK-4888,Spark EC2 doesn't mount local disks for i2.8xlarge
 instances,,Open,1/27/15
 SPARK-4879,Missing output partitions after job completes with
 speculative execution,Josh Rosen,Open,3/5/15
 SPARK-4751,Support dynamic allocation for standalone mode,Andrew
 Or,Open,12/22/14
 SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15
 SPARK-4452,Shuffle data structures can starve others on the same
 thread for memory,Tianshuo Deng,Open,1/24/15
 SPARK-4352,Incorporate locality preferences in dynamic allocation
 requests,,Open,1/26/15
 SPARK-4227,Document external shuffle service,,Open,3/23/15
 SPARK-3650,Triangle Count handles reverse edges incorrectly,,Open,2/23/15

 On Sun, Apr 5, 2015 at 1:09 AM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.1!

 The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851

 The list of fixes present in this release can be found at:
 http://bit.ly/1C2nVPY

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1080

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.3.1!

 The vote is open until Wednesday, April 08, at 01:10 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 - Patrick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org







Re: Stochastic gradient descent performance

2015-04-05 Thread Shivaram Venkataraman
Yeah, a simple way to estimate the time for an iterative algorithm is the
number of iterations required * time per iteration. The time per iteration
will depend on the batch size, computation required and the fixed overheads
I mentioned before. The number of iterations of course depends on the
convergence rate for the problem being solved.
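For a back-of-the-envelope feel, the estimate above can be sketched as follows (Python for illustration; all the numbers are assumptions, not measurements):

```python
def estimate_runtime_sec(num_iterations, compute_per_iter_sec, fixed_overhead_sec):
    """Rough iterative-job model: iterations * (compute + fixed overhead)."""
    return num_iterations * (compute_per_iter_sec + fixed_overhead_sec)

# e.g. 1000 iterations, 100 ms of compute plus ~50 ms scheduling overhead each:
total_sec = estimate_runtime_sec(1000, 0.100, 0.050)  # ~150 seconds
```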

Thanks
Shivaram

On Thu, Apr 2, 2015 at 2:19 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

  Hi Shivaram,



 It sounds really interesting! With this time we can estimate whether it is
 worth running an iterative algorithm on Spark. For example, for SGD on
 Imagenet (450K samples) we would spend 450K * 50 ms = 6.25 hours traversing
 all the data one example at a time, not counting the data loading,
 computation and update times. One may need to traverse all the data a number
 of times to converge. Let's say this number is equal to the batch size. So we
 remain with 6.25 hours of overhead per pass. Is this reasonable?



 Best regards, Alexander



 *From:* Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
 *Sent:* Thursday, April 02, 2015 1:26 PM
 *To:* Joseph Bradley
 *Cc:* Ulanov, Alexander; dev@spark.apache.org
 *Subject:* Re: Stochastic gradient descent performance



 I haven't looked closely at the sampling issues, but regarding the
 aggregation latency, there are fixed overheads (in local and distributed
 mode) with the way aggregation is done in Spark. Launching a stage of
 tasks, fetching outputs from the previous stage etc. all have overhead, so
 I would say it's not efficient / recommended to run stages where computation
 is less than 500ms or so. You could increase your batch size based on this
 and hopefully that will help.
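To make the 500 ms guideline concrete, here is a small sketch (Python for illustration; the per-example cost is an assumed number, not a Spark measurement):

```python
import math

def min_batch_for_stage(per_example_sec, target_stage_sec=0.5):
    """Smallest mini-batch whose compute time fills the target stage length,
    so the fixed per-stage overhead is amortized. Costs here are assumptions."""
    return math.ceil(target_stage_sec / per_example_sec)

# If one gradient evaluation costs ~1 ms, a 500 ms stage needs >= 500 examples:
batch = min_batch_for_stage(0.001)
```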



 Regarding reducing these overheads by an order of magnitude: it is a
 challenging problem given the architecture of Spark -- I have some ideas
 for this, but they are very much at a research stage.



 Thanks
 Shivaram



 On Thu, Apr 2, 2015 at 12:00 PM, Joseph Bradley jos...@databricks.com
 wrote:

 When you say "It seems that instead of sample it is better to shuffle data
 and then access it sequentially by mini-batches", are you sure that holds
 true for a big dataset in a cluster?  As far as implementing it, I haven't
 looked carefully at GapSamplingIterator (in RandomSampler.scala) myself,
 but that looks like it could be modified to be deterministic.

 Hopefully someone else can comment on aggregation in local mode.  I'm not
 sure how much effort has gone into optimizing for local mode.

 Joseph

 On Thu, Apr 2, 2015 at 11:33 AM, Ulanov, Alexander 
 alexander.ula...@hp.com
 wrote:

   Hi Joseph,
 
 
 
  Thank you for suggestion!
 
   It seems that instead of sample it is better to shuffle data and then
   access it sequentially by mini-batches. Could you suggest how to
   implement it?
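As a local, non-Spark sketch of the idea (shuffle once up front, then read deterministic sequential mini-batches instead of paying the cost of sample() on every iteration; the sizes are illustrative):

```python
import random

# One up-front shuffle; afterwards every mini-batch is a cheap sequential slice.
rng = random.Random(42)          # fixed seed -> deterministic epochs
data = list(range(60000))
rng.shuffle(data)

batch_size = 60
# Consecutive, non-overlapping mini-batches covering the whole dataset.
mini_batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
```

On a real RDD the shuffle would be a one-time repartition/sort followed by iterating over partitions, but the access pattern is the same.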
 
 
 
   With regards to aggregate (reduce), I am wondering why it works so slowly
   in local mode. Could you elaborate on this? I do understand that in
   cluster mode the network speed will kick in and then one can blame it.
 
 
 
  Best regards, Alexander
 
 
 
  *From:* Joseph Bradley [mailto:jos...@databricks.com]
  *Sent:* Thursday, April 02, 2015 10:51 AM
  *To:* Ulanov, Alexander
  *Cc:* dev@spark.apache.org
  *Subject:* Re: Stochastic gradient descent performance
 
 
 
   It looks like SPARK-3250 was applied to the sample() which GradientDescent
   uses, and that should kick in for your minibatchFraction = 0.4.  Based on
   your numbers, aggregation seems like the main issue, though I hesitate to
   optimize aggregation based on local tests for data sizes that small.
 
 
 
   The first thing I'd check for is unnecessary object creation; I'd also
   profile in a cluster or larger-data setting.
 
 
 
  On Wed, Apr 1, 2015 at 10:09 AM, Ulanov, Alexander 
  alexander.ula...@hp.com wrote:
 
   Sorry for bothering you again, but I think that this is an important issue
   for the applicability of SGD in Spark MLlib. Could Spark developers please
   comment on it?
 
 
  -Original Message-
  From: Ulanov, Alexander
  Sent: Monday, March 30, 2015 5:00 PM
  To: dev@spark.apache.org
  Subject: Stochastic gradient descent performance
 
  Hi,
 
   It seems to me that there is an overhead in the runMiniBatchSGD function of
   MLlib's GradientDescent. In particular, sample and treeAggregate might take
   time that is an order of magnitude greater than the actual gradient
   computation. For the mnist dataset of 60K instances with minibatch
   size = 0.001 (i.e. 60 samples), it takes 0.15 s to sample and 0.3 s to
   aggregate in local mode with 1 data partition on a Core i5 processor,
   while the actual gradient computation takes 0.002 s. I searched through
   the Spark JIRA and found that there was recently an update for more
   efficient sampling (SPARK-3250) that is already included in the Spark
   codebase. Is there a way to reduce the sampling time and the local
   treeAggregate by an order of magnitude?
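A tiny helper like the following (Python for illustration; not an MLlib API) is enough to break the per-iteration cost into its sample / aggregate / gradient parts:

```python
import time

def time_it(fn, *args, **kwargs):
    """Return (result, elapsed seconds) for one call -- handy for timing
    each phase of an SGD iteration separately."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0

# Hypothetical usage around the phases discussed above, e.g.:
#   batch, t_sample = time_it(rdd.sample, False, mini_batch_fraction)
```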
 
  Best regards, Alexander
 

Re: Wrong initial bias in GraphX SVDPlusPlus?

2015-04-05 Thread Reynold Xin
Adding Jianping Wang to the thread, since he contributed the SVDPlusPlus
implementation.

Jianping,

Can you take a look at this message? Thanks.


On Fri, Apr 3, 2015 at 8:41 AM, Michael Malak 
michaelma...@yahoo.com.invalid wrote:

 I believe that in the initialization portion of GraphX SVDPlusPlus, the
 initialization of biases is incorrect. Specifically, in line
 https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala#L96
 instead of
   (vd._1, vd._2, msg.get._2 / msg.get._1, 1.0 / scala.math.sqrt(msg.get._1))
 it should be
   (vd._1, vd._2, msg.get._2 / msg.get._1 - u, 1.0 / scala.math.sqrt(msg.get._1))

 That is, the biases bu and bi (both represented as the third component of
 the Tuple4[] above, depending on whether the vertex is a user or an item),
 described in equation (1) of the Koren paper, are supposed to be small
 offsets to the mean (represented by the variable u, signifying the Greek
 letter mu) to account for peculiarities of individual users and items.
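To make the difference concrete, here is a tiny numeric sketch (Python for illustration; the values are made up, and `u` stands for the global mean mu, as in the message above):

```python
# Illustrative numbers only -- not taken from the actual SVDPlusPlus code path.
u = 3.6                      # global mean rating (mu in Koren's paper)
msg_sum = 18.0               # sum of ratings seen by this vertex
msg_cnt = 5.0                # number of ratings seen by this vertex

bias_current = msg_sum / msg_cnt       # vertex's own mean rating (too large)
bias_proposed = msg_sum / msg_cnt - u  # small offset from the global mean
```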

 Initializing these biases to wrong values should theoretically not matter
 given enough iterations of the algorithm, but some quick empirical testing
 shows that it then has trouble converging at all, even with orders of
 magnitude more iterations.

 This perhaps could be the source of previously reported trouble with
 SVDPlusPlus.

 http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-SVDPlusPlus-problem-td12885.html

 If after a day, no one tells me I'm crazy here, I'll go ahead and create a
 Jira ticket.





Re: [VOTE] Release Apache Spark 1.3.1

2015-04-05 Thread Denny Lee
+1 (non-binding)  Verified various DataFrame functions, Hive integration,
MLlib, etc. on OSX.

On Sun, Apr 5, 2015 at 9:16 PM Xiangrui Meng men...@gmail.com wrote:

 +1 Verified some MLlib bug fixes on OS X. -Xiangrui





Re: [VOTE] Release Apache Spark 1.3.1

2015-04-05 Thread Sean Owen
Signatures and hashes are good.
LICENSE, NOTICE still check out.
Compiles for a Hadoop 2.6 + YARN + Hive profile.

I still see the UISeleniumSuite test failure observed in 1.3.0, which
is minor and already fixed. I don't know why I didn't back-port it:
https://issues.apache.org/jira/browse/SPARK-6205

If we roll another, let's get this easy fix in, but it is only an
issue with tests.


On JIRA, I checked open issues with Fix Version = 1.3.0 or 1.3.1 and
all look legitimate (e.g. reopened or in progress).


There is 1 open Blocker for 1.3.1 per Andrew:
https://issues.apache.org/jira/browse/SPARK-6673 spark-shell.cmd can't
start even when spark was built in Windows

I believe this can be resolved quickly but as a matter of hygiene
should be fixed or demoted before release.


FYI there are 16 Critical issues marked for 1.3.0 / 1.3.1; worth
examining before release to see how critical they are:

SPARK-6701,Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python
application,,Open,4/3/15
SPARK-6484,Ganglia metrics xml reporter doesn't escape
correctly,Josh Rosen,Open,3/24/15
SPARK-6270,Standalone Master hangs when streaming job completes,,Open,3/11/15
SPARK-6209,ExecutorClassLoader can leak connections after failing to
load classes from the REPL class server,Josh Rosen,In Progress,4/2/15
SPARK-5113,Audit and document use of hostnames and IP addresses in
Spark,,Open,3/24/15
SPARK-5098,Number of running tasks become negative after tasks
lost,,Open,1/14/15
SPARK-4925,Publish Spark SQL hive-thriftserver maven artifact,Patrick
Wendell,Reopened,3/23/15
SPARK-4922,Support dynamic allocation for coarse-grained Mesos,,Open,3/31/15
SPARK-4888,Spark EC2 doesn't mount local disks for i2.8xlarge
instances,,Open,1/27/15
SPARK-4879,Missing output partitions after job completes with
speculative execution,Josh Rosen,Open,3/5/15
SPARK-4751,Support dynamic allocation for standalone mode,Andrew
Or,Open,12/22/14
SPARK-4454,Race condition in DAGScheduler,Josh Rosen,Reopened,2/18/15
SPARK-4452,Shuffle data structures can starve others on the same
thread for memory,Tianshuo Deng,Open,1/24/15
SPARK-4352,Incorporate locality preferences in dynamic allocation
requests,,Open,1/26/15
SPARK-4227,Document external shuffle service,,Open,3/23/15
SPARK-3650,Triangle Count handles reverse edges incorrectly,,Open,2/23/15

On Sun, Apr 5, 2015 at 1:09 AM, Patrick Wendell pwend...@gmail.com wrote:
 Please vote on releasing the following candidate as Apache Spark version 
 1.3.1!

 The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851

 The list of fixes present in this release can be found at:
 http://bit.ly/1C2nVPY

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1080

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.3.1!

 The vote is open until Wednesday, April 08, at 01:10 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.1
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 - Patrick






Re: Github auth problems = some test results not posting

2015-04-05 Thread Josh Rosen
Thanks for catching this.  It looks like a recent Jenkins job configuration
change inadvertently renamed the GITHUB_OAUTH_KEY environment variable to
something else, causing this to break.  I've rolled back that change, so
hopefully the GitHub posting should start working again.

- Josh

On Sun, Apr 5, 2015 at 6:40 AM, Sean Owen so...@cloudera.com wrote:

 I noticed recent pull request build results weren't posting results of
 MiMa checks, etc.

 I think it's due to Github auth issues:

 Attempting to post to Github...
   http_code: 401.
    api_response: {
      "message": "Bad credentials",
      "documentation_url": "https://developer.github.com/v3"
    }

 I've heard another colleague say they're having trouble with
 credentials today. Anyone else?

 I don't know if it's transient or what, but for today, just be aware
 you'll have to look at the end of the Jenkins output to see if these
 other checks passed.
