Re: [VOTE] Release Apache Spark 1.3.1
+1

Verified some MLlib bug fixes on OS X.

-Xiangrui

On Sun, Apr 5, 2015 at 1:24 AM, Sean Owen <so...@cloudera.com> wrote:

Signatures and hashes are good. LICENSE, NOTICE still check out. Compiles for
a Hadoop 2.6 + YARN + Hive profile.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: Stochastic gradient descent performance
Yeah, a simple way to estimate the time for an iterative algorithm is: number
of iterations required * time per iteration. The time per iteration will
depend on the batch size, the computation required, and the fixed overheads I
mentioned before. The number of iterations of course depends on the
convergence rate for the problem being solved.

Thanks
Shivaram

On Thu, Apr 2, 2015 at 2:19 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Hi Shivaram,

It sounds really interesting! With this estimate we can decide whether it is
worth running an iterative algorithm on Spark. For example, for SGD on
ImageNet (450K samples) we will spend 450K * 500ms = 62.5 hours to traverse
all the data one example at a time, not counting the data loading,
computation, and update times. One may need to traverse all the data a number
of times to converge. Let's say this number is equal to the batch size. So we
remain with 62.5 hours of overhead. Is that reasonable?

Best regards, Alexander

*From:* Shivaram Venkataraman [mailto:shiva...@eecs.berkeley.edu]
*Sent:* Thursday, April 02, 2015 1:26 PM
*To:* Joseph Bradley
*Cc:* Ulanov, Alexander; dev@spark.apache.org
*Subject:* Re: Stochastic gradient descent performance

I haven't looked closely at the sampling issues, but regarding the
aggregation latency, there are fixed overheads (in local and distributed
mode) with the way aggregation is done in Spark. Launching a stage of tasks,
fetching outputs from the previous stage, etc. all have overhead, so I would
say it's not efficient / recommended to run stages where the computation is
less than 500 ms or so. You could increase your batch size based on this, and
hopefully that will help. Regarding reducing these overheads by an order of
magnitude: that is a challenging problem given the architecture of Spark -- I
have some ideas for this, but they are very much at a research stage.
Thanks
Shivaram

On Thu, Apr 2, 2015 at 12:00 PM, Joseph Bradley <jos...@databricks.com> wrote:

When you say "It seems that instead of sample it is better to shuffle data
and then access it sequentially by mini-batches", are you sure that holds
true for a big dataset in a cluster? As far as implementing it, I haven't
looked carefully at GapSamplingIterator (in RandomSampler.scala) myself, but
that looks like it could be modified to be deterministic. Hopefully someone
else can comment on aggregation in local mode. I'm not sure how much effort
has gone into optimizing for local mode.

Joseph

On Thu, Apr 2, 2015 at 11:33 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Hi Joseph,

Thank you for the suggestion! It seems that instead of sample it is better
to shuffle the data and then access it sequentially by mini-batches. Could
you suggest how to implement that? With regard to aggregate (reduce), I am
wondering why it works so slowly in local mode. Could you elaborate on this?
I do understand that in cluster mode the network speed will kick in and then
one can blame it.

Best regards, Alexander

*From:* Joseph Bradley [mailto:jos...@databricks.com]
*Sent:* Thursday, April 02, 2015 10:51 AM
*To:* Ulanov, Alexander
*Cc:* dev@spark.apache.org
*Subject:* Re: Stochastic gradient descent performance

It looks like SPARK-3250 was applied to the sample() which GradientDescent
uses, and that should kick in for your minibatchFraction = 0.4. Based on
your numbers, aggregation seems like the main issue, though I hesitate to
optimize aggregation based on local tests for data sizes that small. The
first thing I'd check for is unnecessary object creation, and to profile in
a cluster or larger data setting.

On Wed, Apr 1, 2015 at 10:09 AM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:

Sorry for bothering you again, but I think that this is an important issue
for the applicability of SGD in Spark MLlib. Could Spark developers please
comment on it?
-----Original Message-----
From: Ulanov, Alexander
Sent: Monday, March 30, 2015 5:00 PM
To: dev@spark.apache.org
Subject: Stochastic gradient descent performance

Hi,

It seems to me that there is an overhead in the runMiniBatchSGD function of
MLlib's GradientDescent. In particular, sample and treeAggregate might take
time that is an order of magnitude greater than the actual gradient
computation. For the mnist dataset of 60K instances, with minibatch size =
0.001 (i.e. 60 samples), it takes 0.15 s to sample and 0.3 s to aggregate in
local mode with 1 data partition on a Core i5 processor. The actual gradient
computation takes 0.002 s. I searched through the Spark JIRA and found that
there was recently an update for more efficient sampling (SPARK-3250) that
is already included in the Spark codebase. Is there a way to reduce the
sampling time and the local treeAggregate by an order of magnitude?

Best regards, Alexander
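The back-of-envelope reasoning in this thread (fixed per-stage overhead dominating tiny minibatches) can be put into a short sketch. All numbers are the illustrative figures quoted above, not new measurements, and the helper function name is hypothetical:

```python
# Rough cost model for minibatch SGD on Spark, a sketch based on the figures
# quoted in this thread (60K-instance mnist, minibatches of 60 examples,
# ~0.45 s of sample + aggregate overhead per stage, ~0.002 s of gradient
# computation per batch).

def sgd_wall_clock_s(num_samples, batch_size, batch_compute_s,
                     fixed_overhead_s, epochs=1):
    """Estimate total time as iterations * (batch compute + fixed overhead)."""
    iterations = epochs * num_samples // batch_size
    return iterations * (batch_compute_s + fixed_overhead_s)

# One epoch over 60K samples in batches of 60 is 1000 stages.
with_overhead = sgd_wall_clock_s(60_000, 60, 0.002, 0.45)  # overhead-dominated
compute_only = sgd_wall_clock_s(60_000, 60, 0.002, 0.0)    # gradient work alone
print(f"with overhead: {with_overhead:.0f} s, compute only: {compute_only:.0f} s")

# Alexander's ImageNet estimate: 450K single-example stages at ~500 ms of
# fixed overhead each comes to 62.5 hours of pure overhead.
imagenet_overhead_h = 450_000 * 0.5 / 3600
print(f"ImageNet overhead: {imagenet_overhead_h} hours")
```

The ratio (roughly 452 s versus 2 s per epoch) is why Shivaram suggests growing the batch until per-stage computation exceeds the ~500 ms overhead floor.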
Re: Wrong initial bias in GraphX SVDPlusPlus?
Adding Jianping Wang to the thread, since he contributed the SVDPlusPlus
implementation. Jianping, can you take a look at this message? Thanks.

On Fri, Apr 3, 2015 at 8:41 AM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:

I believe that in the initialization portion of GraphX SVDPlusPlus, the
initialization of the biases is incorrect. Specifically, in line
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/SVDPlusPlus.scala#L96

instead of

(vd._1, vd._2, msg.get._2 / msg.get._1, 1.0 / scala.math.sqrt(msg.get._1))

it should be

(vd._1, vd._2, msg.get._2 / msg.get._1 - u, 1.0 / scala.math.sqrt(msg.get._1))

That is, the biases bu and bi (both represented as the third component of the
Tuple4[] above, depending on whether the vertex is a user or an item),
described in equation (1) of the Koren paper, are supposed to be small
offsets to the mean (represented by the variable u, signifying the Greek
letter mu) to account for the peculiarities of individual users and items.
Initializing these biases to the wrong values should theoretically not matter
given enough iterations of the algorithm, but some quick empirical testing
shows it has trouble converging at all, even after many orders of magnitude
more iterations. This could perhaps be the source of previously reported
trouble with SVDPlusPlus:
http://apache-spark-user-list.1001560.n3.nabble.com/GraphX-SVDPlusPlus-problem-td12885.html

If after a day no one tells me I'm crazy here, I'll go ahead and create a
JIRA ticket.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
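For intuition, the difference between the two initializations can be sketched in plain Python (this is not the GraphX code itself, and the helper names are hypothetical). In Koren's model the biases are offsets from the global mean mu, so initializing them to the raw vertex mean starts them a whole mu away from their intended scale:

```python
# Plain-Python sketch (not the GraphX implementation) contrasting the current
# bias initialization with the proposed one. In Koren's model
#   r_ui ~ mu + b_u + b_i + q_i . p_u,
# the biases b_u and b_i are meant to be small offsets from the global mean mu.

def bias_init_current(rating_sum, rating_count, mu):
    """Analogous to the current line: the raw mean rating at this vertex."""
    return rating_sum / rating_count

def bias_init_proposed(rating_sum, rating_count, mu):
    """Analogous to the proposed fix: the mean rating as an offset from mu."""
    return rating_sum / rating_count - mu

mu = 3.5                        # global mean rating (illustrative)
user_sum, user_count = 16.0, 4  # a user whose 4 ratings sum to 16 (mean 4.0)

b_current = bias_init_current(user_sum, user_count, mu)    # 4.0 -- the mean itself
b_proposed = bias_init_proposed(user_sum, user_count, mu)  # 0.5 -- offset from mu
```

With the current init, a typical bias starts near mu instead of near zero, which is consistent with the convergence trouble reported above.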
Re: [VOTE] Release Apache Spark 1.3.1
+1 (non-binding)

Verified various DataFrame functions, Hive integration, MLlib, etc. on OS X.

On Sun, Apr 5, 2015 at 9:16 PM, Xiangrui Meng <men...@gmail.com> wrote:

+1 Verified some MLlib bug fixes on OS X. -Xiangrui

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: [VOTE] Release Apache Spark 1.3.1
Signatures and hashes are good. LICENSE, NOTICE still check out. Compiles for
a Hadoop 2.6 + YARN + Hive profile.

I still see the UISeleniumSuite test failure observed in 1.3.0, which is
minor and already fixed. I don't know why I didn't back-port it:
https://issues.apache.org/jira/browse/SPARK-6205
If we roll another, let's get this easy fix in, but it is only an issue with
tests.

On JIRA, I checked open issues with Fix Version = 1.3.0 or 1.3.1 and all look
legitimate (e.g. reopened or in progress).

There is 1 open Blocker for 1.3.1 per Andrew:
https://issues.apache.org/jira/browse/SPARK-6673 "spark-shell.cmd can't start
even when spark was built in Windows"
I believe this can be resolved quickly, but as a matter of hygiene it should
be fixed or demoted before release.

FYI there are 16 Critical issues marked for 1.3.0 / 1.3.1; worth examining
before release to see how critical they are:

SPARK-6701, Flaky test: o.a.s.deploy.yarn.YarnClusterSuite Python application, unassigned, Open, 4/3/15
SPARK-6484, Ganglia metrics xml reporter doesn't escape correctly, Josh Rosen, Open, 3/24/15
SPARK-6270, Standalone Master hangs when streaming job completes, unassigned, Open, 3/11/15
SPARK-6209, ExecutorClassLoader can leak connections after failing to load classes from the REPL class server, Josh Rosen, In Progress, 4/2/15
SPARK-5113, Audit and document use of hostnames and IP addresses in Spark, unassigned, Open, 3/24/15
SPARK-5098, Number of running tasks become negative after tasks lost, unassigned, Open, 1/14/15
SPARK-4925, Publish Spark SQL hive-thriftserver maven artifact, Patrick Wendell, Reopened, 3/23/15
SPARK-4922, Support dynamic allocation for coarse-grained Mesos, unassigned, Open, 3/31/15
SPARK-4888, Spark EC2 doesn't mount local disks for i2.8xlarge instances, unassigned, Open, 1/27/15
SPARK-4879, Missing output partitions after job completes with speculative execution, Josh Rosen, Open, 3/5/15
SPARK-4751, Support dynamic allocation for standalone mode, Andrew Or, Open, 12/22/14
SPARK-4454, Race condition in DAGScheduler, Josh Rosen, Reopened, 2/18/15
SPARK-4452, Shuffle data structures can starve others on the same thread for memory, Tianshuo Deng, Open, 1/24/15
SPARK-4352, Incorporate locality preferences in dynamic allocation requests, unassigned, Open, 1/26/15
SPARK-4227, Document external shuffle service, unassigned, Open, 3/23/15
SPARK-3650, Triangle Count handles reverse edges incorrectly, unassigned, Open, 2/23/15

On Sun, Apr 5, 2015 at 1:09 AM, Patrick Wendell <pwend...@gmail.com> wrote:

Please vote on releasing the following candidate as Apache Spark version 1.3.1!

The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=0dcb5d9f31b713ed90bcec63ebc4e530cbb69851

The list of fixes present in this release can be found at:
http://bit.ly/1C2nVPY

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1080

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/

Please vote on releasing this package as Apache Spark 1.3.1!

The vote is open until Wednesday, April 08, at 01:10 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

- Patrick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: Github auth problems = some test results not posting
Thanks for catching this. It looks like a recent Jenkins job configuration
change inadvertently renamed the GITHUB_OAUTH_KEY environment variable to
something else, causing this to break. I've rolled back that change, so
hopefully the GitHub posting should start working again.

- Josh

On Sun, Apr 5, 2015 at 6:40 AM, Sean Owen <so...@cloudera.com> wrote:

I noticed recent pull request build results weren't posting the results of
MiMa checks, etc. I think it's due to GitHub auth issues:

Attempting to post to Github...
 > http_code: 401.
 > api_response: {"message": "Bad credentials",
 >   "documentation_url": "https://developer.github.com/v3"}

I've heard another colleague say they're having trouble with credentials
today. Anyone else? I don't know if it's transient or what, but for today,
just be aware that you'll have to look at the end of the Jenkins output to
see if these other checks passed.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org