I don't understand Gianmarco's argument. Are you claiming that people only use Giraph on graphs with more vertices than Integer.MAX_VALUE?
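For reference, the sizes mentioned in this thread can be checked against Integer.MAX_VALUE in a few lines of Java. This is only a rough sketch: the class and method names are made up for illustration and are not part of Giraph, the two smaller counts are the rounded figures quoted below ("115M vertices", "a billion users"), and the third count is a hypothetical larger graph.

```java
// Illustrative only: IdRangeCheck and fitsInInt are hypothetical names, not Giraph API.
class IdRangeCheck {
    // Conservative check: true if ids 1..count can all be represented as a Java int.
    static boolean fitsInInt(long count) {
        return count <= Integer.MAX_VALUE; // Integer.MAX_VALUE == 2^31 - 1
    }

    public static void main(String[] args) {
        long webbase2001Vertices = 115_000_000L; // "115M vertices" from the thread
        long billionUsers = 1_000_000_000L;      // "a billion users" from the thread
        long largerGraph = 3_000_000_000L;       // hypothetical graph beyond the int range

        System.out.println("Webbase2001 fits in int: " + fitsInInt(webbase2001Vertices));
        System.out.println("1B users fits in int:    " + fitsInInt(billionUsers));
        System.out.println("3B vertices fits in int: " + fitsInInt(largerGraph));
    }
}
```

Both datasets named in the thread stay below the int limit, which is the point of the question above.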
On Mon, Apr 15, 2013 at 12:28 AM, Avery Ching <[email protected]> wrote:

> I generally agree and can understand that this is typically true, but
> many other benchmarks are doing this to show off performance. Also, if you
> have the FB graph of a billion users, it could theoretically fit into a
> 32-bit integer.
>
> Avery
>
> On 4/14/13 2:41 PM, Gianmarco De Francisci Morales wrote:
>
>> Hi,
>>
>> only one quick comment on optimizations and using ints as ids.
>> In my opinion, if you can use an int as an id for your dataset, probably
>> you don't need Giraph for your problem.
>> Just my 2c
>>
>> Cheers,
>>
>> --
>> Gianmarco
>>
>> On Sun, Apr 14, 2013 at 11:26 PM, Sebastian Schelter <[email protected]> wrote:
>>
>>> Thank you, Avery, wish I had found the bug earlier.
>>>
>>> On 14.04.2013 23:25, "Avery Ching" <[email protected]> wrote:
>>>
>>>> Thanks for your input Sebastian. Given the choice between removing
>>>> PageRankVertex and adding the fix, I've added your fix and will cut RC2 a
>>>> bit later today. I really hope this is the last RC.
>>>>
>>>> Avery
>>>>
>>>> On 4/14/13 9:34 AM, Sebastian Schelter wrote:
>>>>
>>>>> Hi Avery,
>>>>>
>>>>> I see your concerns. The benchmarking question is difficult, we had very
>>>>> bad experiences with Mahout in that regard. E.g., we once had a
>>>>> M/R-based PageRank implementation in Mahout that used our integer-based
>>>>> vectors and removed it as we got public complaints that you can't fit
>>>>> the whole web into the range of an integer. Personally, I'd also refrain
>>>>> from using floats instead of doubles for benchmarks, as this simply
>>>>> means you give up on accuracy.
>>>>>
>>>>> Regarding benchmarks, I guess the best thing we could do is publish our
>>>>> own numbers. The current runtimes I've seen are already very good,
>>>>> Giraph beat a very optimized Stratosphere implementation that we did for
>>>>> a recent paper by approx. 25%.
>>>>>
>>>>> To conclude, I do in no way want to hold up the current release. I'm
>>>>> perfectly fine with not including the patch and optimizing the
>>>>> implementation for a 1.0.1 release, but then we should remove the
>>>>> current examples.PageRankVertex from the 1.0 release, as the convergence
>>>>> detection is broken and we should not knowingly ship bugged code.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>> On 14.04.2013 18:18, Avery Ching wrote:
>>>>>
>>>>>> Hi Sebastian,
>>>>>>
>>>>>> Thanks for the patch. I'll try to take a look at it.
>>>>>>
>>>>>> The only reason I bring the optimizations up is that a lot of folks tend
>>>>>> to compare PageRank performance. The optimizations I'm referring to are
>>>>>> Giraph ones, not algorithmic ones. We use ints, floats for ids,
>>>>>> messages, respectively, instead of longs, doubles (1/2 network traffic)
>>>>>> and IntNullArrayEdges vertex edges (efficient array backed edges) instead
>>>>>> of ByteArrayEdges. You can see
>>>>>> https://issues.apache.org/jira/browse/giraph-543
>>>>>> for more details.
>>>>>>
>>>>>> Anyway, given that we are going to ship a 1.0.1 release in a few weeks
>>>>>> for a variety of reasons, should this really hold up the current
>>>>>> release? I would prefer to not cut any more RCs unless things are
>>>>>> totally broken (i.e. profiles not compiling, major Giraph bugs, etc.).
>>>>>> There are still a lot of outstanding issues in JIRA, we can't fix them
>>>>>> all for the 1.0 release.
>>>>>>
>>>>>> Let me know what you think.
>>>>>>
>>>>>> Avery
>>>>>>
>>>>>> On 4/13/13 10:46 AM, Sebastian Schelter wrote:
>>>>>>
>>>>>>> Hi Avery,
>>>>>>>
>>>>>>> I found the bug and I can provide a patch today or tomorrow, so
>>>>>>> hopefully we can include that in the release (to not knowingly ship
>>>>>>> bugged code). Furthermore I improved the code to protect against
>>>>>>> rounding errors.
>>>>>>>
>>>>>>> I don't really get what you mean by the missing optimization in
>>>>>>> comparison to the benchmark PageRank implementation.
>>>>>>>
>>>>>>> The implementation in o.a.g.examples.PageRankVertex aims to be a robust
>>>>>>> real-world implementation. As an optimization, it dismisses edge weights
>>>>>>> and reuses objects where possible. Furthermore it is able to handle
>>>>>>> dangling vertices that are present in almost every real-world network
>>>>>>> and it automatically detects the number of supersteps to run. With the
>>>>>>> patch, it should also provide improved numerical stability.
>>>>>>>
>>>>>>> If the runtimes don't look good enough when compared to the benchmark
>>>>>>> implementation, this might also be caused by the dataset, which has a
>>>>>>> skewed degree distribution (like most real-world networks). The
>>>>>>> benchmark uses a uniform degree distribution AFAIK.
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>> On 13.04.2013 15:46, Avery Ching wrote:
>>>>>>>
>>>>>>>> That's great Sebastian. I would also recommend taking a look at the
>>>>>>>> PageRankBenchmark for a performance comparison. It has had a lot of
>>>>>>>> speed improvements and should be a bunch faster than PageRankVertex.
>>>>>>>> Even that, though, is not totally optimized. Hopefully we'll be adding
>>>>>>>> a "how to optimize performance" guide in the near future. Should we
>>>>>>>> delay the release or simply just ship a 1.1, say in the next month, with
>>>>>>>> this fix and supporting YARN's 2.0.4?
>>>>>>>> I'd like to get on a more normal
>>>>>>>> release cycle rather than once a year =).
>>>>>>>>
>>>>>>>> Avery
>>>>>>>>
>>>>>>>> On 4/13/13 3:02 AM, Sebastian Schelter wrote:
>>>>>>>>
>>>>>>>>> Hi there,
>>>>>>>>>
>>>>>>>>> I got some good and bad news, I tested PageRankVertex (not the Benchmark
>>>>>>>>> but the example implementation o.a.g.examples.PageRankVertex) from trunk
>>>>>>>>> compiled for Hadoop 1.0 on a cluster of 26 machines with 208 cores.
>>>>>>>>>
>>>>>>>>> I used the Webbase2001 dataset [1] which has 115M vertices and more than
>>>>>>>>> 1B edges and got some awesome running times, average superstep takes 15
>>>>>>>>> seconds (!!!). Awesome work, I have to say!
>>>>>>>>>
>>>>>>>>> Unfortunately, there seems to be an issue with the convergence
>>>>>>>>> detection, as it didn't get the correct convergence behavior. I'd like
>>>>>>>>> to have a look into that this week, so we can ship a performant PageRank
>>>>>>>>> implementation which automatically runs an appropriate number of
>>>>>>>>> supersteps. Hope this doesn't delay the release too much.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>> [1] http://law.di.unimi.it/webdata/webbase-2001/
>>>>>>>>>
>>>>>>>>> On 13.04.2013 07:39, Avery Ching wrote:
>>>>>>>>>
>>>>>>>>>> Thanks to the quick feedback from Roman and Lewis, we have cut a new RC1
>>>>>>>>>> that addresses the following issues.
>>>>>>>>>>
>>>>>>>>>> * Got rid of .git repo in tarball
>>>>>>>>>> * Fixed issue with not compiling without git repo (GIRAPH-628)
>>>>>>>>>> * Used gnutar in OSX rather than tar to generate the tarball and get rid
>>>>>>>>>>   of warnings
>>>>>>>>>> * Pushed GIRAPH-627 to support the yarn profile better
>>>>>>>>>> * Tarball name changed to the final artifact name (giraph-1.0.tar.gz)
>>>>>>>>>>
>>>>>>>>>> Release notes:
>>>>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC1/RELEASE_NOTES.html
>>>>>>>>>>
>>>>>>>>>> Release artifacts:
>>>>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC1/
>>>>>>>>>>
>>>>>>>>>> Corresponding git tag:
>>>>>>>>>> https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=shortlog;h=refs/tags/release-1.0-RC1
>>>>>>>>>>
>>>>>>>>>> Signing keys:
>>>>>>>>>> http://people.apache.org/keys/group/giraph.asc
>>>>>>>>>>
>>>>>>>>>> The vote runs for 72 hours, until Monday 11pm PST.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Avery
>>>>>>>>>>
>>>>>>>>>> Original message below regarding rc0:
>>>>>>>>>>
>>>>>>>>>> -------------------------------
>>>>>>>>>>
>>>>>>>>>> Fellow Giraphers,
>>>>>>>>>>
>>>>>>>>>> We have our first release candidate since graduating from incubation.
>>>>>>>>>> This is a source release, primarily due to the different versions of
>>>>>>>>>> Hadoop we support with munge (similar to the 0.1 release). Since 0.1,
>>>>>>>>>> we've made A TON of progress on overall performance, optimizing memory
>>>>>>>>>> use, split vertex/edge inputs, easy interoperability with Apache Hive,
>>>>>>>>>> and a bunch of other areas. In many ways, this is an almost totally
>>>>>>>>>> different codebase. Thanks everyone for your hard work!
>>>>>>>>>>
>>>>>>>>>> Apache Giraph has been running in production at Facebook (against
>>>>>>>>>> Facebook's Corona implementation of Hadoop -
>>>>>>>>>> https://github.com/facebook/hadoop-20/tree/master/src/contrib/corona)
>>>>>>>>>> since around last December. It has proven to be very scalable,
>>>>>>>>>> performant, and enables a bunch of new applications. Based on the
>>>>>>>>>> drastic improvements and the use of Giraph in production, it seems
>>>>>>>>>> appropriate to bump up our version to 1.0.
>>>>>>>>>>
>>>>>>>>>> While anyone can vote, the ASF requires majority approval from the PMC
>>>>>>>>>> -- i.e., at least three PMC members must vote affirmatively for release,
>>>>>>>>>> and there must be more positive than negative votes. Releases may not be
>>>>>>>>>> vetoed.
>>>>>>>>>> Before voting +1 PMC members are required to download the signed
>>>>>>>>>> source code package, compile it as provided, and test the resulting
>>>>>>>>>> executable on their own platform, along with also verifying that the
>>>>>>>>>> package meets the requirements of the ASF policy on releases.
>>>>>>>>>>
>>>>>>>>>> Please test this against many other Hadoop versions and let us know how
>>>>>>>>>> this goes!
>>>>>>>>>>
>>>>>>>>>> Release notes:
>>>>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC0/RELEASE_NOTES.html
>>>>>>>>>>
>>>>>>>>>> Release artifacts:
>>>>>>>>>> http://people.apache.org/~aching/giraph-1.0-RC0/
>>>>>>>>>>
>>>>>>>>>> Corresponding git tag:
>>>>>>>>>> https://git-wip-us.apache.org/repos/asf?p=giraph.git;a=shortlog;h=refs/tags/release-1.0-RC0
>>>>>>>>>>
>>>>>>>>>> Signing keys:
>>>>>>>>>> http://people.apache.org/keys/group/giraph.asc
>>>>>>>>>>
>>>>>>>>>> The vote runs for 72 hours, until Monday 4pm PST.
>>>>>>>>>>
>>>>>>>>>> Thanks everyone for your patience with this release!
>>>>>>>>>>
>>>>>>>>>> Avery

--
Claudio Martella
[email protected]
