Re: [Wiki-research-l] wikistream: displays wikipedia updates in realtime

2011-08-17 Thread Federico Leva (Nemo)
Ed Summers, 22/06/2011 12:14:
> On Wed, Jun 22, 2011 at 2:25 AM, Dmitry Chichkov wrote:
>> You may want to take a look at wpcvn.com - it also displays a realtime stream (filtered)...
> Oh wow, maybe I can shut mine off now :-)
Looks like the opposite happened. Nemo

Re: [Wiki-research-l] wikistream: displays wikipedia updates in realtime

2011-08-17 Thread Dmitry Chichkov
Just verified, it is back up. And actual changes are also coming through [filtered by negative user ratings (calculated using some pretty old Wikipedia dump)]. -- Best, Dmitry

On Wed, Aug 17, 2011 at 2:33 AM, Dmitry Chichkov <dchich...@gmail.com> wrote:
> Hmm... Somebody actually visited the site.

[Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Diederik van Liere
Hello! Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and Fabian Kaelin (who are all Summer of Research fellows)[0] have worked hard on a customized stream-based InputFormatReader that allows parsing of both bz2-compressed and uncompressed files of the full Wikipedia dump…
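For readers unfamiliar with Hadoop Streaming, here is a minimal sketch of what a Python mapper consuming page/revision XML on stdin might look like. The record layout the InputFormat actually hands to the mapper, and the bare tag matching used here, are assumptions for illustration, not Wikihadoop's documented interface.

#!/usr/bin/env python
# Hadoop Streaming mapper sketch (assumed input layout): reads dump XML
# fragments from stdin and emits "page_title <TAB> revision_count" pairs.
import sys

current_title = None
revisions = 0

for line in sys.stdin:
    line = line.strip()
    if line.startswith("<title>") and line.endswith("</title>"):
        # Flush the count for the previous page before starting a new one.
        if current_title is not None:
            print("%s\t%d" % (current_title, revisions))
        current_title = line[len("<title>"):-len("</title>")]
        revisions = 0
    elif line.startswith("<revision>"):
        revisions += 1

if current_title is not None:
    print("%s\t%d" % (current_title, revisions))

Such a mapper would be handed to Hadoop together with the streaming jar and the Wikihadoop input format; the exact invocation depends on the project's documentation.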

Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Dmitry Chichkov
Hello, this is excellent news! Have you tried running it on Amazon EC2? It would be really nice to know how well WikiHadoop scales up with the number of nodes. Also, about this timing - '3 x Quad Core / 14 days / full Wikipedia dump' - on what kind of task (XML parsing, diffs, MD5, etc.?) was it…

Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Aaron Halfaker
We haven't tried EC2. Since this use case was really strange for Hadoop (if we break the dump up by pages, some are 160GB!!), we have been rolling our own code and testing on our own machines so that we had the flexibility to get things working. Theoretically, one should be able to use this on EC2…

Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Diederik van Liere
So the 14-day task included XML parsing and creating diffs. We might gain performance improvements by fine-tuning the Hadoop configuration, although that seems to be more of an art than a science. Diederik

On Wed, Aug 17, 2011 at 5:28 PM, Dmitry Chichkov <dchich...@gmail.com> wrote:
> Hello, This…
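As a rough illustration of the diff step, consecutive revision texts can be compared with Python's standard difflib; this is only a sketch of the general idea, not the code the fellows actually used.

import difflib

def revision_diff(old_text, new_text):
    # Return a unified diff between two revision texts (illustrative only).
    return "\n".join(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        lineterm="",
    ))

print(revision_diff("Hello world.", "Hello, Wikipedia world."))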

Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Dmitry Chichkov
Perhaps fine-tuning it for EC2, maybe even hosting the dataset there? I can see how this could be very useful! Otherwise... well... It seems like Hadoop gives you a lot of overhead, and it is just not practical to do parsing this way. With a straightforward implementation in Python, on a single…
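For comparison, a minimal single-machine sketch of streaming a bz2-compressed dump with only the Python standard library might look like this (the file name is a placeholder, and the code illustrates the approach alluded to above rather than reproducing Dmitry's actual implementation):

import bz2
import xml.etree.ElementTree as ET

# Placeholder path; any pages-articles or pages-meta-history dump would do.
DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"

pages = 0
revisions = 0

with bz2.BZ2File(DUMP_PATH) as dump:
    # iterparse streams the XML, so memory stays bounded as long as
    # elements are cleared once they have been counted.
    for _event, elem in ET.iterparse(dump, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip the MediaWiki export namespace
        if tag == "revision":
            revisions += 1
            elem.clear()
        elif tag == "page":
            pages += 1
            elem.clear()

print("pages: %d, revisions: %d" % (pages, revisions))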