Ed Summers, 22/06/2011 12:14:
On Wed, Jun 22, 2011 at 2:25 AM, Dmitry Chichkov wrote:
You may want to take a look at wpcvn.com - it also displays a realtime
(filtered) stream...
Oh wow, maybe I can shut mine off now :-)
Looks like the opposite happened.
Nemo
Just verified, it is back up. And actual changes are also coming through
[filtered by negative user ratings (calculated using some pretty old
Wikipedia dump)].
-- Best, Dmitry
On Wed, Aug 17, 2011 at 2:33 AM, Dmitry Chichkov <dchich...@gmail.com> wrote:
Hmm... Somebody actually visited the site.
Hello!
Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and
Fabian Kaelin (who are all Summer of Research fellows)[0] have worked hard
on a customized stream-based InputFormatReader that allows parsing of both
bz2-compressed and uncompressed files of the full Wikipedia dump.
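For readers who have not worked with the dump format: the point of a stream-based reader is that even a multi-hundred-GB bz2 dump can be parsed without ever holding more than one element in memory. A minimal single-process sketch using only the Python standard library (WikiHadoop's own InputFormatReader is Java and its API is not shown in this thread):

```python
import bz2
import io
import xml.etree.ElementTree as ET

def iter_pages(stream):
    """Yield (title, [revision texts]) from a MediaWiki XML export stream.

    Finished elements are cleared as we go, so memory stays bounded
    even when a single page has hundreds of thousands of revisions.
    """
    title, revisions = None, []
    for event, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # strip XML namespace if present
        if tag == "title":
            title = elem.text
        elif tag == "text":
            revisions.append(elem.text or "")
        elif tag == "page":
            yield title, revisions
            title, revisions = None, []
            elem.clear()  # free memory for the finished page

# Tiny synthetic dump standing in for a real pages-meta-history file.
sample = b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><text>first version</text></revision>
    <revision><text>second version</text></revision>
  </page>
</mediawiki>"""

with bz2.open(io.BytesIO(bz2.compress(sample))) as f:
    pages = list(iter_pages(f))
```

In a real run you would pass a file path to `bz2.open` instead of the in-memory sample; the streaming logic is the same.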
Hello,
This is excellent news!
Have you tried running it on Amazon EC2? It would be really nice to know how
well WikiHadoop scales up with the number of nodes.
Also, regarding the timing of '3 x Quad Core / 14 days / full Wikipedia dump':
what kind of task was it (XML parsing, diffs, MD5, etc.)?
We haven't tried EC2. Since this use-case was really strange for Hadoop (if
we break the dump up by pages, some are 160GB!!) we have been rolling our
own code and testing on our own machines, so we had the flexibility to get
things working. Theoretically, one should be able to use this on EC2.
So the 14-day task included XML parsing and creating diffs. We might gain
performance improvements by fine-tuning the Hadoop configuration, although
that seems to be more of an art than a science.
Diederik
On Wed, Aug 17, 2011 at 5:28 PM, Dmitry Chichkov <dchich...@gmail.com> wrote:
Hello,
Perhaps fine-tuning it for EC2, maybe even hosting the dataset there? I can
see how that would be very useful! Otherwise... well... it seems like Hadoop
gives you a lot of overhead, and it is just not practical to do the parsing
this way.
With a straightforward implementation in Python, on a single
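The single-machine Python approach Dmitry alludes to might look something like the following sketch, which streams a bz2-compressed dump and computes an MD5 digest per revision text (one of the task types mentioned earlier in the thread). This is a hypothetical illustration, not his actual code:

```python
import bz2
import hashlib
import io
import xml.etree.ElementTree as ET

def md5_revisions(source):
    """Stream a bz2-compressed MediaWiki dump, yielding one MD5 per revision.

    Accepts a file path or a file object. Runs in constant memory,
    which is what makes a single-machine pass over the dump feasible.
    """
    with bz2.open(source) as f:
        for event, elem in ET.iterparse(f, events=("end",)):
            if elem.tag.rsplit("}", 1)[-1] == "text":
                text = (elem.text or "").encode("utf-8")
                yield hashlib.md5(text).hexdigest()
                elem.clear()  # drop the parsed element to bound memory

# Demo on a tiny synthetic dump (a real run would pass a file path).
sample = b"<mediawiki><page><revision><text>abc</text></revision></page></mediawiki>"
digests = list(md5_revisions(io.BytesIO(bz2.compress(sample))))
```

Whether this beats Hadoop depends on the task: for a pure streaming pass it avoids all of Hadoop's job-scheduling overhead, but it cannot parallelize across machines the way WikiHadoop can.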