Perhaps fine-tuning it for EC2, maybe even hosting the dataset there? I can
see how that could be very useful! Otherwise... well... it seems like Hadoop
gives you a lot of overhead, and it is just not practical to do parsing this
way.
With a straightforward implementation in Python, on a single Core2 [...]
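
For reference, a minimal sketch of that kind of straightforward single-machine
parse, using only Python's standard library (the dump path and the revision
counting are placeholders, not the actual benchmark code):

    import xml.etree.ElementTree as ET

    # Hypothetical path to an uncompressed pages-meta-history dump.
    DUMP = "enwiki-pages-meta-history.xml"

    def count_revisions(path):
        """Stream-parse the dump, counting <revision> elements."""
        count = 0
        for _event, elem in ET.iterparse(path):
            # Tags carry the MediaWiki export namespace, hence endswith().
            if elem.tag.endswith("revision"):
                count += 1
            elem.clear()  # keep memory bounded on a single machine
        return count

    if __name__ == "__main__":
        print(count_revisions(DUMP))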

So the 14-day task included XML parsing and creating diffs. We might gain
performance improvements by fine-tuning the Hadoop configuration, although
that seems to be more of an art than a science.
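
To give a sense of the diff step, a rough sketch of diffing two consecutive
revision texts with Python's difflib (the example texts are made up, and the
job's actual diff format may differ):

    import difflib

    def revision_diff(old_text, new_text):
        """Line-based unified diff between two consecutive revisions."""
        return "\n".join(difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            lineterm="",
        ))

    # Made-up example revisions.
    print(revision_diff("Hello world.\nFirst draft.",
                        "Hello world.\nSecond draft."))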
Diederik

On Wed, Aug 17, 2011 at 5:28 PM, Dmitry Chichkov wrote:
> Hello,
>
> This is excellent news! [...]

We haven't tried EC2. Since this use case was really strange for Hadoop (if
we break the dump up by pages, some pages are > 160 GB!), we have been rolling
our own code and testing on our own machines, so we had the flexibility to get
things working. Theoretically, one should be able to use this on EC2 [...]
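
As a sketch of what a Streaming job over the dump might look like from the
mapper side, here is a minimal Python mapper. It assumes the input format
hands each mapper chunks of page/revision XML on stdin; that record layout is
an assumption for illustration, not a documented contract:

    import re
    import sys

    # Assumed record layout: the InputFormat streams page XML to the
    # mapper on stdin; here we just count <revision> open tags.
    REV_TAG = re.compile(r"<revision[ >]")

    def main():
        count = 0
        for line in sys.stdin:
            count += len(REV_TAG.findall(line))
        # Hadoop Streaming convention: tab-separated key/value on stdout.
        sys.stdout.write("revisions\t%d\n" % count)

    if __name__ == "__main__":
        main()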

Hello,

This is excellent news!

Have you tried running it on Amazon EC2? It would be really nice to know how
well WikiHadoop scales up with the number of nodes.

Also, this timing, "3 x Quad Core / 14 days / full Wikipedia dump": on what
kind of task (XML parsing, diffs, MD5, etc.?) was it obtained?

Hello!

Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and
Fabian Kaelin (who are all Summer of Research fellows)[0] have worked hard
on a customized stream-based InputFormatReader that allows parsing of both
bz2-compressed and uncompressed files of the full Wikipedia dump.
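
The core idea, reading a bz2 dump incrementally instead of decompressing it
up front, can be sketched in a few lines of Python. This is only an
illustration of stream decompression, not the InputFormatReader itself, and
it assumes a single-stream bz2 file:

    import bz2

    def stream_decompress(path, chunk_size=1 << 20):
        """Yield decompressed chunks of a bz2 file without materializing it."""
        decomp = bz2.BZ2Decompressor()  # assumes a single bz2 stream
        with open(path, "rb") as f:
            while True:
                raw = f.read(chunk_size)
                if not raw:
                    break
                data = decomp.decompress(raw)
                if data:
                    yield data

    # Hypothetical usage: feed chunks into an incremental XML parser.
    for chunk in stream_decompress("enwiki-pages-meta-history.xml.bz2"):
        pass  # parse chunk here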