Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Dmitry Chichkov
Perhaps fine-tuning it for EC2, maybe even hosting the dataset there? I can see how this can be very useful! Otherwise... well... It seems like Hadoop gives you a lot of overhead, and it is just not practical to do parsing this way. With a straightforward implementation in Python, on a single Core2
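
A minimal single-process sketch of the kind of straightforward Python parse referred to above, assuming a bz2-compressed pages-meta-history dump; the file name is a placeholder and this is not the poster's actual script:

#!/usr/bin/env python
# Illustrative sketch: stream a bz2-compressed Wikipedia dump on one core
# and count revisions without loading the file into memory.
import bz2
from xml.etree.ElementTree import iterparse

def localname(tag):
    # Strip the XML namespace prefix that ElementTree attaches to dump tags.
    return tag.rsplit('}', 1)[-1]

def count_revisions(path):
    revisions = 0
    with bz2.BZ2File(path) as stream:
        for _event, elem in iterparse(stream):
            if localname(elem.tag) == 'revision':
                revisions += 1
            if localname(elem.tag) == 'page':
                elem.clear()  # free finished pages; the dump will not fit in RAM
    return revisions

if __name__ == '__main__':
    # Placeholder file name; any pages-meta-history dump would do.
    print(count_revisions('enwiki-pages-meta-history.xml.bz2'))

The point of comparison in the message above is that a loop like this avoids Hadoop's coordination overhead, at the cost of running on a single core.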

Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Diederik van Liere
So the 14-day task included XML parsing and creating diffs. We might gain performance improvements by fine-tuning the Hadoop configuration, although that seems to be more of an art than a science. Diederik On Wed, Aug 17, 2011 at 5:28 PM, Dmitry Chichkov wrote: > Hello, > > This is excellent ne
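
As an illustration of the kind of configuration fine-tuning mentioned here, the sketch below assembles a Hadoop Streaming job with a few -D property overrides from Python. The jar name, input/output paths, scripts, and property values are placeholders, property names differ between Hadoop versions, and nothing here reflects the actual WikiHadoop setup:

#!/usr/bin/env python
# Illustrative only: build and launch a Hadoop Streaming command with a
# handful of -D configuration overrides. Paths and values are placeholders.
import subprocess

overrides = {
    'mapred.reduce.tasks': '16',            # number of reduce tasks
    'mapred.child.java.opts': '-Xmx2048m',  # heap for map/reduce child JVMs
    'mapred.compress.map.output': 'true',   # compress intermediate map output
}

cmd = ['hadoop', 'jar', 'hadoop-streaming.jar']
for key, value in overrides.items():
    cmd += ['-D', '%s=%s' % (key, value)]   # -D options must precede job options
cmd += [
    '-input', '/dumps/enwiki-pages-meta-history.xml.bz2',
    '-output', '/out/revision-stats',
    '-mapper', 'mapper.py',
    '-reducer', 'reducer.py',
]
subprocess.check_call(cmd)

Which values actually help depends on cluster memory, disk, and the job itself, which is presumably why the tuning feels more like an art than a science.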

Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Aaron Halfaker
We haven't tried EC2. Since this use case was really strange for Hadoop (if we break the dump up by pages, some are > 160GB!!), we have been rolling our own code and testing on our own machines so we had the flexibility to get things working. Theoretically, one should be able to use this on EC2 (o

Re: [Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Dmitry Chichkov
Hello, This is excellent news! Have you tried running it on Amazon EC2? It would be really nice to know how well WikiHadoop scales up with the number of nodes. Also, this timing - "3 x Quad Core / 14 days / full Wikipedia dump" - on what kind of task (XML parsing, diffs, MD5, etc.?) was it obtain

[Wiki-research-l] Announcing Wikihadoop: using Hadoop to analyze Wikipedia dump files

2011-08-17 Thread Diederik van Liere
Hello! Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and Fabian Kaelin (who are all Summer of Research fellows)[0] have worked hard on a customized stream-based InputFormatReader that allows parsing of both bz2-compressed and uncompressed files of the full Wikipedia dump
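
A hedged sketch of how a Hadoop Streaming mapper might sit behind a stream-based input format like the one announced here. It assumes the mapper receives page/revision XML from the dump on stdin and emits tab-separated key/value pairs; the record boundaries actually produced by WikiHadoop may differ, and the statistic computed (revision text length) is just an example:

#!/usr/bin/env python
# Illustrative Hadoop Streaming mapper for Wikipedia dump XML fed via stdin.
# Assumes <revision> elements arrive whole; not WikiHadoop's exact contract.
import sys
import xml.etree.ElementTree as ET

def revision_sizes(stream):
    # Collect each <revision>...</revision> block and yield (id, text length).
    buf, inside = [], False
    for line in stream:
        if '<revision' in line:
            inside = True
        if inside:
            buf.append(line)
        if '</revision>' in line:
            inside = False
            rev = ET.fromstring(''.join(buf))
            buf = []
            yield rev.findtext('id'), len(rev.findtext('text') or '')

if __name__ == '__main__':
    for rev_id, size in revision_sizes(sys.stdin):
        # Hadoop Streaming reads tab-separated key/value pairs from stdout.
        sys.stdout.write('%s\t%d\n' % (rev_id, size))

Since the announced InputFormatReader handles both bz2-compressed and uncompressed dump files on the Hadoop side, the mapper itself should only ever see decompressed XML.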