Distributed Indexing on MapReduce

2012-03-01 Thread Frank Scholten
Hi all,

I am looking into reusing some existing code for distributed indexing
to test a Mahout tool I am working on:
https://issues.apache.org/jira/browse/MAHOUT-944

What I want is to index the Apache Public Mail Archives dataset (200 GB)
via MapReduce on Hadoop.

I have been going through the Nutch and contrib/index code and from my
understanding I have to:

* Create an InputFormat / RecordReader / InputSplit class for
splitting the e-mails across mappers
* Create a Mapper which emits the e-mails as key-value pairs
* Create a Reducer which indexes the e-mails on the local filesystem
(or straight to HDFS?)
* Copy these indexes from the local filesystem to HDFS. In the same Reducer?
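
To make the steps above concrete, here is a minimal in-memory sketch of
the map / partition / reduce flow. It is plain Python, not Hadoop or
Lucene code: the function names, the CRC32 partitioner, the reducer
count and the toy inverted index are all assumptions made for
illustration only.

```python
from collections import defaultdict
import zlib

NUM_REDUCERS = 3  # hypothetical; a real job takes this from its configuration


def map_email(message_id, body):
    """Mapper: emit each e-mail as a (key, value) pair."""
    yield message_id, body


def partition(key, num_reducers):
    """Partitioner: deterministic hash of the key, modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_to_shard(pairs):
    """Reducer: build a toy inverted index (term -> sorted message ids)."""
    index = defaultdict(set)
    for message_id, body in pairs:
        for term in body.lower().split():
            index[term].add(message_id)
    return {term: sorted(ids) for term, ids in index.items()}


def run_job(emails, num_reducers=NUM_REDUCERS):
    """Drive map -> shuffle -> reduce in memory; returns one index per shard."""
    buckets = defaultdict(list)
    for message_id, body in emails:
        for key, value in map_email(message_id, body):
            buckets[partition(key, num_reducers)].append((key, value))
    return {shard: reduce_to_shard(pairs) for shard, pairs in sorted(buckets.items())}


emails = [
    ("msg-1", "Distributed indexing on Hadoop"),
    ("msg-2", "Indexing mail archives with Lucene"),
    ("msg-3", "Hadoop reducers write index shards"),
]
shards = run_job(emails)
```

Each e-mail ends up in exactly one shard, so a query has to be sent to
every shard and the hits merged afterwards.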

I am unsure about the final steps: how do I get to the end result, a
bunch of index shards on HDFS? It seems that each Reducer needs to be
aware of the directory it eventually writes to on HDFS, and I don't
see how to get each Reducer to copy its shard there.

How do I set this up?
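
One arrangement I can imagine, as a rough sketch only: each reducer
builds its shard in local scratch space and then copies it into a
shared output root, under a directory named after its partition
number. The Python below simulates this with a local directory
standing in for HDFS. The helper names and the part-NNNNN naming are
assumptions, and the text files are a stand-in for a real index. In an
actual job I assume the copy would go through something like Hadoop's
FileSystem.copyFromLocalFile instead of shutil.

```python
import shutil
import tempfile
from pathlib import Path


def shard_dir_name(partition_id):
    """Name the shard after the reducer's partition number, e.g. part-00002."""
    return f"part-{partition_id:05d}"


def write_and_publish_shard(partition_id, documents, output_root):
    """Build the shard in local scratch space, then copy it to the shared root."""
    destination = Path(output_root) / shard_dir_name(partition_id)
    with tempfile.TemporaryDirectory() as scratch:
        local_shard = Path(scratch) / "index"
        local_shard.mkdir()
        # Stand-in for a real index writer: one text file per document.
        for doc_id, text in documents:
            (local_shard / f"{doc_id}.txt").write_text(text, encoding="utf-8")
        # Stand-in for the copy to HDFS; raises if the shard already exists.
        shutil.copytree(local_shard, destination)
    return destination


# A temporary directory plays the role of the job's output directory on HDFS.
output_root = Path(tempfile.mkdtemp(prefix="index-shards-"))
published = [
    write_and_publish_shard(0, [("msg-1", "hello hadoop")], output_root),
    write_and_publish_shard(1, [("msg-2", "hello lucene")], output_root),
]
print(sorted(p.name for p in published))  # ['part-00000', 'part-00001']
```

Because the destination is derived from the partition number, no two
reducers write to the same directory, and nothing has to be
coordinated between them beyond the shared output root.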

Cheers,

Frank


[Announcement] SearchWorkings.org is live!

2011-09-12 Thread Frank Scholten
Hi all,

This is an announcement of the community site SearchWorkings.org [1].

SearchWorkings.org offers search professionals a point of contact and
a comprehensive resource to learn about and discuss new developments
in the world of open source search and related subjects like Mahout
and Hadoop.

The site was created by a group of search professionals from the
Lucene & Solr community, and I am involved in it to cover topics
related to Mahout and Hadoop. The initial focus is on Lucene & Solr,
Mahout and Hadoop, but the site aims to be much broader.

Like any other community website, content will be added on a regular
basis and community members can contribute too.

Right now, you have access to an extensive resource centre offering
online tutorials, downloads, white papers and access to a host of
search specialists in the forum.
In addition, you can post blog items and keep up to date with relevant
news.

We look forward to more and more blogs, articles and tutorials, real
case studies and third-party extensions for OSS search components.

You are more than welcome to contribute and tell your story about
using these technologies.

Have fun,

Frank

[1] http://www.searchworkings.org
[2] Trademark Acknowledgement: Apache Lucene, Apache Solr and Apache
Mahout and their respective logos are trademarks of The Apache
Software Foundation. All other marks mentioned may be trademarks or
registered trademarks of their respective owners.