While future Solr-Hadoop integration is a definite possibility (and will enable other cool stuff), it doesn't necessarily seem needed for the problem you are trying to solve.

> indexing them in parallel is not an option as my target doc size per hr
> itself can be very huge (3-6M)

I'm not sure I understand... the bigger the indexing job, the more it
makes sense to do it in parallel.

If you're not doing any link inversion for web search, it doesn't seem
like Hadoop is needed for parallelism. If you are doing web crawling,
perhaps look to Nutch, not Hadoop.

-Yonik

On Fri, Nov 28, 2008 at 1:31 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> I have a huge number of documents to index per hour, and within an hour
> I cannot complete the job using a single machine. Having them distributed
> across multiple boxes and indexing them in parallel is not an option as my
> target doc size per hr itself can be very huge (3-6M). So I am considering
> using HDFS and MapReduce to finish the indexing job in time.
>
> In that regard, I have the following questions about using Solr with
> Hadoop:
>
> 1. After creating the index with Hadoop, would storing it in HDFS for
> query purposes mean additional performance overhead (compared to storing
> it on the local disk of a single machine)?
>
> 2. What kind of change is needed to make Solr query an index that is
> stored in HDFS?
>
> Regards,
> Sourav
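
For readers who want to see the shape of the approach Sourav describes
(building per-shard Lucene indexes in the reduce phase, then shipping
them to HDFS), here is a minimal sketch. It assumes current Lucene and
Hadoop MapReduce APIs; the class name, the field names, and the
/solr-shards output path are illustrative placeholders, not anything
from this thread or from Solr itself.

import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Each reducer builds one Lucene index shard; running N reducers in
// parallel yields N shards per batch. Input is assumed to be
// (docId, body) text pairs produced by the map phase.
public class ShardIndexReducer
        extends Reducer<Text, Text, NullWritable, NullWritable> {

    private IndexWriter writer;
    private java.nio.file.Path localDir;

    @Override
    protected void setup(Context context) throws IOException {
        // Build the shard on local disk: Lucene's write path wants
        // random-access local files, which raw HDFS does not provide.
        localDir = Files.createTempDirectory("shard-");
        writer = new IndexWriter(FSDirectory.open(localDir),
                new IndexWriterConfig(new StandardAnalyzer()));
    }

    @Override
    protected void reduce(Text docId, Iterable<Text> bodies, Context context)
            throws IOException {
        for (Text body : bodies) {
            Document doc = new Document();
            doc.add(new StringField("id", docId.toString(), Field.Store.YES));
            doc.add(new TextField("body", body.toString(), Field.Store.NO));
            writer.addDocument(doc);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        writer.close();
        // Ship the finished shard to HDFS; a search node later pulls it
        // back down (or merges it) before serving queries from it.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        String task = context.getTaskAttemptID().getTaskID().toString();
        fs.copyFromLocalFile(new Path(localDir.toString()),
                new Path("/solr-shards/" + task));
    }
}

Note that in this sketch the shard is built on local disk and only
copied to HDFS once it is complete. That is also one plausible answer
to Sourav's question 1: serving queries directly out of HDFS adds a
network round-trip to reads, so search nodes would typically copy
shards back to local disk before serving them.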