While future Solr-Hadoop integration is a definite possibility (and
will enable other cool stuff), it doesn't seem necessary for the
problem you are trying to solve.

> indexing them in parallel is not an option as my target doc size per hr 
> itself can be very huge (3-6M)

I'm not sure I understand... the bigger the indexing job, the more it
makes sense to do it in parallel. If you're not doing any link
inversion for web search, it doesn't seem like Hadoop is needed for
parallelism. If you are doing web crawling, perhaps look to Nutch,
not Hadoop.
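
To make the parallel route concrete, here is a minimal sketch (mine,
not anything shipped with Solr) that feeds N independent Solr
instances concurrently via SolrJ, round-robin partitioning documents
across them. The hostnames and field names are made up, and a real
feeder would batch the adds instead of sending one document per
request:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
  public static void main(String[] args) throws Exception {
    // One independent Solr instance per box; hostnames are made up.
    List<String> urls = Arrays.asList(
        "http://solr1:8983/solr",
        "http://solr2:8983/solr",
        "http://solr3:8983/solr");

    final int numShards = urls.size();
    final int totalDocs = 3000000;  // e.g. 3M docs per hour
    ExecutorService pool = Executors.newFixedThreadPool(numShards);

    for (int s = 0; s < numShards; s++) {
      final int myShard = s;
      final SolrServer server = new CommonsHttpSolrServer(urls.get(s));
      pool.submit(new Runnable() {
        public void run() {
          try {
            for (int i = 0; i < totalDocs; i++) {
              // Round-robin partitioning: each doc lands on exactly one shard.
              if (i % numShards != myShard) continue;
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "doc-" + i);
              doc.addField("text", "body of document " + i);  // stand-in content
              server.add(doc);
            }
            server.commit();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();
  }
}

Each box then holds roughly 1/N of the documents, and Solr 1.3's
distributed search can query across all of them in a single request,
e.g.

http://solr1:8983/solr/select?q=foo&shards=solr1:8983/solr,solr2:8983/solr,solr3:8983/solr

so neither the indexing nor the querying needs Hadoop or HDFS here.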

-Yonik


On Fri, Nov 28, 2008 at 1:31 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> I have a huge number of documents to index (say per hr) and within a hr I
> cannot complete it using a single machine. Having them distributed in multiple
> boxes and indexing them in parallel is not an option as my target doc size
> per hr itself can be very huge (3-6M). So I am considering using HDFS and
> MapReduce to do the indexing job within time.
>
> In that regard I have following queries regarding using Solr with Hadoop.
>
> 1. After creating the index using Hadoop, would storing it in HDFS for query
> purposes mean additional performance overhead (compared to storing it on a
> local disk on one machine)?
>
> 2. What type of change is needed to make Solr queries read from an index
> which is stored in HDFS?
>
> Regards,
> Sourav