Hi Yonik,

Let me explain why I thought using Hadoop would help in achieving parallel 
indexing.

Here is the set of requirements and constraints:

1. The 3-6M documents (around 300 to 600 MB in size) all belong to the same 
schema.
2. The resulting index of those 3-6M documents has to reside on a single box 
(the target box).
3. I have to use desktop-grade servers with limited RAM (say a maximum of 2 GB) 
and a single CPU, but with ample disk space (above 100 GB).
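
To put numbers on these constraints, here is a rough calculation of the rate 
each box would have to sustain. It assumes the one-hour window from my original 
question quoted below; the box count of 4 is only a hypothetical value for N, 
and the time for merging and optimizing is not included.

```python
# Back-of-the-envelope rates implied by the figures above.
# Assumption: the whole batch must be indexed within one hour.
SECONDS_PER_HOUR = 3600


def required_rate(docs, total_mb, boxes=1):
    """Return (docs/sec, MB/sec) that each box must sustain to finish in one hour."""
    docs_per_sec = docs / SECONDS_PER_HOUR / boxes
    mb_per_sec = total_mb / SECONDS_PER_HOUR / boxes
    return docs_per_sec, mb_per_sec


for docs, mb in [(3_000_000, 300), (6_000_000, 600)]:
    for boxes in (1, 4):  # 4 is a hypothetical value for N
        d, m = required_rate(docs, mb, boxes)
        print(f"{docs} docs, {mb} MB, {boxes} box(es): "
              f"{d:.0f} docs/s and {m:.3f} MB/s per box")
```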

Now, if I index the 3-6M records by running a single thread on each of those 
servers, the steps are:

1. Create an index on each of the N boxes.
2. Merge those indexes on the target box.
3. Optimize the resulting index on the target box.
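
The three steps above can be sketched roughly as follows. This is only a toy 
model: a Python dict stands in for a real Lucene/Solr index, the function names 
and the sample data are hypothetical, and "optimize" here just means freezing 
the postings into sorted lists.

```python
from collections import defaultdict


def build_index(docs):
    """Step 1: build a toy inverted index (term -> set of doc ids) on one box."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge_indexes(partials):
    """Step 2: merge the per-box indexes on the target box."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return merged


def optimize(index):
    """Step 3: stand-in for optimization; store postings as sorted lists."""
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Hypothetical sample data, already split across N = 2 boxes.
box_1 = [(1, "solr index merge"), (2, "hadoop map reduce")]
box_2 = [(3, "solr hadoop"), (4, "index optimize")]

partials = [build_index(box_1), build_index(box_2)]  # runs on each box
final_index = optimize(merge_indexes(partials))      # runs on the target box
print(final_index["solr"])
```

Note that steps 2 and 3 both run on the target box alone, which is the part I 
am worried about.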

With Hadoop, what I would need to do is:

1. Use those N servers to set up HDFS.
2. Copy the raw data (3-6M records) to HDFS.
3. Use Map/Reduce to index those documents and optimize the index.
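
The shape of such a Map/Reduce job could look roughly like the sketch below. It 
is plain Python, not the Hadoop API and not the Rackspace integration; the 
function names, the reducer count, and the sample records are all hypothetical. 
The map phase routes each document to a reducer, and each reducer builds one 
toy index shard.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical; for example, one reducer per box


def map_doc(doc_id, text):
    """Map: route each document to a reducer by a stable hash of its id."""
    shard = zlib.crc32(str(doc_id).encode("utf-8")) % NUM_REDUCERS
    return shard, (doc_id, text)


def shuffle(pairs):
    """Shuffle: group the mapped documents by reducer key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_shard(docs):
    """Reduce: build one toy index shard (term -> sorted list of doc ids)."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs):
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)
    return dict(index)


# Hypothetical sample records.
docs = [(i, f"record {i} solr hadoop") for i in range(10)]

grouped = shuffle(map_doc(doc_id, text) for doc_id, text in docs)
shards = {shard: reduce_shard(shard_docs) for shard, shard_docs in grouped.items()}

for shard, index in sorted(shards.items()):
    print(shard, index["solr"])
```

In this sketch the output is still one shard per reducer, so the shards would 
have to be brought together on the target box afterwards (requirement 2 above).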

I think that this way the index merging and optimization time would be lower, 
as those steps would not be limited by my single server's CPU and memory. 
Instead, through Map/Reduce, they would run on multiple boxes, using their CPUs 
and memory in parallel. As far as I know, this is how Rackspace implemented 
Solr's integration with Hadoop and benefited from it. But I realize that this 
integration is not available as open source.

Also, please let me know if there is any other option within Solr to reduce 
indexing time in my case, given the limited capabilities of the servers.

Regards,
Sourav



-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Friday, November 28, 2008 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Using Solr with Hadoop ....

While future Solr-hadoop integration is a definite possibility (and
will enable other cool stuff), it doesn't necessarily seem needed for
the problem you are trying to solve.

> indexing them in parallel is not an option as my target doc size per hr 
> itself can be very huge (3-6M)

I'm not sure I understand... the bigger the indexing job, the more it
makes sense to do in parallel.  If you're not doing any link inversion
for web search, it doesn't seem like hadoop is needed for parallelism.
 If you are doing web crawling, perhaps look to nutch, not hadoop.

-Yonik


On Fri, Nov 28, 2008 at 1:31 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> I have a huge number of documents to index (say per hr) and within an hr I 
> cannot complete it using a single machine. Having them distributed in multiple 
> boxes and indexing them in parallel is not an option as my target doc size 
> per hr itself can be very huge (3-6M). So I am considering using HDFS and 
> MapReduce to do the indexing job within time.
>
> In that regard I have the following queries regarding using Solr with Hadoop.
>
> 1. After creating the index using Hadoop, would storing it in HDFS for query 
> purposes mean additional performance overhead (compared to storing it on an 
> actual disk on one machine)?
>
> 2. What type of change is needed to make Solr queries read from an index which 
> is stored in HDFS?
>
> Regards,
> Sourav
