Hi Yonik,

Let me explain why I thought using Hadoop would be a better way to achieve parallel indexing.
Here are the requirements and constraints:

1. The 3-6M documents (around 300 to 600 MB in size) all belong to the same schema.
2. The resulting index of those 3-6M documents has to reside on a single box (the target box).
3. I have to use desktop-grade servers with limited RAM (say a maximum of 2 GB) and a single CPU, but with plenty of disk space (above 100 GB).

Now, if I try to index the 3-6M records by running a single thread on each of those servers, the steps are:

1. Create an index on all N boxes.
2. Merge those indexes on the target box.
3. Optimize the resulting index on the target box.

(A minimal Lucene sketch of steps 2 and 3 is at the end of this mail.)

The Hadoop way, what I would need to do is:

1. Use those N servers to create the HDFS.
2. Copy the raw data (3-6M records) to HDFS.
3. Use Map/Reduce to index those documents and optimize the result.

(A rough Map/Reduce job skeleton is also at the end of this mail.)

I think that this way the index merging and optimization time would be lower, as it would no longer be limited by my single server's CPU and memory; through Map/Reduce the same work would happen across multiple boxes, utilizing their CPUs and memory in parallel. As far as I know, this is how Rackspace implemented Solr's integration with Hadoop, and they benefited from it. But I realize that this integration is not available as open source.

Also, please let me know if there is any other option within Solr to reduce indexing time in my case, given the limited capabilities of the servers.

Regards,
Sourav

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yonik Seeley
Sent: Friday, November 28, 2008 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Using Solr with Hadoop

....

While future Solr-hadoop integration is a definite possibility (and will
enable other cool stuff), it doesn't necessarily seem needed for the
problem you are trying to solve.

> indexing them in parallel is not an option as my target doc size per hr
> itself can be very huge (3-6M)

I'm not sure I understand... the bigger the indexing job, the more it
makes sense to do in parallel.

If you're not doing any link inversion for web search, it doesn't seem
like hadoop is needed for parallelism. If you are doing web crawling,
perhaps look to nutch, not hadoop.

-Yonik

On Fri, Nov 28, 2008 at 1:31 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> I have a huge number of documents to index (say per hour), and within an
> hour I cannot complete it using a single machine. Having them distributed
> across multiple boxes and indexing them in parallel is not an option, as my
> target doc size per hour itself can be very huge (3-6M). So I am considering
> using HDFS and MapReduce to do the indexing job within time.
>
> In that regard, I have the following queries about using Solr with Hadoop.
>
> 1. After creating the index using Hadoop, would storing it in HDFS for
> query purposes mean additional performance overhead (compared to storing it
> on an actual disk in one machine)?
>
> 2. What type of change is needed to make a Solr query read from an index
> which is stored in HDFS?
>
> Regards,
> Sourav
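P.S. To make steps 2 and 3 of the single-thread approach concrete, here is a
minimal sketch against the Lucene 2.x API. The class name and paths are made
up for illustration, and the per-box indexes are assumed to have already been
copied onto the target box:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Usage: java MergeIndexes /disk/index0 /disk/index1 ... /disk/indexN-1
public class MergeIndexes {
    public static void main(String[] args) throws Exception {
        // Fresh target index that will hold the merged result
        // (the /disk/merged-index path is made up).
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("/disk/merged-index"),
                new StandardAnalyzer(),
                true);

        // One Directory per source index copied over from the N boxes.
        Directory[] sources = new Directory[args.length];
        for (int i = 0; i < args.length; i++) {
            sources[i] = FSDirectory.getDirectory(args[i]);
        }

        writer.addIndexesNoOptimize(sources); // step 2: merge
        writer.optimize();                    // step 3: collapse to fewer segments
        writer.close();
    }
}

On a single-CPU, 2 GB box both the merge and the optimize run sequentially,
which is exactly the bottleneck I am worried about.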
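P.P.S. And here is a rough skeleton of the kind of Map/Reduce indexing job I
have in mind, using the old org.apache.hadoop.mapred API together with Lucene
2.x. This is only my own illustration, not the Rackspace implementation: the
class names, the hardcoded shard count of 4, the single "body" field, and the
/tmp and /indexes paths are all made up. Each reducer builds a Lucene index on
its node's local disk and then copies the finished shard into HDFS.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexJob {

    private static final int SHARDS = 4; // made-up shard count

    // Map: route each raw record (one document per line) to a shard.
    public static class ShardMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text record,
                        OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            int shard = (record.hashCode() & Integer.MAX_VALUE) % SHARDS;
            out.collect(new Text("shard" + shard), record);
        }
    }

    // Reduce: build one Lucene index per shard on local disk,
    // then copy the finished shard into HDFS.
    public static class IndexReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        private JobConf conf;

        public void configure(JobConf job) {
            this.conf = job;
        }

        public void reduce(Text shard, Iterator<Text> records,
                           OutputCollector<Text, Text> out, Reporter reporter)
                throws IOException {
            String local = "/tmp/" + shard.toString();
            IndexWriter writer =
                    new IndexWriter(local, new StandardAnalyzer(), true);
            while (records.hasNext()) {
                Document doc = new Document();
                doc.add(new Field("body", records.next().toString(),
                                  Field.Store.NO, Field.Index.TOKENIZED));
                writer.addDocument(doc);
            }
            writer.optimize(); // each shard optimizes in parallel on its own node
            writer.close();

            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path(local), new Path("/indexes/" + shard));
            out.collect(shard, new Text("done"));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(IndexJob.class);
        conf.setJobName("lucene-indexing");
        conf.setMapperClass(ShardMapper.class);
        conf.setReducerClass(IndexReducer.class);
        conf.setNumReduceTasks(SHARDS);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

The resulting shards would still need to be merged on the target box (for
example with the MergeIndexes sketch above, to satisfy my single-box
constraint), but the per-shard indexing and optimizing would run in parallel
across the N boxes instead of on one CPU.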