Hi Yonik,

I have a case where, at peak season, I'm expecting around 36M docs per day,
peaking at 2-3M per hour. I also need to do some processing of those docs
before I index them. Based on the indexing performance figures in
http://wiki.apache.org/solr/SolrPerformanceFactors (the embedded vs. HTTP POST
section), it looks like it would take more than 2 hours to index 3M records
using 4 machines. So I thought it would be difficult to achieve my goal
through Solr alone, and that I'd need something else to further increase the
parallel processing.
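
A rough back-of-the-envelope behind that concern (my own arithmetic, based on
the numbers above):

  3M docs / 2 hr on 4 boxes   =>  ~375K docs/hr per box
  3M docs/hr peak / 375K      =>  ~8 such boxes for indexing alone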

Altogether, the targeted doc count would average around 3B (around 300 GB in
size). Docs would be constantly added and deleted on a daily basis, at an
average rate of 8M per day, with a peak of 36M. With around 10 boxes, every
box would need to store around 300M docs.
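
That works out to roughly:

  3B docs / 10 boxes      =>  ~300M docs per box
  300 GB data / 10 boxes  =>  ~30 GB of data per box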


Regards,
Sourav



-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Friday, November 28, 2008 5:38 PM
To: solr-user@lucene.apache.org
Subject: Re: Using Solr with Hadoop ....

The indexing rate you need to achieve should be equal to the rate that
new documents are produced.  It shouldn't matter much how long it
takes to index 3-6M documents the first time (within reason), given
that you only need to do it once/occasionally.  What is that rate
(i.e. why do you think you can't do it on a single box)?

For the scale of documents you are talking about, hadoop sounds like
it would complicate things more than simplify them.

There is a pending Solr patch for using custom IndexReader factories
that could easily open multiple indexes to search across (no optimize
needed).  Or, it would be relatively trivial to write a Lucene program
to merge the indexes.  You could also leave the indexes on multiple
boxes and use Solr's distributed search to search across them
(assuming you didn't really need everything on a single box).
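
For a concrete sketch of the merge idea (Lucene 2.x-era API; the class name
and argument layout here are just placeholders):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  // Merge several existing Lucene indexes into one target index.
  public class MergeIndexes {
    public static void main(String[] args) throws Exception {
      // args[0] = target index dir, args[1..n] = source index dirs
      IndexWriter writer = new IndexWriter(args[0], new StandardAnalyzer(), true);
      Directory[] sources = new Directory[args.length - 1];
      for (int i = 1; i < args.length; i++) {
        sources[i - 1] = FSDirectory.getDirectory(args[i]);
      }
      writer.addIndexes(sources);  // merges the sources into the target
      writer.close();
    }
  }

And distributed search just takes a shards parameter on the query, e.g.:

  http://box1:8983/solr/select?shards=box1:8983/solr,box2:8983/solr&q=foo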

-Yonik

On Fri, Nov 28, 2008 at 7:01 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi Yonik,
>
> Let me explain why I thought using Hadoop would help in achieving better 
> parallel indexing.
>
> Here are the set of requirements and constraints -
>
> 1. The 3-6M documents (around 300 to 600 MB in size) would all belong to the 
> same schema
> 2. The resulting index of those 3-6M documents has to reside in a single box 
> (the target box).
> 3. I have to use desktop-grade servers with limited RAM (say a maximum of 
> 2 GB) and a single CPU, but with plenty of disk space (above 100 GB).
>
> Now, if I try to index the 3-6M records by running a single thread in each 
> of those servers, then the steps are -
>
> 1. Create index in all N boxes
> 2. Merge those indexes in the target box
> 3. Optimize the resulting index in the target box.
>
> In the Hadoop way, what I would need to do is -
>
> 1. Use those 'N' servers to create the HDFS.
> 2. Copy the raw data (3-6M records) to the HDFS.
> 3. Then use Map/Reduce to index those documents and optimize (rough sketch 
> below).
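>
> (For illustration, a rough sketch of what the map side could look like with
> Hadoop's old mapred API - the class, paths, and field names here are my own
> assumptions, not an existing Solr/Hadoop integration:
>
>   import java.io.IOException;
>   import org.apache.hadoop.io.LongWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapred.*;
>   import org.apache.lucene.analysis.standard.StandardAnalyzer;
>   import org.apache.lucene.document.Document;
>   import org.apache.lucene.document.Field;
>   import org.apache.lucene.index.IndexWriter;
>
>   // Each map task builds a partial Lucene index on local disk;
>   // the partial indexes get merged/optimized in a later step.
>   public class IndexMapper extends MapReduceBase
>       implements Mapper<LongWritable, Text, Text, Text> {
>     private IndexWriter writer;
>
>     public void configure(JobConf job) {
>       try {
>         // Hypothetical local path; one partial index per map task.
>         writer = new IndexWriter(
>             "/tmp/partial-index-" + job.get("mapred.task.id"),
>             new StandardAnalyzer(), true);
>       } catch (IOException e) {
>         throw new RuntimeException(e);
>       }
>     }
>
>     public void map(LongWritable key, Text line,
>                     OutputCollector<Text, Text> out, Reporter reporter)
>         throws IOException {
>       // Assumes one raw record per input line.
>       Document doc = new Document();
>       doc.add(new Field("body", line.toString(),
>                         Field.Store.NO, Field.Index.TOKENIZED));
>       writer.addDocument(doc);
>     }
>
>     public void close() throws IOException {
>       writer.close();  // flush this task's partial index
>     }
>   }
>
> ...with the merge and optimize of the partial indexes as the follow-on step.)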
>
> I think that in this way the index merging and optimization time would be 
> less, as it would not be limited by a single server's CPU and memory; 
> through Map/Reduce the same work would happen on multiple boxes, using their 
> CPUs and memory in parallel. As far as I know, this is how Rackspace 
> implemented Solr's integration with Hadoop and benefited from it. But I 
> realize that this integration is not available as open source.
>
> Also, please let me know if there is any other option within Solr to reduce 
> indexing time in my case, given the limited capabilities of the servers.
>
> Regards,
> Sourav
>
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
> Sent: Friday, November 28, 2008 1:58 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Using Solr with Hadoop ....
>
> While future Solr-hadoop integration is a definite possibility (and
> will enable other cool stuff), it doesn't necessarily seem needed for
> the problem you are trying to solve.
>
>> indexing them in parallel is not an option, as my target doc count per 
>> hour itself can be very huge (3-6M)
>
> I'm not sure I understand... the bigger the indexing job, the more it
> makes sense to do in parallel.  If you're not doing any link inversion
> for web search, it doesn't seem like hadoop is needed for parallelism.
>  If you are doing web crawling, perhaps look to nutch, not hadoop.
>
> -Yonik
>
>
> On Fri, Nov 28, 2008 at 1:31 PM, souravm <[EMAIL PROTECTED]> wrote:
>> Hi All,
>>
>> I have a huge number of documents to index per hour, and I cannot complete 
>> the job within an hour using a single machine. Having them distributed 
>> across multiple boxes and indexing them in parallel is not an option, as my 
>> target doc count per hour itself can be very huge (3-6M). So I am 
>> considering using HDFS and MapReduce to get the indexing done in time.
>>
>> In that regard I have following queries regarding using Solr with Hadoop.
>>
>> 1. After creating the index using Hadoop, would storing it in HDFS for 
>> query purposes mean additional performance overhead (compared to storing it 
>> on an actual local disk on one machine)?
>>
>> 2. What type of change is needed to make a Solr query read from an index 
>> stored in HDFS?
>>
>> Regards,
>> Sourav
