Re: Lucene Scalability Question

Ali Salehi Sat, 13 Jan 2007 11:42:46 -0800

Hi all,
 Thanks for your previous mails.

Let me explain my scenario:
I have various sources producing sensor data in the format
I presented before. Each source has a description, which
is typically around 5 to 15 lines. I would like to index these data items
and descriptions so that users can search by keywords in the
description and/or the data items, to identify either the sources or the
data items.
For your information, the number of sources is around 10,000 and the
number of data items per source is around 5,000,000.


A query looks like the following:

     geneva center  data:1.123 type:temperatureC

To do this, I need to rank each data item based on
its properties (e.g., precision, freshness, ...) in addition
to the source's properties (e.g., reliability, user's interest, ...).
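
To make the ranking requirement concrete, here is a minimal sketch in plain
Java (no Lucene calls) of one way the keyword relevance could be combined with
item and source properties. Everything in it is an assumption made for
illustration: the class and field names, the normalization of every property
to the range [0,1], and the 0.6/0.4 and 0.5/0.5 weights are hypothetical, not
something Lucene provides or prescribes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CombinedRanking {

    /** A matching data item with the (hypothetical) properties used for ranking. */
    public static final class Candidate {
        public final String id;
        public final double textScore;   // keyword relevance, assumed in [0,1]
        public final double precision;   // item property, assumed in [0,1]
        public final double freshness;   // item property, assumed in [0,1]
        public final double reliability; // source property, assumed in [0,1]

        public Candidate(String id, double textScore, double precision,
                         double freshness, double reliability) {
            this.id = id;
            this.textScore = textScore;
            this.precision = precision;
            this.freshness = freshness;
            this.reliability = reliability;
        }
    }

    /** Scales the text relevance by a weighted mix of item and source quality. */
    public static double score(Candidate c) {
        double itemPart = 0.5 * c.precision + 0.5 * c.freshness; // assumed weights
        double sourcePart = c.reliability;
        return c.textScore * (0.6 * itemPart + 0.4 * sourcePart); // assumed weights
    }

    /** Returns a new list sorted by descending combined score. */
    public static List<Candidate> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        Comparator<Candidate> byScore = Comparator.comparingDouble(CombinedRanking::score);
        sorted.sort(byScore.reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Candidate> hits = new ArrayList<>();
        hits.add(new Candidate("stale-item", 0.9, 0.8, 0.1, 0.5));
        hits.add(new Candidate("fresh-item", 0.9, 0.8, 0.9, 0.5));
        for (Candidate c : rank(hits)) {
            System.out.println(c.id + " " + score(c));
        }
    }
}
```

With equal text relevance, the fresher item is listed first. In a real setup
the per-source part could be computed once per source rather than once per
data item, since it does not change between items of the same source.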

The loop I sent in the previous mail was just for
evaluating the performance of a randomly generated query over the
data.

 Regarding the index reader, I'm opening it once :-).
 I would appreciate any ideas on how to use/adapt Lucene to this scenario.

Best regards,
AliS




>
> : So you mean lucene can't do better than this ?
>
> robert's point is that based on what you've told us, there is no reason to
> think Lucene makes sense for you -- if *all* you are doing is finding
> documents based on numeric ranges, then a relational database is better
> suited to your task.  if you actually care about the textual IR features
> of Lucene, then there are probably ways to make your searches faster, but
> you aren't giving us enough information.
>
> you said the example code you gave was in a loop ... but a loop over what?
> .. what changes with each iteration of the loop? ... if there are
> RangeFilters that get reused more than once, CachingWrapperFilter can come
> in handy to ensure that work isn't done more often than it needs to be.
>
> it's also not clear whether your query on "type:0" is just a placeholder,
> or indicative of what you actually want to do in the long run ... if all
> of your queries are this simple, and all you care about is getting a count
> of things that have type:0 and are in your numeric ranges, then don't use
> the "search" method at all, just put "type:0" in your ChainedFilter and
> call the "bits" method directly.
>
> you also haven't given us any information about whether or not you are
> opening a new IndexSearcher/IndexReader every time you execute a query, or
> reusing the same instance -- reuse makes the performance much better
> because it can reuse underlying resources.
>
> In short: if you state some performance numbers from timing some code, and
> want to know how to make that code faster, you have to actually show people
> *all* of the code for them to be able to help you.
>
>
> : >>  I still have the search problem I had before, now search takes around
> : >> 750 msecs for a small set of documents.
> : >>
> : >>     [java] Total Query Processing time (msec) : 38745
> : >>     [java] Total No. of Documents : 7,500,000
> : >>     [java] Total No. of Executed queries : 50.0
> : >>     [java] Execution time per query : 774.9 msec
> : >>
> : >>  The index is optimized and its size is 830 MB.
> : >>  Each document has the following terms :
> : >>     VSID(integer), data(float), type(short int) , precision (byte).
> : >>   The queries are generated in a loop similar to the one below :
> : >> loop ...
> : >>     RangeFilter rq1 = new RangeFilter("data",
> : >>         "+5.43243243440000", "+5.43243243449999", true, true);
> : >>     RangeFilter rq2 = new RangeFilter("precision",
> : >>         "+0001", "+0002", true, true);
> : >>     ChainedFilter cf = new ChainedFilter(new
> : >> Filter[]{rq2,rq1},ChainedFilter.AND);
> : >>     Query query = qp.parse("type:0");
> : >>     Hits hits = searcher.search(query,cf);
> : >> end loop
> : >>
> : >>  I would like to know if there is any way to improve the search
> : >> time?  (I need to insert more than 500 million of these data pages
> : >> into lucene)
>
>
>
>
> -Hoss
>


**************************************************************
Ali Salehi, LSIR - Distributed Information Systems Laboratory
EPFL-IC-IIF-LSIR, Bâtiment BC, Station 14, CH-1015 Lausanne, Switzerland.
http://lsirwww.epfl.ch/
email: [EMAIL PROTECTED]
Tel: +41-21-6936656 Fax: +41-21-6938115



