Re: Question concerning speed of Lucene.....
Oliver,

On Friday 27 August 2004 22:20, you wrote:

> Hi,
>
> I guess this is one of the most often asked questions on this mailing list, but hopefully my question is more specific, so that I can get some input from you.
>
> My project is to implement an agency system for newspapers. So I have to handle about 30 days of text and IPTC data. The latter is taken from images provided by the agencies. I basically get a constant stream of text messages from the agencies (roughly 2000 per day per agency) and images (roughly 1000 per day per agency). I have to deal with 4 text and 6 image agencies. So my daily input is 8000 text messages and 6000 images. The documents extracted from these text messages and images have a size of about 1kb.
>
> The extraction of the data and conversion to Document objects is already finished, and the search using Lucene works like a charm. Brilliant software!
>
> But now to my questions. To explain what I am doing, I'd like to talk a little about the kind of queries and data I have to deal with.
>
> * Every message has a priority: an integer value ranging from 1 to 6.
> * Every message has a receive date.
> * Every message has an agency assigned, basically a unique string identifier for it.
> * Every message has some header data that is also indexed for refined searches.
> * And of course the actual text included in the text message itself or the IPTC header of an image.
>
> Typically I have two kinds of queries.
>
> * Typical relational queries
>   * Show every text message from a certain agency in the last X days.

Probably good for a date filter; see the wiki on RangeQuery, and possibly my previous message on filters (using a 2nd index for constraining). Lucene has no facilities for primary keys, so that is up to you.

>   * Show every image or text message with a higher priority than Y and from a certain period of time.

RangeQuery again for the priority.
One can store images in Lucene, but currently only in String format, i.e. they'll need some conversion. There was some talk about binary objects not too long ago, but that is still in development. I'd probably store the images in a file system or in another db for now. OTOH, if you're willing to help with storing binary images, lucene-dev is nearby.

> * Fulltext search

Yes :)

>   * A real fulltext search over all elements using the full power of Lucene's query language.

Span queries are currently not supported by the query language; you might have a look at the org.apache.lucene.search.spans package.

> It is absolutely no question anymore that the latter queries will be done using Lucene. But whether the first type of query can be handled as well is the thing I am thinking about. Can this be done efficiently with Lucene? So far we use a system that uses a SQL database engine for storing the relevant data and is used in these queries. But if Lucene is fast enough with these queries too, I am willing to skip the SQL database altogether. But I have to keep in mind that I will be indexing about 400.000 messages per month.

Lucene can be as fast as relational databases, provided your lower-level Java code on IndexReader plays nice with system resources like disk heads and RAM. That means using filters, sorting on index order before using an index, and possibly sorting on document number before retrieving stored fields. Lucene's IndexSearcher for searching text queries is quite well behaved in that respect.

To easily keep the primary keys in sync between the SQL db and Lucene, I'd start by keeping the images and the full text only in the SQL db. Lucene optimisations (needed after adding/deleting docs) copy all data, so it pays to keep the Lucene indexes small. Later you might need multiple indexes, a MultiSearcher, and occasionally a merge of the indexes.

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
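The date-filter advice above relies on the receive date being indexed in a form where lexicographic order matches chronological order. A minimal sketch of computing the lower bound for a "last X days" range query (the yyyyMMdd format and the received field name are my assumptions, not from the thread):

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;

public class DateFieldDemo {

    // yyyyMMdd strings sort lexicographically in chronological order,
    // which is what RangeQuery and date filters need.
    static final SimpleDateFormat FMT = new SimpleDateFormat("yyyyMMdd");

    // Lower bound for a "last X days" query, as an index-ready string.
    public static String daysAgo(Calendar now, int days) {
        Calendar c = (Calendar) now.clone();
        c.add(Calendar.DAY_OF_MONTH, -days);
        return FMT.format(c.getTime());
    }

    public static void main(String[] args) {
        Calendar now = Calendar.getInstance();
        now.set(2004, Calendar.AUGUST, 27);
        // e.g. messages from the last 7 days: received:[20040820 TO 20040827]
        System.out.println(daysAgo(now, 7)); // prints 20040820
    }
}
```

Index the receive date with the same format; "last X days from a certain agency" then becomes a range over this field combined with a term on the agency field.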
Question concerning speed of Lucene.....
Hi,

I guess this is one of the most often asked questions on this mailing list, but hopefully my question is more specific, so that I can get some input from you.

My project is to implement an agency system for newspapers. So I have to handle about 30 days of text and IPTC data. The latter is taken from images provided by the agencies. I basically get a constant stream of text messages from the agencies (roughly 2000 per day per agency) and images (roughly 1000 per day per agency). I have to deal with 4 text and 6 image agencies. So my daily input is 8000 text messages and 6000 images. The documents extracted from these text messages and images have a size of about 1kb.

The extraction of the data and conversion to Document objects is already finished, and the search using Lucene works like a charm. Brilliant software!

But now to my questions. To explain what I am doing, I'd like to talk a little about the kind of queries and data I have to deal with.

* Every message has a priority: an integer value ranging from 1 to 6.
* Every message has a receive date.
* Every message has an agency assigned, basically a unique string identifier for it.
* Every message has some header data that is also indexed for refined searches.
* And of course the actual text included in the text message itself or the IPTC header of an image.

Typically I have two kinds of queries.

* Typical relational queries
  * Show every text message from a certain agency in the last X days.
  * Show every image or text message with a higher priority than Y and from a certain period of time.
* Fulltext search
  * A real fulltext search over all elements using the full power of Lucene's query language.

It is absolutely no question anymore that the latter queries will be done using Lucene. But whether the first type of query can be handled as well is the thing I am thinking about. Can this be done efficiently with Lucene?
So far we use a system that uses a SQL database engine for storing the relevant data and is used in these queries. But if Lucene is fast enough with these queries too, I am willing to skip the SQL database altogether. But I have to keep in mind that I will be indexing about 400.000 messages per month.

Thanks in advance for every answer. Now I will be going back to having fun with Lucene.

Best regards,
Oliver Andrich
Re: Using 2nd Index to constrain Search
On Friday 27 August 2004 20:10, Mike Upshon wrote:

> Hi
>
> Just starting to evaluate Lucene and hope someone can answer this question.
>
> I am looking at using Lucene to index a very large database. There is a documents table and a few other tables that define which users can view which documents. My question is, is it possible to have an index of the full text contents of the documents and another index that contains the document id's and the user id's, and then use the 2nd index to qualify the full text search over the documents table. The reason I want to do this is to reduce the number of documents that the full text query will run over.

The normal way of doing that is to:

- Make a list of all doc id's for the user.
- From this list construct a Filter for use in the full text index: sort the doc id's, use an IndexReader on the full text index, construct a Term for each doc id, walk the termDocs() for the Term, and set a bit in the filter to allow the document number for the doc id.
- Keep this filter to restrict the searches for the user via IndexSearcher.search(Query, Filter).
- Rebuild the filter when the doc id's for the user change, or when the full text index changes (a document deletion followed by an optimize, or an add, can change any other document's number).

Hmm, this is getting to be a FAQ.

Regards,
Paul Elschot
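The steps above can be sketched as a Filter subclass. This is an illustration only, against the Lucene 1.4-era API, and it assumes each document was indexed with an untokenized field named "docid" holding its database id (the field name is my assumption):

```java
// Sketch only: build a per-user filter by walking termDocs() for
// each of the user's (sorted) doc id's and allowing the matching
// Lucene document numbers.
public class UserFilter extends org.apache.lucene.search.Filter {

    private final String[] sortedDocIds; // the user's doc id's, sorted

    public UserFilter(String[] sortedDocIds) {
        this.sortedDocIds = sortedDocIds;
    }

    public java.util.BitSet bits(org.apache.lucene.index.IndexReader reader)
            throws java.io.IOException {
        java.util.BitSet bits = new java.util.BitSet(reader.maxDoc());
        for (int i = 0; i < sortedDocIds.length; i++) {
            org.apache.lucene.index.TermDocs termDocs = reader.termDocs(
                new org.apache.lucene.index.Term("docid", sortedDocIds[i]));
            try {
                while (termDocs.next()) {
                    bits.set(termDocs.doc()); // allow this document number
                }
            } finally {
                termDocs.close();
            }
        }
        return bits;
    }
}
```

Usage would then be `searcher.search(query, new UserFilter(idsForUser))`; cache the filter per user and rebuild it as described above.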
RE: complex search (newbie)
This should work:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#search(org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20int,%20org.apache.lucene.search.Sort)

Otis

--- Wermus Fernando <[EMAIL PROTECTED]> wrote:

> Thanks :)
>
> It works :). One more question. I need to order the hits by a field in contact called lastname. What do I have to add to the query?
>
> -Original Message-
> From: Bernhard Messer [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, 26 August 2004 02:17 p.m.
> To: Lucene Users List
> Subject: Re: complex search (newbie)
>
> hi,
>
> in general the query parser doesn't allow queries which start with a wildcard. Those queries could end up with very long response times and block your system. This is not what you want.
>
> I'm not sure if I understand what you want to do. I expect that you have a field within a Lucene document with the name "type". For this field you can have different values like "contact", "account", etc. Now you want to search all documents where type is "contact". So the query to do this would be "type:contact", nothing else is required.
>
> Can you try that and give some feedback?
>
> best regards
> Bernhard
>
> Wermus Fernando wrote:
>
> >I am using MultiFieldQueryParser to look up some models. I have several models: account, contacts, tasks, etc. The user chooses models and a query string to look up. Besides fields for searching, I add some conditions to the query string.
> >
> >If he puts "john" to look up and chooses contacts, I add to the query string the following:
> >
> >Query string: "john and type:contact"
> >
> >But, if he wants to look up any contact, MultiFieldQueryParser throws an exception. In this case, the query string is the following:
> >
> >Query string: "* and type:contact"
> >
> >Am I choosing the wrong QueryParser, or is there another easy way to look up several fields and at the same time any content?
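The overload Otis links to takes a Filter, a result limit and a Sort; for just ordering the Hits by lastname there is also the shorter search(Query, Sort) overload. A sketch against the Lucene 1.4-era API (note that a field used for sorting must be indexed but not tokenized):

```java
// Sketch only: sort the results of an existing query by the
// "lastname" field instead of by relevance.
Sort byLastname = new Sort("lastname");
Hits hits = searcher.search(query, byLastname);
```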
Using 2nd Index to constrain Search
Hi,

Just starting to evaluate Lucene and hope someone can answer this question.

I am looking at using Lucene to index a very large database. There is a documents table and a few other tables that define which users can view which documents. My question is, is it possible to have an index of the full text contents of the documents and another index that contains the document id's and the user id's, and then use the 2nd index to qualify the full text search over the documents table. The reason I want to do this is to reduce the number of documents that the full text query will run over. E.g. if there are 50 million documents in the documents table and 20,000 users, and the full text search can be restricted by the user, the search over the documents would run over a substantially smaller set.

Thanks,
Mike
RE: complex search (newbie)
Thanks :)

It works :). One more question. I need to order the hits by a field in contact called lastname. What do I have to add to the query?

-Original Message-
From: Bernhard Messer [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 26 August 2004 02:17 p.m.
To: Lucene Users List
Subject: Re: complex search (newbie)

hi,

in general the query parser doesn't allow queries which start with a wildcard. Those queries could end up with very long response times and block your system. This is not what you want.

I'm not sure if I understand what you want to do. I expect that you have a field within a Lucene document with the name "type". For this field you can have different values like "contact", "account", etc. Now you want to search all documents where type is "contact". So the query to do this would be "type:contact", nothing else is required.

Can you try that and give some feedback?

best regards
Bernhard

Wermus Fernando wrote:

>I am using MultiFieldQueryParser to look up some models. I have several models: account, contacts, tasks, etc. The user chooses models and a query string to look up. Besides fields for searching, I add some conditions to the query string.
>
>If he puts "john" to look up and chooses contacts, I add to the query string the following:
>
>Query string: "john and type:contact"
>
>But, if he wants to look up any contact, MultiFieldQueryParser throws an exception. In this case, the query string is the following:
>
>Query string: "* and type:contact"
>
>Am I choosing the wrong QueryParser, or is there another easy way to look up several fields and at the same time any content?
RE: Range query problem
A description of how to search numerical fields is available on the wiki:

http://wiki.apache.org/jakarta-lucene/SearchNumericalFields

sv

On Thu, 26 Aug 2004, Alex Kiselevski wrote:

> Thanks, I'll try it
>
> -Original Message-
> From: Daniel Naber [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, August 26, 2004 12:59 PM
> To: Lucene Users List
> Subject: Re: Range query problem
>
> On Thursday 26 August 2004 11:02, Alex Kiselevski wrote:
>
> > I have a strange problem with range query "PERIOD:[1 TO 9]". It works only if the second parameter is equal to or less than 9. If it's greater than 9, it finds no documents.
>
> You have to store your numbers so that they will appear in the right order when sorted lexicographically, e.g. save 1 as 01 if you save numbers up to 99, or as 0001 if you save numbers up to 9999. You also have to use this format for searching, I think.
>
> Regards
> Daniel
>
> --
> http://www.danielnaber.de
>
> The information contained in this message is proprietary of Amdocs, protected from disclosure, and may be privileged. The information is intended to be conveyed only to the designated recipient(s) of the message. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, use, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you.
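Daniel's zero-padding rule can be wrapped in a small helper. A sketch (the width is whatever matches your largest value; nothing here is Lucene-specific):

```java
public class NumberPadding {

    // Pad a non-negative number to a fixed width so that lexicographic
    // order matches numeric order, e.g. 9 -> "09" at width 2.
    public static String pad(int n, int width) {
        StringBuilder sb = new StringBuilder(Integer.toString(n));
        while (sb.length() < width) {
            sb.insert(0, '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unpadded, "10" sorts before "9", which breaks range queries.
        System.out.println("9".compareTo("10") > 0);             // prints true
        // Padded, the order is numeric again.
        System.out.println(pad(9, 2).compareTo(pad(10, 2)) < 0); // prints true
        // Index and query with the same format, e.g. PERIOD:[01 TO 10].
        System.out.println(pad(1, 2));                           // prints 01
    }
}
```

Both the indexed values and the query bounds must use the same width; mixing widths makes the comparison lexicographic again.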
Re: speeding up queries (MySQL faster)
FYI, this optimization resulted in a fantastic performance boost! I went from 133 queries/sec to 990 queries/sec! I'm now more limited by socket overhead, as I get 1700 queries/sec when I stick the clients right in the same process as the server.

Oddly enough, the performance increased, but the CPU utilization decreased to around 55% (in both configurations above). I'll have to look into that later, but any additional performance at this point is pure gravy.

-Yonik

--- Yonik Seeley <[EMAIL PROTECTED]> wrote:

> Doug wrote:
> > For example, Nutch automatically translates such clauses into QueryFilters.
>
> Thanks for the excellent pointer Doug! I'll definitely be implementing this optimization.
Re: Content from multiple folders in single index
You should consider using the Ant task in the Sandbox (contributions/ant directory). You'll need to write a custom document handler implementation to handle PDFs and any other types you like. The built-in handler does text and HTML files, but is pluggable. The task uses Ant's filesets to determine what should be indexed, so you could simply have an excludes="include/" to exclude that directory.

Erik

On Aug 25, 2004, at 7:00 PM, John Greenhill wrote:

Hi,

I suspect this is an easy one, but I didn't see a reference in the FAQs so I thought I'd ask. I have a file structure like this:

web
- pages
- downloads (pdf docs)
- include

I want to index the html in pages and the pdf's in downloads, but not the html in include, so I don't want to start my index at web. I've modified the IndexHTML in demo to do the pdf's. What is the best way to do this?

Thanks for your suggestions.

John
Re: weird lock behavior
Yes, creating a unit test that demonstrates things like this is always a problem, but that's probably the only way we can see what's going on.

Otis

--- [EMAIL PROTECTED] wrote:

> Otis,
>
> no idea how I ran into it. That's why creating a unit test is a bit problematic.
>
> In the Lucene code it looks like the reader tries every 1 sec to read the index, 10 times. After that it says: it just cannot be that it's locked so long, I'll query anyway. That's why the 10 sec delay.
>
> >Iouli,
> >This sounds like something that should never happen. It never happens for me at simpy.com - and I use Lucene A LOT there. My guess is that the problem with the commit lock has to do with misuse of IndexReader/Searcher. Like with the memory leak issue, please try isolating the problem in a unit test that we (Lucene developers) can run on our machines.
> >Thanks,
> >Otis
>
> Bernhard,
>
> it's not a problem to unlock the index. The problem is to know that it's time to intercept and do it manually (i.e. it's not a normal lock coming from reader/writer processes but just a phantom that got lost).
>
> hi,
>
> the IndexReader class provides some public static methods to check if an index is locked. If this is the case, there is also a method to unlock an existing index. You could do something like:
>
> Directory dir = FSDirectory.getDirectory(indexDir, false);
> if (IndexReader.isLocked(dir)) {
>     IndexReader.unlock(dir);
> }
> dir.close();
>
> You also should catch the possible IOException in case of an error or if the index can't be unlocked.
>
> fun with it
> Bernhard
>
> [EMAIL PROTECTED] wrote:
>
> >Hi,
> >I experienced the following situation:
> >
> >Suddenly my query became too slow (c. 10 sec instead of c. 1 sec) and the number of returned hits changed from c. 2000 to c. 1800.
> >
> >Tracing the case I've found the locking file "abc...-commit.lck". After deletion of this file everything turned back to normal behavior, i.e. I got my 2000 hits in 1 sec.
> >
> >There were no concurrent writing or reading processes running in parallel.
> >
> >Probably the lock file was lost because of abnormal termination (during development it's ok, but it may happen in production as well).
> >My question is how to handle such a situation, and how to find out and repair in case it happens (in real life there are many concurrent processes and I have no idea which lock file to kill).
Re: weird lock behavior
Otis,

no idea how I ran into it. That's why creating a unit test is a bit problematic.

In the Lucene code it looks like the reader tries every 1 sec to read the index, 10 times. After that it says: it just cannot be that it's locked so long, I'll query anyway. That's why the 10 sec delay.

>Iouli,
>This sounds like something that should never happen. It never happens for me at simpy.com - and I use Lucene A LOT there. My guess is that the problem with the commit lock has to do with misuse of IndexReader/Searcher. Like with the memory leak issue, please try isolating the problem in a unit test that we (Lucene developers) can run on our machines.
>Thanks,
>Otis

Bernhard,

it's not a problem to unlock the index. The problem is to know that it's time to intercept and do it manually (i.e. it's not a normal lock coming from reader/writer processes but just a phantom that got lost).

hi,

the IndexReader class provides some public static methods to check if an index is locked. If this is the case, there is also a method to unlock an existing index. You could do something like:

Directory dir = FSDirectory.getDirectory(indexDir, false);
if (IndexReader.isLocked(dir)) {
    IndexReader.unlock(dir);
}
dir.close();

You also should catch the possible IOException in case of an error or if the index can't be unlocked.

fun with it
Bernhard

[EMAIL PROTECTED] wrote:

>Hi,
>I experienced the following situation:
>
>Suddenly my query became too slow (c. 10 sec instead of c. 1 sec) and the number of returned hits changed from c. 2000 to c. 1800.
>
>Tracing the case I've found the locking file "abc...-commit.lck". After deletion of this file everything turned back to normal behavior, i.e. I got my 2000 hits in 1 sec.
>
>There were no concurrent writing or reading processes running in parallel.
>
>Probably the lock file was lost because of abnormal termination (during development it's ok, but it may happen in production as well).
>My question is how to handle such a situation, and how to find out and repair in case it happens (in real life there are many concurrent processes and I have no idea which lock file to kill).