Re: Combining Lucene and database functionality
Marco Schmidt writes:
> I'm trying to find out whether Lucene is an option for a project of
> mine. I have texts which also have a date and a list of numbers
> associated with each of them. These numbers are ID values which connect
> the article to certain categories. So a particular article X might
> belong to categories 17, 49 and 112. A search for all articles
> containing "foo bar" and belonging to categories 100 to 140 should
> return X (because it also contains "foo bar"). Is it possible to do this
> with Lucene and if it is, how? I've read about the concept of fields in
> Lucene, but it seems to me that you can only store text in them, not
> integers, let alone a list of integers. None of the tutorials I've seen
> deals with more complex queries like that. Basically what I want to
> accomplish could be done nicely with databases with full text search
> capability, if that full text search wasn't so awful.

Where's the problem?

100 is text as well as an integer. (Keep in mind that treating it as text changes the sort order, so you may need leading zeros to compensate.) Lucene does not understand the "words" you index anyway.

So if a document has a field `category' with content '017 049 112' and a `text' field with content 'bla fasel foo bar', and you run a range query 100 - 140 on category (which finds all documents containing any term that sorts alphanumerically between 100 and 140) together with an appropriate query on text, it will find what you want.

There are some caveats, like choosing an appropriate analyzer or considering the maximum number of terms the range query covers, but in principle there is no difference between a text field containing words and a category field containing categories.

Morus

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
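The leading-zero point is easy to get wrong, so here is a minimal sketch in plain Java (no Lucene calls) of what a term range over a padded category field amounts to. The class name, the fixed width of 3 and the helper methods are assumptions made for illustration; a real index would delegate the comparison to a range query on the `category' field.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: zero-padded category IDs compared as text, the way a term range compares them. */
class PaddedCategories {
    // Hypothetical fixed width; it must be wide enough for the largest category ID.
    static final int WIDTH = 3;

    /** Formats a category ID as a fixed-width, zero-padded term, e.g. 17 becomes "017". */
    static String pad(int id) {
        return String.format("%0" + WIDTH + "d", id);
    }

    /** Builds the whitespace-separated field content for one document's categories. */
    static String fieldContent(int... ids) {
        List<String> terms = new ArrayList<>();
        for (int id : ids) {
            terms.add(pad(id));
        }
        return String.join(" ", terms);
    }

    /** True if any term in the field sorts between lower and upper (inclusive) as a string. */
    static boolean anyTermInRange(String fieldContent, int lower, int upper) {
        String lo = pad(lower);
        String hi = pad(upper);
        for (String term : fieldContent.split(" ")) {
            if (term.compareTo(lo) >= 0 && term.compareTo(hi) <= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String categories = fieldContent(17, 49, 112);
        System.out.println(categories);
        System.out.println(anyTermInRange(categories, 100, 140));
        // Unpadded, "12" sorts between "100" and "140" as text, which would be a false match.
        System.out.println("12".compareTo("100") >= 0 && "12".compareTo("140") <= 0);
    }
}
```

The last line of main shows why padding matters: compared as text, the unpadded term "12" falls inside the range 100 to 140.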
displaying 'pages' of search results...
Hi,

Can you share the code for the searcher.search(query, hitCollector) lightweight paging API on the forum? Maybe somebody like me needs it. ;)

Karthik

-----Original Message-----
From: Praveen Peddi [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 22, 2004 1:24 AM
To: Lucene Users List
Subject: Re: displaying 'pages' of search results...

The way we do it is: get all the document ids, cache them, and then get the first 50 documents, the second 50, etc. We wrote a lightweight paging API on top of Lucene. We call searcher.search(query, hitCollector); our HitCollectorImpl implements the collect method and collects only the document id.

Praveen

----- Original Message -----
From: "Chris Fraschetti" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, September 21, 2004 3:33 PM
Subject: displaying 'pages' of search results...

> I was wondering what the best way was to go about returning say
> 1,000,000 results, divided up into say 50 element sections and then
> accessing them via the first 50, second 50, etc etc.
>
> Is there a way to keep the query around so that lucene doesn't need to
> search again, or would the search be cached and no delay arise?
>
> Just looking for some ideas and possibly some implementational issues...
>
> --
> ___
> Chris Fraschetti
> e [EMAIL PROTECTED]
Re: Proxy Con. Problem in Weblogic.
This, of course, isn't the right forum for this question... Not to encourage off-topic posts, but I happen to know at least part of the answer, since we just went through the same issue.

The first thing to do is to make sure you are setting these properties before the first URLStreamHandler for "http" is created. In WebLogic, this means you really must specify them on the command line with -D options, because by the time your code starts to run, WebLogic has already initialized the stream handler. This may or may not help, depending on which JVM is being used.

Also, WebLogic supplies its own http stack; it doesn't use Sun's stack that ships with the JVM. Tomcat does, so the behavior will be different. As long as you are using WebLogic's http stack, be aware that it has different configuration options than the standard Sun stack. I think the proxy settings use the same -D options, but I may be wrong.

You can control which stack is used in various ways. WebLogic recognizes a -D option called weblogic.net.http.URLStreamHandlerFactory. Google for it and you will find out how it is supposed to work. I know it works at least some of the time, but I wasn't able to get it to work for me.

Another way that does work is to set a URLStreamHandler on the URL object when you create it (see the java.net.URL constructors). You can create a URLStreamHandler object explicitly and this way force the use of Sun's stack. Sun's handler is called sun.net.www.protocol.http.Handler.

Hope this helps. Good luck!

Dmitry.

Natarajan.T wrote:

> Hi
>
> FYI, I am doing web crawling in my application using a proxy setting, like the code below:
>
>     Properties systemSettings = System.getProperties();
>     systemSettings.put("http.proxySet", "true");
>     systemSettings.put("http.proxyHost", profileBean.getProfileParamBean().getProxyHost().trim());
>     systemSettings.put("http.proxyPort", profileBean.getProfileParamBean().getProxyPort().trim());
>     System.setProperties(systemSettings);
>
> This code works fine in Tomcat 5.0, but I am getting a runtime error in WebLogic 8. How can I handle this problem?
>
> Thanks in advance.
>
> Regards,
> Natarajan.
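As a rough illustration of the "set the properties early" advice, the sketch below sets the two standard JDK proxy properties at the very start of main, before any HTTP connection is opened. The class name, the validation and the host proxy.example.com:8080 are placeholders invented for this example. Whether WebLogic's own http stack honors these properties is not confirmed in this thread, and inside a container the equivalent -Dhttp.proxyHost=... -Dhttp.proxyPort=... options on the command line are what the advice above actually recommends.

```java
/** Sketch: set the standard JDK proxy properties before any HTTP connection is opened. */
class ProxySettings {

    /** Sets http.proxyHost and http.proxyPort; rejects blank hosts and out-of-range ports. */
    static void configure(String host, int port) {
        if (host == null || host.trim().isEmpty()) {
            throw new IllegalArgumentException("proxy host is empty");
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("proxy port out of range: " + port);
        }
        System.setProperty("http.proxyHost", host.trim());
        System.setProperty("http.proxyPort", Integer.toString(port));
    }

    public static void main(String[] args) {
        // Placeholder host and port; call this before the first "http" URL is opened.
        configure("proxy.example.com", 8080);
        System.out.println(System.getProperty("http.proxyHost")
                + ":" + System.getProperty("http.proxyPort"));
    }
}
```

Validating the values up front is a design choice of this sketch: a blank host or a non-numeric port otherwise tends to surface later as a confusing connection error.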
Proxy Con. Problem in Weblogic.
Hi

FYI, I am doing web crawling in my application using a proxy setting, like the code below:

    Properties systemSettings = System.getProperties();
    systemSettings.put("http.proxySet", "true");
    systemSettings.put("http.proxyHost", profileBean.getProfileParamBean().getProxyHost().trim());
    systemSettings.put("http.proxyPort", profileBean.getProfileParamBean().getProxyPort().trim());
    System.setProperties(systemSettings);

This code works fine in Tomcat 5.0, but I am getting a runtime error in WebLogic 8. How can I handle this problem?

Thanks in advance.

Regards,
Natarajan.
Combining Lucene and database functionality
I'm trying to find out whether Lucene is an option for a project of mine. I have texts which also have a date and a list of numbers associated with each of them. These numbers are ID values which connect the article to certain categories. So a particular article X might belong to categories 17, 49 and 112. A search for all articles containing "foo bar" and belonging to categories 100 to 140 should return X (because it also contains "foo bar").

Is it possible to do this with Lucene and if it is, how? I've read about the concept of fields in Lucene, but it seems to me that you can only store text in them, not integers, let alone a list of integers. None of the tutorials I've seen deals with more complex queries like that.

Basically what I want to accomplish could be done nicely with databases with full text search capability, if that full text search wasn't so awful.

Regards,
Marco
Re: indexing date ranges
If it is unindexed, then you cannot query on it, so you do not have a choice. The other option, if you don't want to store the field data, is to use a field that is indexed, not tokenized, and not stored (you have to use new Field(...) to accomplish that).

Erik

On Sep 21, 2004, at 5:54 PM, Chris Fraschetti wrote:

> Is it most efficient to index or not index 'numeric' ranges that I will do a range search by?
>
> epoc_date:[110448 TO 820483200]
>
> Would it be better to treat it as Field.Keyword or Field.UnIndexed?
>
> --
> ___
> Chris Fraschetti, Student
> CompSci System Admin
> University of San Francisco
> e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu
indexing date ranges
Is it most efficient to index or not index 'numeric' ranges that I will do a range search by?

epoc_date:[110448 TO 820483200]

Would it be better to treat it as Field.Keyword or Field.UnIndexed?

--
___
Chris Fraschetti, Student
CompSci System Admin
University of San Francisco
e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu
Re: displaying 'pages' of search results...
The way we do it is: get all the document ids, cache them, and then get the first 50 documents, the second 50, etc. We wrote a lightweight paging API on top of Lucene. We call searcher.search(query, hitCollector); our HitCollectorImpl implements the collect method and collects only the document id.

Praveen

----- Original Message -----
From: "Chris Fraschetti" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Tuesday, September 21, 2004 3:33 PM
Subject: displaying 'pages' of search results...

> I was wondering what the best way was to go about returning say
> 1,000,000 results, divided up into say 50 element sections and then
> accessing them via the first 50, second 50, etc etc.
>
> Is there a way to keep the query around so that lucene doesn't need to
> search again, or would the search be cached and no delay arise?
>
> Just looking for some ideas and possibly some implementational issues...
>
> --
> ___
> Chris Fraschetti
> e [EMAIL PROTECTED]
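Praveen's code was not posted, so the following is only a guess at the shape of such a paging layer, written in plain Java without Lucene calls. DocIdPager, its collect method and the zero-based page index are all hypothetical names and conventions; in a real setup collect would be driven by the collector passed to searcher.search(query, hitCollector). The sketch also ignores scores and ordering.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch: collect matching document ids once, then serve fixed-size pages from the cache. */
class DocIdPager {
    private final List<Integer> docIds = new ArrayList<>();
    private final int pageSize;

    DocIdPager(int pageSize) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.pageSize = pageSize;
    }

    /** Stand-in for a collector callback: remember only the id, not the document. */
    void collect(int docId) {
        docIds.add(docId);
    }

    /** Number of pages needed for the cached ids (the last page may be shorter). */
    int pageCount() {
        return (docIds.size() + pageSize - 1) / pageSize;
    }

    /** Returns the ids on the zero-based page, or an empty list past the end. */
    List<Integer> page(int pageIndex) {
        if (pageIndex < 0) {
            throw new IllegalArgumentException("pageIndex must not be negative");
        }
        long start = (long) pageIndex * pageSize;
        if (start >= docIds.size()) {
            return Collections.emptyList();
        }
        int from = (int) start;
        int to = (int) Math.min(start + pageSize, (long) docIds.size());
        return new ArrayList<>(docIds.subList(from, to));
    }

    public static void main(String[] args) {
        DocIdPager pager = new DocIdPager(50);
        for (int id = 0; id < 120; id++) {
            pager.collect(id);
        }
        System.out.println(pager.pageCount());
        System.out.println(pager.page(2));
    }
}
```

Only the ids for the requested page then need to be turned into full documents, which is the point of caching ids rather than documents.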
Re: displaying 'pages' of search results...
On Tuesday 21 September 2004 21:33, Chris Fraschetti wrote:
> I was wondering what the best way was to go about returning say
> 1,000,000 results, divided up into say 50 element sections and then
> accessing them via the first 50, second 50, etc etc.
>
> Is there a way to keep the query around so that lucene doesn't need to
> search again, or would the search be cached and no delay arise?
>
> Just looking for some ideas and possibly some implementational issues...

Lucene's Hits class is designed for paging through search results.

In which order would you need the 1,000,000 results?

Regards,
Paul Elschot
Re: displaying 'pages' of search results...
The best first approach is to simply re-query every time the user goes to a new page, keeping around the query in some form or another (perhaps the expression if you're using QueryParser) and the page number. If that is fast enough, then you're done! :) If it is not, then you could consider caching Hits and your IndexSearcher to re-use when paging.

Erik

On Sep 21, 2004, at 3:33 PM, Chris Fraschetti wrote:

> I was wondering what the best way was to go about returning say
> 1,000,000 results, divided up into say 50 element sections and then
> accessing them via the first 50, second 50, etc etc.
>
> Is there a way to keep the query around so that lucene doesn't need to
> search again, or would the search be cached and no delay arise?
>
> Just looking for some ideas and possibly some implementational issues...
>
> --
> ___
> Chris Fraschetti
> e [EMAIL PROTECTED]
displaying 'pages' of search results...
I was wondering what the best way was to go about returning say 1,000,000 results, divided up into say 50 element sections and then accessing them via the first 50, second 50, etc etc.

Is there a way to keep the query around so that lucene doesn't need to search again, or would the search be cached and no delay arise?

Just looking for some ideas and possibly some implementational issues...

--
___
Chris Fraschetti
e [EMAIL PROTECTED]
Search result grouped in categories?
Hello everyone,

I'm trying to use Lucene in a web project to search for products. The problem is that I have to display the search results grouped by category. There are about 500,000 products and every product belongs to a category. There are 150 categories.

Now for every search I would like to display the result like this:

Matches were found in category A and B:
Category A (500 matches)
- < a list of the 10 best matches in this category >
Category B (12 matches)
- < a list of the 10 best matches in this category >

So I'm trying to figure out the best way to accomplish this. I would like to avoid iterating over all matching products to find their categories (there can be about 50,000).

Is it reasonable to do the search on all categories first and then apply a filter for each category (150 filters)? How fast are filters?

Or would it be better to have a separate index for each category?

Or maybe iterate over all the matching products and store them in a hashmap with the category as key?

Thanks,
William
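For the last option in the question (a hashmap keyed by category), a minimal sketch in plain Java could look like the one below. It assumes the hits arrive best-first and that each hit already carries its category; Hit, Group and group(...) are names invented for this example, not Lucene classes. It does make one pass over every match, so whether it is fast enough for about 50,000 hits would need measuring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: one pass over ranked hits, keeping a match count and the top N hits per category. */
class CategoryGrouper {

    /** A hit as assumed here: a product id plus the category it belongs to. */
    static final class Hit {
        final int productId;
        final String category;

        Hit(int productId, String category) {
            this.productId = productId;
            this.category = category;
        }
    }

    /** Per-category result: total number of matches and the best few hits. */
    static final class Group {
        int matches;
        final List<Hit> best = new ArrayList<>();
    }

    /** Groups hits by category; rankedHits must already be in best-first order. */
    static Map<String, Group> group(List<Hit> rankedHits, int perCategory) {
        Map<String, Group> groups = new LinkedHashMap<>();
        for (Hit hit : rankedHits) {
            Group g = groups.computeIfAbsent(hit.category, k -> new Group());
            g.matches++;
            if (g.best.size() < perCategory) {
                g.best.add(hit);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(1, "A"), new Hit(2, "B"), new Hit(3, "A"), new Hit(4, "A"));
        for (Map.Entry<String, Group> e : group(hits, 2).entrySet()) {
            System.out.println("Category " + e.getKey() + " (" + e.getValue().matches + " matches)");
        }
    }
}
```

A LinkedHashMap keeps categories in the order of their best hit, which matches the "Category A, Category B" display in the question.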
RE: bug in MultiFieldQueryParser.parse
Luceners,

I'm using SnowballAnalyzer for Spanish, both for indexing and for searching. But when I open Luke and look up the document with StandardAnalyzer (I don't know how to use SnowballAnalyzer for Spanish in Luke), I see the tokens with their case preserved, and I see the word "tarea" instead of "tare". So I get "tare" when searching with the Spanish Snowball analyzer, but "tarea" when indexing with it (I'm using Luke to look at the index).

Thanks in advance.

-----Original Message-----
From: Wermus Fernando [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 21, 2004 10:47 a.m.
To: [EMAIL PROTECTED]
Subject: bug in MultiFieldQueryParser.parse

I have this query string:

queryString = tarea AND (tipo:contact OR tipo:account OR tipo:opportunity OR tipo:event OR tipo:task)

and when I parse it with

query=MultiFieldQueryParser.parse(queryString,fields,analyzer);

I get one letter less. I have "tarea" and MultiFieldQueryParser changes it to "tare". I don't know why.

(+mobile:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+fax:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+firstname:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+otherPhone:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+phone:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+lastname:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+salutation:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+email:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+symbol:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+phone:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+shortName:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+location:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+fax:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+number:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+name:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+date:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+subject:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+expiration:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+subject:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
RE: Problems with Lucene + BDB (Berkeley DB) integration
Try setUseCompoundFile(false) on your IndexWriter as soon as you create it, or before you call optimize.

-----Original Message-----
From: Christian Rodriguez [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 21, 2004 1:10 PM
To: Lucene Users List
Subject: Re: Problems with Lucene + BDB (Berkeley DB) integration

Andy, you are right. I tried with Lucene 1.3 and it worked perfectly. This should be added to a README in the Lucene + BDB sandbox (or somewhere) so people don't spend days struggling with those weird non-deterministic bugs I am getting...

Now, I do need to use version 1.4, so I'd like to see if it's actually the cluster file system causing the problem in the integration with Berkeley DB. Is it possible to turn the cluster file system OFF in Lucene 1.4?

Does anyone have any idea what other things could be causing BDB + Lucene to work with Lucene 1.3 and not with 1.4?

Thanks!
Xtian

On Mon, 20 Sep 2004 18:13:46 -0700, Andy Goodell <[EMAIL PROTECTED]> wrote:
> I used BDB + lucene successfully using the lucene 1.3 distribution,
> but it broke in my application with the 1.4 distribution. The 1.4
> dist uses a different file system by default, the "cluster file
> system", so maybe that is the source of the issues.
>
> good luck,
> andy g
>
> On Mon, 20 Sep 2004 19:36:51 -0300, Christian Rodriguez
> <[EMAIL PROTECTED]> wrote:
> > Hi everyone,
> >
> > I am trying to use the Lucene + BDB integration from the sandbox
> > (http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/db/).
> > I installed C Berkeley DB 4.2.52 and I have the Lucene jar file.
> >
> > I have an example program that indexes 4 small text files in a
> > directory (it's very similar to the IndexFiles.java in the Lucene demo,
> > except that it uses BDB + Lucene). The problem I have is that
> > executing the indexing program generates different results each time I
> > run it. For example: If I start with an empty index, run the indexing
> > program and then query the index I get the correct results; then I
> > delete the index to start from scratch again, and perform the same
> > sequence and I get no results. (?)
> >
> > What puzzles me is the non-deterministic results... the same execution
> > sequence generates two different results. I then wrote a program to
> > dump the index and I found out that the list of files that end up in
> > the index is different every time I index those 4 files.
> >
> > For example:
> > 1st run: contents of directory: _4.f2, _4.f3, _4.cfs, _4.fdx, _4.fnm,
> > _4.frq, _4.prx, _4.tii, segments, deletable. (9 files)
> > 2nd run: contents of directory: 0:_4.f1, _4.cfs, _4.fdt, _4.fdx,
> > _4.fnm, _4.frq, _4.prx, _4.tii, _4.tis, segments, deletable. (11
> > files)
> >
> > Does anyone have any idea why this is happening?
> > Has anyone been able to use the BDB + Lucene integration with no problems?
> >
> > I'd appreciate any help or pointers.
> > Thanks!
> > Xtian
Re: Problems with Lucene + BDB (Berkeley DB) integration
Andy, you are right. I tried with Lucene 1.3 and it worked perfectly. This should be added to a README in the Lucene + BDB sandbox (or somewhere) so people don't spend days struggling with those weird non-deterministic bugs I am getting...

Now, I do need to use version 1.4, so I'd like to see if it's actually the cluster file system causing the problem in the integration with Berkeley DB. Is it possible to turn the cluster file system OFF in Lucene 1.4?

Does anyone have any idea what other things could be causing BDB + Lucene to work with Lucene 1.3 and not with 1.4?

Thanks!
Xtian

On Mon, 20 Sep 2004 18:13:46 -0700, Andy Goodell <[EMAIL PROTECTED]> wrote:
> I used BDB + lucene successfully using the lucene 1.3 distribution,
> but it broke in my application with the 1.4 distribution. The 1.4
> dist uses a different file system by default, the "cluster file
> system", so maybe that is the source of the issues.
>
> good luck,
> andy g
>
> On Mon, 20 Sep 2004 19:36:51 -0300, Christian Rodriguez
> <[EMAIL PROTECTED]> wrote:
> > Hi everyone,
> >
> > I am trying to use the Lucene + BDB integration from the sandbox
> > (http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/db/).
> > I installed C Berkeley DB 4.2.52 and I have the Lucene jar file.
> >
> > I have an example program that indexes 4 small text files in a
> > directory (it's very similar to the IndexFiles.java in the Lucene demo,
> > except that it uses BDB + Lucene). The problem I have is that
> > executing the indexing program generates different results each time I
> > run it. For example: If I start with an empty index, run the indexing
> > program and then query the index I get the correct results; then I
> > delete the index to start from scratch again, and perform the same
> > sequence and I get no results. (?)
> >
> > What puzzles me is the non-deterministic results... the same execution
> > sequence generates two different results. I then wrote a program to
> > dump the index and I found out that the list of files that end up in
> > the index is different every time I index those 4 files.
> >
> > For example:
> > 1st run: contents of directory: _4.f2, _4.f3, _4.cfs, _4.fdx, _4.fnm,
> > _4.frq, _4.prx, _4.tii, segments, deletable. (9 files)
> > 2nd run: contents of directory: 0:_4.f1, _4.cfs, _4.fdt, _4.fdx,
> > _4.fnm, _4.frq, _4.prx, _4.tii, _4.tis, segments, deletable. (11
> > files)
> >
> > Does anyone have any idea why this is happening?
> > Has anyone been able to use the BDB + Lucene integration with no problems?
> >
> > I'd appreciate any help or pointers.
> > Thanks!
> > Xtian
bug in MultiFieldQueryParser.parse
I have this query string:

queryString = tarea AND (tipo:contact OR tipo:account OR tipo:opportunity OR tipo:event OR tipo:task)

and when I parse it with

query=MultiFieldQueryParser.parse(queryString,fields,analyzer);

I get one letter less. I have "tarea" and MultiFieldQueryParser changes it to "tare". I don't know why.

(+mobile:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+fax:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+firstname:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+otherPhone:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+phone:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+lastname:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+salutation:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+email:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+symbol:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+phone:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+shortName:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+location:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+fax:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+number:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+name:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+date:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+subject:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+expiration:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))
(+subject:tare +(tipo:contact tipo:account tipo:opportunity tipo:event tipo:task))