Re: Newbie questions
Hi again,

So is SqlDirectory recommended for use in a cluster to work around the accessibility problem, or are people using NFS or a standalone server instead?

Thanks in advance,
PJ

--- Paul Jans <[EMAIL PROTECTED]> wrote:

> I've already ordered Lucene in Action :)
>
> > There is a LuceneRAR project that is still in its infancy here:
> > https://lucenerar.dev.java.net/
>
> I will keep an eye on that for sure.
>
> > You can also store a Lucene index in Berkeley DB (look at the
> > /contrib/db area of the source code repository)
>
> We're already using Oracle, so would it be possible to store the index there, thus giving each cluster node easy access to it? I read about SqlDirectory in the archives, but it looks like it didn't make it into the API and I don't see it on the contrib page.
>
> I'm more concerned about making the index accessible than about transactional consistency, so NFS may be another option, as you mention. I'm curious to hear about other clustered systems and how others are doing this; lessons learned, best practices, etc.
>
> Thanks again for the help. Lucene looks like a first-class tool.
>
> PJ
>
> --- Erik Hatcher <[EMAIL PROTECTED]> wrote:
> > On Feb 10, 2005, at 5:00 PM, Paul Jans wrote:
> > > A couple of newbie questions. I've searched the archives and read the Javadoc, but I'm still having trouble figuring these out.
> >
> > Don't forget to get your copy of Lucene in Action too :)
> >
> > > 1. What's the best way to index and handle queries like the following: find me all users with (a CS degree and a GPA > 3.0) or (a Math degree and a GPA > 3.5)?
> >
> > Some suggestions: index degree as a Keyword field. Pad GPA so that all values are of the form #.# (or maybe #.##). Numerics need to be lexicographically ordered, and thus padded. With the right analyzer (see the AnalysisParalysis page on the wiki) you could use this type of query with QueryParser:
> >
> >     degree:cs AND gpa:[3.0 TO 9.9]
> >
> > > 2. What are the best practices for using Lucene in a clustered J2EE environment? A standalone index/search server, or storing the index in the database, or something else?
> >
> > There is a LuceneRAR project that is still in its infancy here: https://lucenerar.dev.java.net/
> >
> > You can also store a Lucene index in Berkeley DB (look at the /contrib/db area of the source code repository).
> >
> > However, most projects do fine with cruder techniques, such as sharing the Lucene index on a common drive and ensuring that locking is configured to use the common drive also.
> >
> > Erik

---
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Newbie questions
On Feb 14, 2005, at 2:40 PM, Paul Jans wrote:
> Hi again,
>
> So is SqlDirectory recommended for use in a cluster to work around the accessibility problem, or are people using NFS or a standalone server instead?

Neither. As far as I know, Berkeley DB is the only viable DB implementation currently. NFS has notoriously had issues with Lucene and file locking. Search the archives for more details on this.

Erik
Re: Newbie questions
On Feb 10, 2005, at 5:00 PM, Paul Jans wrote:
> A couple of newbie questions. I've searched the archives and read the Javadoc, but I'm still having trouble figuring these out.

Don't forget to get your copy of Lucene in Action too :)

> 1. What's the best way to index and handle queries like the following: find me all users with (a CS degree and a GPA > 3.0) or (a Math degree and a GPA > 3.5)?

Some suggestions: index degree as a Keyword field. Pad GPA so that all values are of the form #.# (or maybe #.##). Numerics need to be lexicographically ordered, and thus padded. With the right analyzer (see the AnalysisParalysis page on the wiki) you could use this type of query with QueryParser:

    degree:cs AND gpa:[3.0 TO 9.9]

> 2. What are the best practices for using Lucene in a clustered J2EE environment? A standalone index/search server, or storing the index in the database, or something else?

There is a LuceneRAR project that is still in its infancy here: https://lucenerar.dev.java.net/

You can also store a Lucene index in Berkeley DB (look at the /contrib/db area of the source code repository).

However, most projects do fine with cruder techniques, such as sharing the Lucene index on a common drive and ensuring that locking is configured to use the common drive also.

Erik
Re: Newbie questions
On Feb 11, 2005, at 1:36 PM, Erik Hatcher wrote:
> > Find me all users with (a CS degree and a GPA > 3.0) or (a Math degree and a GPA > 3.5).
>
> Some suggestions: index degree as a Keyword field. Pad GPA so that all values are of the form #.# (or maybe #.##). Numerics need to be lexicographically ordered, and thus padded. With the right analyzer (see the AnalysisParalysis page on the wiki) you could use this type of query with QueryParser:
>
>     degree:cs AND gpa:[3.0 TO 9.9]

Oops: to be completely technically correct, use curly brackets to get > rather than >=:

    degree:cs AND gpa:{3.0 TO 9.9}

(I'll assume GPAs only go to 4.0 or 5.0 :)

Erik
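The difference between the two bracket styles, with GPA values padded to the fixed #.# form as Erik suggests, comes down to inclusive versus exclusive string-range bounds. A plain-Java sketch of the semantics (an illustration of the behavior, not Lucene's actual implementation):

```java
import java.util.List;
import java.util.stream.Collectors;

// Mimics the semantics of Lucene range queries over string terms:
// [lo TO hi] is inclusive of the endpoints, {lo TO hi} is exclusive.
public class RangeSemantics {
    public static List<String> inRange(List<String> terms, String lo, String hi, boolean inclusive) {
        return terms.stream()
            .filter(t -> inclusive
                ? t.compareTo(lo) >= 0 && t.compareTo(hi) <= 0
                : t.compareTo(lo) > 0 && t.compareTo(hi) < 0)
            .collect(Collectors.toList());
    }
}
```

With padded terms "2.9", "3.0", "3.5", "4.0", the inclusive range [3.0 TO 9.9] matches "3.0" while the exclusive range {3.0 TO 9.9} does not, which is exactly the correction above.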
Re: Newbie questions
I've already ordered Lucene in Action :)

> There is a LuceneRAR project that is still in its infancy here:
> https://lucenerar.dev.java.net/

I will keep an eye on that for sure.

> You can also store a Lucene index in Berkeley DB (look at the
> /contrib/db area of the source code repository)

We're already using Oracle, so would it be possible to store the index there, thus giving each cluster node easy access to it? I read about SqlDirectory in the archives, but it looks like it didn't make it into the API and I don't see it on the contrib page.

I'm more concerned about making the index accessible than about transactional consistency, so NFS may be another option, as you mention. I'm curious to hear about other clustered systems and how others are doing this; lessons learned, best practices, etc.

Thanks again for the help. Lucene looks like a first-class tool.

PJ

--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
> On Feb 10, 2005, at 5:00 PM, Paul Jans wrote:
> > A couple of newbie questions. I've searched the archives and read the Javadoc, but I'm still having trouble figuring these out.
>
> Don't forget to get your copy of Lucene in Action too :)
>
> > 1. What's the best way to index and handle queries like the following: find me all users with (a CS degree and a GPA > 3.0) or (a Math degree and a GPA > 3.5)?
>
> Some suggestions: index degree as a Keyword field. Pad GPA so that all values are of the form #.# (or maybe #.##). Numerics need to be lexicographically ordered, and thus padded. With the right analyzer (see the AnalysisParalysis page on the wiki) you could use this type of query with QueryParser:
>
>     degree:cs AND gpa:[3.0 TO 9.9]
>
> > 2. What are the best practices for using Lucene in a clustered J2EE environment? A standalone index/search server, or storing the index in the database, or something else?
>
> There is a LuceneRAR project that is still in its infancy here: https://lucenerar.dev.java.net/
>
> You can also store a Lucene index in Berkeley DB (look at the /contrib/db area of the source code repository).
>
> However, most projects do fine with cruder techniques, such as sharing the Lucene index on a common drive and ensuring that locking is configured to use the common drive also.
>
> Erik
Newbie questions
Hi,

A couple of newbie questions. I've searched the archives and read the Javadoc, but I'm still having trouble figuring these out.

1. What's the best way to index and handle queries like the following: find me all users with (a CS degree and a GPA > 3.0) or (a Math degree and a GPA > 3.5)?

2. What are the best practices for using Lucene in a clustered J2EE environment? A standalone index/search server, or storing the index in the database, or something else?

Thank you in advance,
PJ
Re: Newbie Questions: Site Scoping, Page Type Filtering/Sorting, Localization, Clustering
On May 30, 2004, at 10:34 PM, Sasha Haghani wrote:
> I am a newbie to Lucene and I'm considering using it in an upcoming project. I've read through the documentation but I still have a number of questions:

I'll do my best with some pointers below...

> 1. SEGMENTING AN INDEX & QUERIES BY SITE SCOPE
>
> In my use case, I have a number of logical websites backed by the same underlying content store. A Document may ultimately end up belonging to one or more logical sites, but at a distinct URL for each. The simplistic solution is to maintain indices for each logical site, but this will result in some unwanted duplication and the need to update multiple indices on shared content changes. Other than that, can anyone suggest approaches for how to segment a single index to accommodate multiple logical sites and allow queries within a particular site's scope? Are fields the solution? How should the distinct per-site URLs be managed?

I don't think there is a definitive best way to do this. Per-site indexes are one option. Using a site field is another. Queries for a particular site could be done either by using QueryFilter or by wrapping all queries in a BooleanQuery with a required TermQuery for the site. Sites could share documents by simply adding multiple Field.Keyword("site", site) values to the documents.

> 2. LOCALIZED CONTENT
>
> I understand that at its core, Lucene can support content from any locale and character set supported by Java. What is the best way of implementing Lucene to handle a content base which includes numerous locales? One index per locale, or should all Documents be placed in a single index and tagged with a locale field? Or is there another approach altogether?

Again, there isn't really a best way, I don't think. How does the locale situation relate to the previously mentioned site separation? A locale field is a perfectly reasonable way to go also. I don't know of any other approach.

> 3. DOCUMENT URLS
>
> Is the URL at which the original document can be retrieved generally (i.e., for linking search results to the original doc) stored as a non-indexed, non-tokenized, stored Field in the Document?

It depends on whether you want to query for it or not. Field.Keyword if you want to be able to query for it; Field.UnIndexed if you want it with the attributes you specified.

> 4. QUERY FILTERING & SORTING BY FIELD VALUE
>
> In my application I have a pretty typical need to distinguish between different document types (e.g., FAQs, Articles, Reviews, etc.) in order to allow the user to restrict their results to particular types of documents or to sort results by type. Are fields again the solution for this? Can Queries filter or sort results/hits on exact field values (i.e., non-tokenized field values)?

Fields are generally the solution :) What else is there? Documents have Fields; Fields are where you put metadata about documents. A document type makes perfect sense to put in a field. QueryFilter or the BooleanQuery AND trick mentioned above would allow you to narrow results down to a particular set of types.

Sorting works on exact values, yes, and you can write your own sorting implementation if lexicographic or numeric sorting is not sufficient, which could key off external information if needed. To sort on a field, it needs to be indexed and non-tokenized (stored is irrelevant), and there must be only a single term for that field in a document. Check the Javadocs for the Sort class for more details on the sorting requirements.

> 5. DEPLOYING LUCENE IN A CLUSTERED WEB-APP ENVIRONMENT
>
> How is Lucene to be deployed in a clustered web-app environment? Do all cluster nodes require access to a networked filesystem containing the index files, or is there another solution? How is concurrency managed when the index is being incrementally updated?

This is entirely up to you to manage. I'm sure developers building solutions with Lucene have employed all sorts of various architectures.

Concurrency is managed via lock files that need to be shared among apps interacting with the index. The short answer is that only a single process (but multiple threads sharing an IndexWriter) can index at a time. You would probably want to build some sort of queuing infrastructure and have a single indexer, or index into separate indexes and merge them.

> Any answers and suggestions are much appreciated. Thanks.

I hope this helps some.

Erik
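The single-indexer queuing infrastructure Erik describes can be sketched in plain Java. This is a hypothetical skeleton, not Lucene code: the string "documents" and the in-memory list stand in for real documents and for calls on one shared IndexWriter, and the poison-pill value is an arbitrary choice:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a single-writer indexing queue: many producers submit work,
// exactly one consumer thread applies it, so only one thread ever needs
// to hold the index write lock.
public class SingleIndexerQueue {
    private static final String STOP = "__STOP__"; // poison pill
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> indexed = new ArrayList<>(); // stands in for the index
    private final Thread worker;

    public SingleIndexerQueue() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    String doc = queue.take();      // blocks until work arrives
                    if (doc.equals(STOP)) break;    // shut the worker down
                    synchronized (indexed) {
                        indexed.add(doc);           // real code: writer.addDocument(...)
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
    }

    public void submit(String doc) { queue.add(doc); }

    // Stops the worker and returns everything that was "indexed".
    public List<String> shutdownAndDrain() {
        queue.add(STOP);
        try { worker.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        synchronized (indexed) { return new ArrayList<>(indexed); }
    }
}
```

The alternative Erik mentions, indexing into separate indexes and merging them, avoids the queue at the cost of a periodic merge step.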
Newbie Questions: Site Scoping, Page Type Filtering/Sorting, Localization, Clustering
Hi there,

I am a newbie to Lucene and I'm considering using it in an upcoming project. I've read through the documentation but I still have a number of questions:

1. SEGMENTING AN INDEX & QUERIES BY SITE SCOPE

In my use case, I have a number of logical websites backed by the same underlying content store. A Document may ultimately end up belonging to one or more logical sites, but at a distinct URL for each. The simplistic solution is to maintain indices for each logical site, but this will result in some unwanted duplication and the need to update multiple indices on shared content changes. Other than that, can anyone suggest approaches for how to segment a single index to accommodate multiple logical sites and allow queries within a particular site's scope? Are fields the solution? How should the distinct per-site URLs be managed?

2. LOCALIZED CONTENT

I understand that at its core, Lucene can support content from any locale and character set supported by Java. What is the best way of implementing Lucene to handle a content base which includes numerous locales? One index per locale, or should all Documents be placed in a single index and tagged with a locale field? Or is there another approach altogether?

3. DOCUMENT URLS

Is the URL at which the original document can be retrieved generally (i.e., for linking search results to the original doc) stored as a non-indexed, non-tokenized, stored Field in the Document?

4. QUERY FILTERING & SORTING BY FIELD VALUE

In my application I have a pretty typical need to distinguish between different document types (e.g., FAQs, Articles, Reviews, etc.) in order to allow the user to restrict their results to particular types of documents or to sort results by type. Are fields again the solution for this? Can Queries filter or sort results/hits on exact field values (i.e., non-tokenized field values)?

5. DEPLOYING LUCENE IN A CLUSTERED WEB-APP ENVIRONMENT

How is Lucene to be deployed in a clustered web-app environment? Do all cluster nodes require access to a networked filesystem containing the index files, or is there another solution? How is concurrency managed when the index is being incrementally updated?

Any answers and suggestions are much appreciated. Thanks.

--Daniel
Newbie Questions
Hi all...

I've been playing with Lucene for a couple of days now and I have a couple of questions I'm hoping someone can help me with.

I've created a Lucene index with data from a database that's in several different fields, and I want to set up a web page where users can search the index. Ideally, all searches should be as Google-like as possible. In Lucene terms, I guess this means the query should be fuzzy. For example, if someone searches for "cancer" then I'd like to get back all results with any form of the word cancer in them ("cancerous", "breast cancer", etc.).

So far, I seem to be having two problems:

1) How can I search all fields at the same time? The QueryParser seems to only search one specific field.

2) How can I automatically default all searches into fuzzy mode? I don't want my users to have to know that they must add a ~ at the end of all their terms.

Thanks,
-Mark
RE: Newbie Questions
Hi Mark,

Short answers to your questions:

ad 1: MultiFieldQueryParser is what you might want: you can specify the fields to run the query on. Alternatively, the practice of duplicating the contents of all the separate fields into one additional merged field has been suggested, which enables you to use QueryParser itself.

ad 2: Depending on the Analyzer you use, the query is normalised, i.e., stemmed (suffixes removed from words) and stopword-filtered (highly frequent words removed). Have a look at StandardAnalyzer.tokenStream(...) to see how the different filters work. In the analysis package, the 1.3rc2 Lucene distribution has a Porter stemming algorithm: PorterStemmer.

Have fun,

Gregor
RE: Newbie Questions
1. You need to use MultiFieldQueryParser.

2. I think you should use PorterStemFilter instead of a fuzzy query:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PorterStemFilter.html
Re: Newbie Questions
On Tuesday, August 26, 2003, at 12:53 AM, Mark Woon wrote:
> 1) How can I search all fields at the same time? The QueryParser seems to only search one specific field.

The common thing I've done and seen others do is glue all the fields together into a master searchable field named something like "contents" or "keywords" (be sure to put a space in between text so it can be tokenized properly).

> 2) How can I automatically default all searches into fuzzy mode? I don't want my users to have to know that they must add a ~ at the end of all their terms.

Your description of searches for "cancer" finding "cancerous" isn't really what the fuzzy query is about. What you're after, I think, is more the stemming algorithms used during the analysis phase. Have a look at the SnowballAnalyzer in the Lucene sandbox. There is a little bit about it in the article I wrote for java.net: http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html - it definitely sounds like more work in the analysis phase is what you're after.

Erik
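The field-gluing approach Erik describes can be sketched in plain Java. The field names here are illustrative; in real code the joined string would be added to the Document as one additional tokenized field such as "contents":

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Build one master "contents" value by gluing the individual field
// values together with spaces, so a single-field QueryParser search
// covers all of them.
public class ContentsGlue {
    public static String glue(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String value : fields.values()) {
            if (sb.length() > 0) sb.append(' '); // the space keeps adjacent tokens separate
            sb.append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "Breast cancer research");
        doc.put("body", "Early detection of cancerous cells");
        System.out.println(glue(doc));
        // prints: Breast cancer research Early detection of cancerous cells
        // real code would then do something like:
        // document.add(Field.Text("contents", glue(doc)));
    }
}
```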
RE: Newbie Questions
Hi Mark,

Sorry, it's really rc1 which is out. But if you go to the CVS server, you'll find the rc2-dev version.

"Multiple calls to Document.add with the same field results in that their text is treated as though appended for the purposes of search." (API doc). Can you try out whether there's a difference between the cases you mention? I don't know, but I'd be interested as well ;-)

Gregor

-----Original Message-----
From: Mark Woon [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 26, 2003 8:52 PM
To: Lucene Users List
Subject: Re: Newbie Questions

Gregor Heinrich wrote:
> ad 1: MultiFieldQueryParser is what you might want: you can specify the fields to run the query on. Alternatively, the practice of duplicating the contents of all the separate fields into one additional merged field has been suggested, which enables you to use QueryParser itself.

Ah, I've been testing out something similar to the latter. I've been adding multiple values under the same key. Won't this have the same effect? I've been assuming that if I do

    doc.add(Field.Keyword("content", value1));
    doc.add(Field.Keyword("content", value2));

and did a search on the "content" field for either value, I'd get a hit, and it seems to work. This way, I figure I'd be able to differentiate between values that I want tokenized and values that I don't. Is there a difference between this and building a StringBuffer containing all the values and storing that as a single field value?

> ad 2: Depending on the Analyzer you use, the query is normalised, i.e., stemmed (suffixes removed from words) and stopword-filtered (highly frequent words removed). Have a look at StandardAnalyzer.tokenStream(...) to see how the different filters work. In the analysis package, the 1.3rc2 Lucene distribution has a Porter stemming algorithm: PorterStemmer.

There's an rc2 out? Where?? I just checked the Lucene website and only see rc1.

Thanks everyone for all the quick responses!

-Mark
Newbie Questions
Hi there,

I'm new to Lucene and have what will hopefully be a couple of simple questions.

1. Can I index numbers with Lucene? If so, ints or floats or ?

2. Can I index dates with Lucene?

In either case, is there any way I can sort the results returned by a search on these fields? Also, can I search for only documents which have been indexed with a range in one of these fields? For example: I only want documents where the 'cost' field is between 1000 and 2000 and where the date of manufacture was prior to 13th June 1978.

cheers,
Chris
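Chris's questions go unanswered in this thread, but the padding trick Erik mentions earlier in the archive applies: Lucene of this era compares terms as strings, so numbers and dates sort and range-match correctly only when every value is formatted to the same width. A plain-Java sketch (the field widths and the yyyyMMdd date format are illustrative choices, not a Lucene requirement):

```java
// Zero-pad numbers and format dates as yyyyMMdd so that ordinary string
// comparison (what string-based range queries and sorting use) agrees
// with numeric and chronological order.
public class LexicographicKeys {
    // e.g. padNumber(42, 7) -> "0000042"
    public static String padNumber(long n, int width) {
        return String.format("%0" + width + "d", n);
    }

    // e.g. dateKey(1978, 6, 13) -> "19780613"
    public static String dateKey(int year, int month, int day) {
        return String.format("%04d%02d%02d", year, month, day);
    }
}
```

Without padding, "1000" sorts before "200" as strings; with a fixed width of 7, "0000200" correctly sorts before "0001000", so a cost range of [0001000 TO 0002000] and a date range up to "19780613" behave as expected.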
newbie questions
I'm trying to implement Lucene in my application, but I'm really a newbie.

1) If I want to create an index in the directory E:\Lucene, must I just do

    writer = new IndexWriter("E:/Lucene", null, true);

?

2) How exactly can I create an index in a database? Can anybody send a sample?

3) Talking about the boolean third parameter in IndexWriter: if I write

    writer = new IndexWriter("E:/Lucene", null, false);

and the index doesn't exist, is the index created anyway? (I must use it to check whether the index has already been written or not.)

Thanks a lot!!!

__________________________________
David Bonilla Fuertes
THE BIT BANG NETWORK
http://www.bit-bang.com
Profesor Waksman, 8, 6º B
28036 Madrid
SPAIN
Tel.: (+34) 914 577 747
Móvil: 656 62 83 92
Fax: (+34) 914 586 176
__________________________________