Re: [Lucene2.0]How to not highlight keywords in some fields?
Pass a field name to the QueryScorer constructor. See the testFieldSpecificHighlighting method in the highlighter's JUnit test for an example. Cheers Mark zhu jiang wrote: Hi all, For example, if I have a document with two fields, text and num, like this: text:"foo bar 1" num:1 When users query foo, I change the query to text:foo AND num:1, and both foo and 1 in the text field will be highlighted. I don't want the word 1 in the text field to be highlighted. What should I do? Please help me
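A sketch of Mark's suggestion, assuming the Lucene 2.0 highlighter contrib and the illustrative field values from the question (the QueryScorer(Query, String) constructor is the field-specific one referred to above; see the test mentioned there if your version differs):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    Query query = new QueryParser("text", new StandardAnalyzer())
        .parse("text:foo AND num:1");
    // Score only terms belonging to the "text" field, so the num:1 clause
    // no longer causes "1" to be highlighted inside the text field.
    Highlighter highlighter = new Highlighter(new QueryScorer(query, "text"));
    String fragment = highlighter.getBestFragment(
        new StandardAnalyzer(), "text", "foo bar 1");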
Re[3]: how to enhance speed of sorted search
Hello, Chris. CH> 3) most likely, if you are seeing slow performance from sorted searches, CH> the time spent scoring the results isn't the biggest contributor to how CH> long the search takes -- it tends to be negligible for most queries. A CH> better question is: are you reusing the exact same IndexReader / CH> IndexSearcher instance for every query? ... if not, that right there is CH> going to be your biggest problem, because it will prevent you from being CH> able to reuse the FieldCache needed when sorting results. Sure, I do reuse the IndexSearcher :) and the second query is always faster than the first one... I am thinking: should this query = new QueryParser("text", new StandardAnalyzer()).parse("good boy"); IndexSearcher.search(new ConstantScoreQuery(new QueryFilter(query)), sortByIntField); be faster than the usual search IndexSearcher.search(query, sortByIntField)? Is there any way I could use a filter to get the needed results from the query? -- Yura Smolsky, http://altervisionmedia.com/
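For reference, a sketch of the two variants being compared (the searcher variable and field names are assumptions; ConstantScoreQuery and QueryFilter are as in Lucene 2.0):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.*;

    Sort sortByIntField = new Sort(new SortField("price", SortField.INT));
    Query query = new QueryParser("text", new StandardAnalyzer()).parse("good boy");
    // usual search: every match is scored, then results are sorted by the field
    Hits usual = searcher.search(query, sortByIntField);
    // filtered variant: matches come from the filter's BitSet and all receive a
    // constant score, so per-document scoring work is skipped
    Hits filtered = searcher.search(
        new ConstantScoreQuery(new QueryFilter(query)), sortByIntField);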
How to tell if IndexSearcher/IndexReader was closed?
Hi all, after I delete some entries from the index, I close the IndexSearcher to ensure that the changes are done. But after this I couldn't figure out a way to tell if the searcher is closed or not. Any ideas? Regards Frank
Re: How to tell if IndexSearcher/IndexReader was closed?
I guess there are many possibilities to implement some control structure to track the references to your searcher / reader. As it is best practice to have one single searcher open, you can track the references to the searcher while one reference is held by the class you request your searcher from. If you close your searcher, you decrement the reference which is held by the -- I will call it controller -- class and create a new searcher. If no reference to the searcher remains, you close the searcher and it will get garbage collected. If you use this kind of pattern you might have more than one searcher open for a short time, but once the last search client has decremented the reference, the searcher will be closed. You don't have to care whether the searcher is closed or not; you won't get a reference to a closed searcher instance. Solr and the GData Server use this kind of reference tracking for this purpose. Have a look at http://svn.apache.org/viewvc/lucene/java/trunk/contrib/gdata-server/src/java/org/apache/lucene/gdata/utils/ReferenceCounter.java?view=markup best regards Simon On 9/26/06, Frank Kunemann [EMAIL PROTECTED] wrote: Hi all, after I delete some entries from the index, I close the IndexSearcher to ensure that the changes are done. But after this I couldn't figure out a way to tell if the searcher is closed or not. Any ideas? Regards Frank
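A minimal sketch of the reference-counting idea (a simplified stand-in for illustration, not the actual GData ReferenceCounter linked above):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;

    class RefCountedSearcher {
        final IndexSearcher searcher;
        private int refs = 1; // the controller holds one reference itself

        RefCountedSearcher(IndexSearcher searcher) { this.searcher = searcher; }

        synchronized IndexSearcher acquire() { refs++; return searcher; }

        synchronized void release() throws IOException {
            // when the controller has swapped in a new searcher and every client
            // has released the old one, the count hits zero and we really close it
            if (--refs == 0) searcher.close();
        }
    }

Clients never ask "is it closed?"; they only acquire() before searching and release() afterwards, so a closed searcher can never be handed out.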
How to remove duplicate records from result
Hi, I searched the index and found, say, 1000 records, but out of those 1000 records I want to filter duplicate records based on the value of one field. Is there any way except looping through the whole Hits object? Because that won't work when the number of hits is too large... Thanks. Bhavin pandya
spell checker with lucene
Hi, does anybody have an idea for a spell checker in Java? I want to use it with Lucene... but it must also work well for phrases... -Bhavin pandya
Re: searching for the part of a term.
Hi, While I was searching the forum for my problem of searching for a substring, I got a few very good links. http://www.gossamer-threads.com/lists/lucene/java-user/39753?search_string=Bitset%20filter;#39753 http://www.gossamer-threads.com/lists/lucene/java-user/7813?search_string=substring;#7813 http://www.gossamer-threads.com/lists/lucene/java-user/5931?search_string=substring;#5931 In the first, WildcardTermEnum is used. I tried this, but it takes a lot of time in searching. The other solution I found was to create a token stream which splits a token into multiple tokens, and then index those tokens, like: google into google, oogle, ogle. Then while searching, make a prefix query and search. But this seems to create a lot of tokens from one token, resulting in an index many times bigger than if we indexed a single token. Since the overhead in the first is the speed of the system, I think adopting the second method will be better. Is there any other solution for this problem? Am I going in the right direction? It'll be great to see your response... Regards, On 9/23/06, heritrix. lucene [EMAIL PROTECTED] wrote: Hi All, How can I make my search so that if I am looking for the term counting, documents containing accounting also come up? Similarly, if I am looking for the term workload, documents containing work also come up as a search result. A wildcard query seems to work in the first case, but if the index size is very big, it throws a TooManyClauses exception. Is there a way to resolve this issue, apart from indexing n-grams of each term? Regards,
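A sketch of the suffix-token approach using the Lucene 2.0 TokenStream.next() API (the class is illustrative, not a library filter, and it inflates the index exactly as described above):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    class SuffixFilter extends TokenFilter {
        private Token current; // token whose suffixes are being emitted
        private int offset;    // start of the next suffix within that token

        SuffixFilter(TokenStream in) { super(in); }

        public Token next() throws IOException {
            if (current == null || offset >= current.termText().length()) {
                current = input.next();      // advance to the next real token
                if (current == null) return null;
                offset = 0;
            }
            String suffix = current.termText().substring(offset++);
            Token t = new Token(suffix, current.startOffset(), current.endOffset());
            if (offset > 1) t.setPositionIncrement(0); // stack suffixes at one position
            return t;
        }
    }

At search time, a substring like "oogle" then matches via a PrefixQuery ("oogle*") against the suffix-augmented field, since every substring is a prefix of some suffix.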
Re: does anyone know of a 'smart' categorizing text pattern finder?
Look at LingPipe from Alias-i.com. Look at Named Entity extraction and its classifiers. Otis - Original Message From: Vladimir Olenin [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Monday, September 25, 2006 9:49:31 PM Subject: does anyone know of a 'smart' categorizing text pattern finder? Hi, I wonder if anyone here knows if there is a 'smart' text pattern finder, ideally written in Java. The library I'm looking for should be able to 'guess' the category of a particular text on the page, most probably by finding similarities between the bulk of the pages and a set of templates. E.g., many forums are powered by phpBB, which structures 99% of the pages (except for some title pages and user profile pages) in a very similar fashion (the page is broken into blocks, each block is broken into further blocks, etc). By comparing many pages with each other (e.g., from the same domain root: forum.springframework.org) it should be possible to detect common ('template decoration') and page-specific (actual content, like 'user name' and 'posting body') parts. After that it should further be possible, by comparing the 'template decoration' parts to a set of templates, to 'guess' the nature of each 'page specific' block (e.g., 'Vladimir Olenin' in the left side column will be marked as 'name', while whatever is adjacent to this column is the post body). So, I wonder if anyone knows of a package capable of such things. The primary goal though is simpler: to be able to parse out just posters' names from message boards. Though sometimes the 'block category' can be derived from the CSS class name of the tags around the text, it's very often not the case. Might Nutch have similar functionality built into its crawler? Thanks. Vlad
Re: spell checker with lucene
The Lucene-based one is described on the Wiki. Another one is from LingPipe. It may not be free, depending on what you do with it. Otis - Original Message From: Bhavin Pandya [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Tuesday, September 26, 2006 8:50:14 AM Subject: spell checker with lucene Hi, does anybody have an idea for a spell checker in Java? I want to use it with Lucene... but it must also work well for phrases... -Bhavin pandya
Re: Caused by: java.io.IOException: The handle is invalid
Van Nguyen wrote: I only get this error when using the server version of jvm.dll with my JBoss app server… but when I use the client version of jvm.dll, the same index builds just fine. This is an odd error. Which OS are you running on? And, what kind of filesystem is the index directory on? It's surprising that client vs server JRE causes this. Is the exception easily reproduced or is it intermittent? Mike
Re: How to remove duplicate records from result
You could do it with a custom HitCollector, no? Otis - Original Message From: Bhavin Pandya [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Tuesday, September 26, 2006 8:43:56 AM Subject: How to remove duplicate records from result Hi, I searched the index and found, say, 1000 records, but out of those 1000 records I want to filter duplicate records based on the value of one field. Is there any way except looping through the whole Hits object? Because that won't work when the number of hits is too large... Thanks. Bhavin pandya
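A sketch of the HitCollector route, deduplicating on a single field via the FieldCache (the reader/searcher/query variables and the field name "dupField" are assumptions; Java 1.4-era collections):

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.HitCollector;

    // one cached string per document for the dedupe field
    final String[] keys = FieldCache.DEFAULT.getStrings(reader, "dupField");
    final Set seen = new HashSet();
    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            if (seen.add(keys[doc])) {
                // first document carrying this key: keep it (e.g. push into a queue)
            }
        }
    });

This visits each hit exactly once, without paging a Hits object through all 1000 results.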
Ordered positions
Hi, In the javadoc, IndexReader.termPositions() maps to the definition: Term => <docNum, freq, pos_1, pos_2, ..., pos_freq-1>* where the returned enumeration is ordered by doc number. Are the positions ordered within each doc or not? Thanks Olivier
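Yes: within each document the positions come back in ascending order (they are delta-encoded in the index). A sketch of iterating them, with an assumed reader variable and illustrative field/term names:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositions;

    TermPositions tp = reader.termPositions(new Term("content", "foo"));
    while (tp.next()) {                  // documents in doc-number order
        int doc = tp.doc();
        for (int i = 0; i < tp.freq(); i++) {
            int pos = tp.nextPosition(); // ascending within this document
        }
    }
    tp.close();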
Re: Advice on Custom Sorting
Thanks again Erick for taking the time. I agree that the CachingWrapperFilter as described under using a custom filter in LIA is probably my best bet. I wanted to check whether anything I wasn't aware of had been added in Lucene releases since the book was written. Cheers again. --- Erick Erickson [EMAIL PROTECTED] wrote: You were probably right. See below. On 9/25/06, Paul Lynch [EMAIL PROTECTED] wrote: Thanks for the quick response Erick. index the documents in your preferred list with a field and index your non-preferred docs with a field subid? I considered this approach and dismissed it due to the actual list of preferred ids changing so frequently (every 10 mins...ish) but maybe I was a little hasty in doing so. I will investigate the overhead in updating all docs in the index each time my list refreshes. I had assumed it was too prohibitive but I know what they say about assumptions :) Lots of overhead. There's really no capability of updating a doc in place. This has been on several people's wish-lists. You'd have to delete every doc that you wanted to change and re-add it. I don't know how many documents this would be; if just a few it'd be OK, but if many... I was assuming (and I *do* know what they say about assumptions <g>) that you were just adding to your preferred doc list every few minutes, not changing existing documents. It really does sound like you want a filter. I was pleasantly surprised by how very quickly filters are built. You could use a CachingWrapperFilter to have the filter kept around automatically (I guess you'd only have one per index update) to minimize your overhead for building filters, and perhaps warm up your cache by firing a canned query at your searcher when you re-open your IndexReader after an index update. I think you'd have to do the two-query thing in this case. If you wanted to really get exotic, you could build your filter when you created your index and store it in a *very special document* and just read it in the first time you needed it. Although I've never used it, I guess you can store binary data. From the Javadoc: Field(String name, byte[] value, Field.Store store) -- Create a stored field with binary value. The only thing here is that the filters (probably wrapped in a ConstantScoreQuery) lose relevance, but since you're sorting one of several ways, that probably doesn't matter. Best Erick Should I be able to make this workable, the beauty of this solution would be that I would actually only need to query once. If I had a field which indicates whether it is a preferred doc or not, all I would have to do is sort across the two fields. Thanks again Erick. Any other suggestions are most welcome. Regards, Paul --- Erick Erickson [EMAIL PROTECTED] wrote: OK, a really off-the-top-of-my-head response, but what the heck. I'm not sure you need to worry about filters. Would it work for you to index the documents in your preferred list with a field (called, at the limit of my creativity, preferredsubid <g>) and index your non-preferred docs with a field subid? You'd still have to fire two queries, one on subid (to pick up the ones in your non-preferred list) and one on preferredsubid.
Since there's no requirement that all docs have the same fields, your preferred docs could have ONLY the preferredsubid field and your non-preferred docs ONLY the subid field. That way you wouldn't have to worry about picking the docs up twice. Merging should be simple then: just iterate over however many hits you want in your preferredHits object, then tack on however many you want from your nonPreferredHits object. All the code for the two queries would be identical, the only difference being whether you specify subid or preferredsubid. I can imagine several variations on this scenario, but they depend on your problem space. Whether this is the best or not, I leave as an exercise for the reader. Best Erick On 9/25/06, Paul Lynch [EMAIL PROTECTED] wrote: Hi All, I have an index containing documents which all have a field called SubId which holds the ID of the Subscriber that submitted the data. This field is STORED and UN_TOKENIZED. When I am querying the index, the user can choose a number of different ways to sort the Hits. The problem is that I have a list of SubIds that should appear at the top of the results list regardless of how the
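A sketch of the cached-filter approach under discussion (the field name, term value, searcher, and sort are assumptions):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    // Built once and reused; CachingWrapperFilter caches its BitSet per
    // IndexReader, so it is rebuilt automatically after the reader reopens.
    Filter preferredFilter = new CachingWrapperFilter(
        new QueryFilter(new TermQuery(new Term("preferredsubid", "42"))));

    Hits preferredHits = searcher.search(userQuery, preferredFilter, sort);
    // a second query (without the filter, or with a complementary one) picks up
    // the non-preferred docs; concatenate the two lists for display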
Re: Where to find drill-down examples (source code)
Is there a link to a zip file where I can get the entire package of source files (version 2, please)? I know I am able to view them in the Source Repository (http://svn.apache.org/viewvc/lucene/java/trunk/), but I do not really feel like going through each of those to download them all. I am looking for a one-stop shop here. Miles Barr-3 wrote: Martin Braun wrote: I want to realize a drill-down function, aka narrow search, aka refine search. I want to have something like: Refine by Date: * 1990-2000 (30 Docs) * 2001-2003 (200 Docs) * 2004-2006 (10 Docs) But not only date ranges, also other categories. What I have found in the list archives so far is that I have to use Filters for my search. Does anybody know where to find some source code to get an idea of how to implement this? I think that's a useful property for a search engine, so are there any contributions for Lucene for that? If you want to do a refined search I'd put the original query in a QueryFilter, which filters on the new search. http://lucene.apache.org/java/docs/api/org/apache/lucene/search/QueryFilter.html e.g.

    Query original = ... // saved from the last time the search was executed
    QueryFilter filter = new QueryFilter(original);
    QueryParser parser = ...
    Searcher searcher = ...
    String userQuery;
    Query query = parser.parse(userQuery);
    Hits hits = searcher.search(query, filter);

Fill in the blanks with however you normally get your QueryParser and IndexSearcher. You could store the old query on the session, or somewhere else. Then the QueryFilter will ensure you're doing a refinement, but won't affect the scoring in the new search. Alternatively, since you appear to only want to refine on dates and categories, you might want to put them in filters so they don't affect the score, and leave the query as is. In which case you can use a RangeQuery for the dates, and wrap a TermQuery in a QueryFilter to handle the categories. If you need multiple filters you can use the ChainedFilter class. Miles
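A sketch of the date-plus-category variant Miles describes (field names and values are assumptions; ChainedFilter is in Lucene's misc contrib, so the package path may differ in your distribution):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.misc.ChainedFilter;
    import org.apache.lucene.search.*;

    Filter dates = new RangeFilter("date", "20040101", "20061231", true, true);
    Filter category = new QueryFilter(new TermQuery(new Term("category", "books")));
    // both filters must pass; neither contributes to the score
    Filter both = new ChainedFilter(new Filter[] { dates, category }, ChainedFilter.AND);
    Hits refined = searcher.search(query, both);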
Re: spell checker with lucene
On Sep 26, 2006, at 8:50 AM, Bhavin Pandya wrote: Hi, does anybody have an idea for a spell checker in Java? I want to use it with Lucene... but it must also work well for phrases... -Bhavin pandya When I googled "java spell check open source" I found http://jazzy.sourceforge.net/ I have looked at it. Are you thinking of doing a spell check on the queries people type? It might be better simply to check each word and see if it is found in the index. That will be a lot less work than adapting the spell checker to Lucene. Bill Taylor
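A sketch of Bill's "check the index" idea (the reader variable and field name are assumptions):

    import org.apache.lucene.index.Term;

    // a query word that never occurs in the indexed text is a likely typo
    boolean known = reader.docFreq(new Term("content", word)) > 0;
    if (!known) {
        // fall back to suggestions, e.g. from the contrib spell checker
    }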
cache persistent Hits
Hi, Lucene itself has a volatile caching mechanism provided by a weak HashMap. Is there a possibility to serialize the Hits object? I am thinking of a HashMap that, for each query, caches the first 100 results. Is it possible to implement such a feature, or is there such an extension? My problem is that searching in my application, against an index of 212MB, takes too much time, despite my setting the boolean operator from OR to AND. I am happy about every suggestion. Greetings Gaston.
Re: Where to find drill-down examples (source code)
Either you grab the next best svn client and check out the 2.0 branch, or you just download the source dist from a mirror. Use this one: http://mirrorspace.org/apache/lucene/java/ best regards simon On 9/26/06, djd0383 [EMAIL PROTECTED] wrote: Is there a link to a zip file where I can get the entire package of source files (version 2, please)? [...] I am looking for a one-stop shop here.
Re: searching for the part of a term.
: Since the overhead in the first is the speed of the system, I think adopting : the second method will be better. : : Is there any other solution for this problem?? Am I going in the right : direction?? You're definitely on the right path -- those are the two big solutions I can think of. Which approach you should take really depends on the nature of your data, what your performance concerns are, and how much development time you have. Here's another good thread you may want to check out... http://www.nabble.com/I-just-don%27t-get-wildcards-at-all.-tf1412243.html#a3804223 -Hoss
Re: cache persistent Hits
Well, my index is over 1.4G, and others are reporting very large indexes in the 10s of gigabytes. So I suspect your index size isn't the issue. I'd be very, very, very surprised if it was. Three things spring immediately to mind. First, opening an IndexSearcher is a slow operation. Are you opening a new IndexSearcher for each query? If so, don't <g>. You can re-use the same searcher across threads without fear and you should *definitely* keep it open between queries. Second, your query could just be very, very interesting. It would be more helpful if you posted an example of the code where you take your timings (including opening the IndexSearcher). Third, if you're using a Hits object to iterate over many documents, be aware that it re-executes the query every hundred results or so. You want to use one of the HitCollector/TopDocs/TopDocsCollector classes if you are iterating over all the returned documents. And you really *don't* want to do an IndexReader.doc(doc#) or Searcher.doc(doc#) on every document. If none of this helps, please post some code fragments and I'm sure others will chime in. Best Erick On 9/26/06, Gaston [EMAIL PROTECTED] wrote: Hi, Lucene itself has a volatile caching mechanism provided by a weak HashMap. Is there a possibility to serialize the Hits object? I am thinking of a HashMap that, for each query, caches the first 100 results. Is it possible to implement such a feature, or is there such an extension? My problem is that searching in my application, against an index of 212MB, takes too much time, despite my setting the boolean operator from OR to AND. I am happy about every suggestion. Greetings Gaston.
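A sketch of the page-at-a-time alternative to iterating Hits (the searcher, query, and startPoint variables are assumptions; in Lucene 2.0, Searcher.search(Query, Filter, int) returns a TopDocs):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    // ask for just enough results to render one page; deep pages never re-run the query
    TopDocs top = searcher.search(query, null, startPoint + 10);
    ScoreDoc[] docs = top.scoreDocs;
    for (int i = startPoint; i < Math.min(startPoint + 10, docs.length); i++) {
        Document d = searcher.doc(docs[i].doc); // fetch stored fields for this page only
    }
    int totalHits = top.totalHits; // still available for the results count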
Re[3]: how to enhance speed of sorted search
: I am thinking should this be faster The ConstantScoreQuery wrapped around the QueryFilter might in fact be faster than the raw query -- have you tried it to see? You might be able to shave a little more speed off by accessing the bits from the Filter directly and iterating over them yourself to check the FieldCache and build up your sorted list of the first N -- I think that would save you one method call per match (the score method of ConstantScoreQuery). At some point you just have to wonder if it's fast enough? How long does a typical sorted query take for you right now? How many documents are in your index? How many matches do you typically have? -Hoss
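A sketch of the "iterate the bits yourself" idea (the reader/query variables and field name are assumptions; selecting the top N from the (doc, key) pairs is left to a bounded priority queue):

    import java.util.BitSet;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.QueryFilter;

    BitSet bits = new QueryFilter(query).bits(reader); // one set bit per matching doc
    int[] keys = FieldCache.DEFAULT.getInts(reader, "myIntField");
    for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
        int key = keys[doc]; // no Scorer, no score() call -- just the sort key
        // push (doc, key) into a size-N priority queue to keep the top N
    }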
Re: Very high fieldNorm for a field resulting in bad results
: The symptom: : Very high fieldNorm for field A. (explain output pasted below) The boost I am : applying to the troublesome field is 3.5; the max boost applied per doc is : 1.8. : Given that information, the very high fieldNorm is very surprising to me. : Based on what I read, fieldNorm = 1 / sqrt(sum of terms), possibly : multiplied by field boost values. The value of the field norm for any field named A is typically the lengthNorm of the field, times the document boost, times the field boost for *each* Field instance added to the document with the name A. (lengthNorm is by default 1/sqrt(num of terms).) So in your situation... : for (Collection of values) { : Field thisField = new Field(fieldName, value, fieldConfig.STORED, : fieldConfig.INDEXED); : thisField.setBoost(fieldConfig); : doc.add(thisField); ...the fieldNorm for A is going to be fieldConfig multiplied in once per value (i.e. fieldConfig^values.size()), times any document boost you didn't mention using, times the lengthNorm. : which should basically lead to the values being appended, : Am I making a mistake in the way I am adding fields? The way you are adding fields is the proper way to deal with multi-valued fields in my opinion, but it may be leading to more boost than you intended, in which case boosting only the first Field may be the way to go. Another aspect of this to keep in mind is that since fieldNorms are stored as a single-byte encoded float, some precision is lost ... the byte encoding for the norms is targeted at smaller values, so with really big norms you might find the problem exacerbated by the rounding. Play around with your boost values -- you can use indexReader.norms("A") along with Similarity.decodeNorm to see what norm values your various documents are getting as you tweak your numbers. -Hoss
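A sketch of the norm-inspection loop Hoss suggests (the reader variable is an assumption; field name A is from the thread):

    import org.apache.lucene.search.Similarity;

    byte[] norms = reader.norms("A");
    for (int doc = 0; doc < norms.length; doc++) {
        // decoded value = lengthNorm * docBoost * product of field boosts,
        // rounded through the 1-byte encoding mentioned above
        float norm = Similarity.decodeNorm(norms[doc]);
    }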
Re: cache persistent Hits
hi, first thank you for the fast reply. I use a MultiSearcher that opens 3 indexes, so this makes the whole operation surely slower, but 20 seconds for 5260 results out of a 212MB index is much too slow. Another reason can of course be my ISP. Here is my code:

    IndexSearcher[] searchers = new IndexSearcher[3];
    String path = "/home/sn/public_html/";
    searchers[0] = new IndexSearcher(path + "index1");
    searchers[1] = new IndexSearcher(path + "index2");
    searchers[2] = new IndexSearcher(path + "index3");
    MultiSearcher searcher = new MultiSearcher(searchers);
    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
    Query query = parser.parse("urlName:" + userInput + " OR content:" + userInput);
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
    }
    // Print only 10 results per page
    for (int i = startPoint; i < startPoint + 10; i++) {
        Document doc = hits.doc(i);
        out.println(escapeHTML(doc.get("description")) + "<p>");
        out.println("<a href=" + doc.get("url") + ">" + doc.get("url").substring(7) + "</a>");
        out.println("<p><p><p>");
    }

Perhaps somebody sees the reason why it is so slow. Thank you in advance Greetings Gaston Erick Erickson schrieb: Well, my index is over 1.4G, and others are reporting very large indexes in the 10s of gigabytes. So I suspect your index size isn't the issue. [...]
Lucene In Action Book vs Lucene 2.0
Hi, I bought the Lucene in Action book more than a year ago, and was using Lucene 1.x during that time. Now I have a new project with Lucene, and Lucene is now 2.0. Many APIs seem to have changed. I would like to ask the experts here: what are the important or substantial changes from Lucene 1.x to 2.0? Which parts of the LIA book are still usable and which are not? Are there any particular things that a new Lucene 2.0 user who has only used the 1.x version should pay attention to? Thanks.
Re: cache persistent Hits
See below. On 9/26/06, Gaston [EMAIL PROTECTED] wrote: hi, first thank you for the fast reply. I use a MultiSearcher that opens 3 indexes, so this makes the whole operation surely slower, but 20 seconds for 5260 results out of a 212MB index is much too slow. Another reason can of course be my ISP. Here is my code:

    IndexSearcher[] searchers = new IndexSearcher[3];
    String path = "/home/sn/public_html/";
    searchers[0] = new IndexSearcher(path + "index1");
    searchers[1] = new IndexSearcher(path + "index2");
    searchers[2] = new IndexSearcher(path + "index3");
    MultiSearcher searcher = new MultiSearcher(searchers);

Above you've opened the searcher for each search, exactly as I feared. This is a major hit. Don't do this, but keep the searchers open between calls. You can demonstrate this to yourself by returning time intervals in your HTML page. Take one timestamp right here, one after a new dummy query that you make up and hard-code, and one after the real query you already have below. Return them all in your HTML page and take a look. I think you'll see that the first query takes a while, and the second is very fast. And don't iterate over all the hits (more below).

    QueryParser parser = new QueryParser("content", new StandardAnalyzer());
    parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
    Query query = parser.parse("urlName:" + userInput + " OR content:" + userInput);
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
    }

What is the purpose of the iteration above? It does nothing except waste time. I'd just remove it (unless there's something else you're doing here that you left out). If you're trying to get to the startPoint below, well, there's no reason to iterate above; just go directly to the loop below. For 5000 hits, you're repeating the search 50 times or so, as has been discussed in these archives repeatedly. See my previous mail.

    // Print only 10 results per page
    for (int i = startPoint; i < startPoint + 10; i++) {
        Document doc = hits.doc(i);
        out.println(escapeHTML(doc.get("description")) + "<p>");
        out.println("<a href=" + doc.get("url") + ">" + doc.get("url").substring(7) + "</a>");
        out.println("<p><p><p>");
    }

Perhaps somebody sees the reason why it is so slow. Thank you in advance Greetings Gaston I'm assuming that your ISP comment is just where you're getting your page from, and that your searchers and indexes are at least on the same network and NOT separated by the web, as that would be slow and hard to fix. To get a sense of where you're really spending your time, I'd actually get the system time at various points in the process and send the *times* back in your HTML page. That'll give you a much better sense of where you're actually spending time. You can't really tell anything by measuring how long it takes to get your HTML page back; you've *got* to measure at discrete points in the code and return those. 5,000+ results should not be taking 20 seconds. I strongly suspect that the fact that you're opening your searchers every time and uselessly iterating through all the hits is the culprit. If I remember correctly, and you have 5,000 documents, you're executing the query about 50 times when you iterate through all the hits. Under the covers, Hits is optimized for about 100 results. As you iterate through, each next 100 re-executes the query. You could search the mail archive for this topic, maybe "hits slow" or some such, for greater explication. Hope this helps Erick Erick Erickson schrieb: Well, my index is over 1.4G, and others are reporting very large indexes in the 10s of gigabytes. [...]
spell checker
Does anyone have sample code on how to build a dictionary? I found this article online, but it uses version 1.4.3 and doesn't seem to work on 2.0.0: http://today.java.net/pub/a/today/2005/08/09/didyoumean.html?page=1 Here's the code I have:

    indexReader = IndexReader.open(originalIndexDirectory);
    Dictionary dictionary = new LuceneDictionary(indexReader, "experience_desired");
    SpellChecker spellChckr = new SpellChecker(spellIndexDirectory);
    spellChckr.indexDictionary(dictionary);

I'm getting a null pointer exception when I call indexDictionary(). Here's how I index the field experience_desired:

    doc.add(new Field("experience_desired", value, Field.Store.NO, Field.Index.TOKENIZED));

Is there another way I should do it so there is a way to build a dictionary on that field? Thanks Chris Salem 440.946.5214 x5458 [EMAIL PROTECTED]
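For comparison, a sketch of a working 2.0-style setup (the directory paths are assumptions; a common cause of an NPE here is a null or unopened reader or spell-index Directory):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.spell.Dictionary;
    import org.apache.lucene.search.spell.LuceneDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    Directory spellIndexDirectory = FSDirectory.getDirectory("/path/to/spellindex", true);
    IndexReader indexReader = IndexReader.open("/path/to/index");
    try {
        Dictionary dictionary = new LuceneDictionary(indexReader, "experience_desired");
        SpellChecker spellChecker = new SpellChecker(spellIndexDirectory);
        spellChecker.indexDictionary(dictionary);      // build the spell index
        String[] suggestions = spellChecker.suggestSimilar("experiance", 5);
    } finally {
        indexReader.close();
    }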
Re: cache persistent Hits
Hi Erick, the problem was this piece of code, which I don't need anymore:

    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
    }

Now it is very fast. Thank you very much for your detailed email. Here is my application, which is still in the development phase: http://www.suchste.de Greetings Gaston P.S. The search for 'web' delivers over 5000 hits... Erick Erickson schrieb: See below. [...]
term OR term OR term OR .... query question
Hi. I have a question regarding the Lucene scoring algorithm. Provided I have the query "a OR b OR c OR d OR e OR f" and two documents, doc1 = "a b c d" and doc2 = "d e", will doc1 score higher than doc2? In other words, does Lucene take into account the number of terms matched in the document in the case of an OR query? Given that I don't know the algorithms behind Lucene, how does OR query time depend on the number of searched terms? Does it grow linearly, exponentially? How does AND query time depend on the number of searched terms? (It should decrease, right?) Thanks. Vlad
Re: Re[2]: how to enhance speed of sorted search
On 9/26/06, Chris Hostetter [EMAIL PROTECTED] wrote: if you are seeing slow performance from sorted searches, the time spent scoring the results isn't the biggest contributor to how long the search takes -- it tends to be negligible for most queries. I've many times wished for a visiting-score mechanism of some kind. Turn it off and save CPU, remove floating points, or even hide a global sort order in the norms.
how to get results without getting total number of found documents?
Hi. I couldn't find the answer to this question in the mailing list archive. In case I missed it, please let me know the keyword phrase I should be looking for, if not a direct link. All the Lucene-powered implementations I saw (well, primarily those utilizing Solr) return an exact count of the number of documents found. It means that the query is resolved across the whole data set in a precise fashion. If the number of searched documents is huge (e.g., 1 billion), this should present quite a problem. I wonder if that's the default behaviour of Lucene or rather of the frameworks that utilize it? Is it possible to: - get the top 1000 results WITHOUT executing the query across the whole data set - in other words, can Lucene: - chunk out the top X results by an 'approximate' fast search, which will return an _approximate_ total number of found documents, similar to Google's total pages found count - and perform a more accurate search within that chunk Is such functionality built in or does it have to be customized? If it's built in, what algorithms are used to 'chunk out' the results and get the approximate doc count? What classes should I look at? Thanks! Vlad PS: it's pretty much the functionality Google has - you can't get more than 1000 matches per query (meaning, you can get even '10M' documents found, but if you try to browse beyond 1000 results, you'll get an error page).
Re: Re[2]: how to enhance speed of sorted search
Paul's Matcher in Jira will almost enable this; indirectly, but possible. - Original Message From: karl wettin [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Tuesday, 26 September, 2006 11:30:24 PM Subject: Re: Re[2]: how to enhance speed of sorted search On 9/26/06, Chris Hostetter [EMAIL PROTECTED] wrote: if you are seeing slow performance from sorted searches, the time spent scoring the results isn't the biggest contributor to how long the search takes -- it tends to be negligible for most queries. I've many times wished for a visiting-score mechanism of some kind. Turn it off and save CPU, remove floating points, or even hide a global sort order in the norms.
Re: how to get results without getting total number of found documents?
- get the top 1000 results WITHOUT executing the query across the whole data set (Apologies if this is telling you something you are already fully aware of.) Counting matches doesn't involve scanning the text of all the docs, so it may be less expensive than you think for a single index. Lucene very quickly looks up and ranks only the docs containing your search terms, so a total match count is not an expensive by-product of this operation - see a description of inverted indexes for more details: http://en.wikipedia.org/wiki/Inverted_index If you're aware of all that and considering larger-scale problems (billions of docs) where multiple machines/indexes must be queried in parallel, things are more complex. The cost of combining result scores from multiple machines is typically why you can't page beyond 1000 results. Some of these large distributed architectures will divide content into popular/recent content and older/less popular content. Approximations for the total number of matching docs are calculated based on queries executed solely on the subset of popular stuff. Only queries with insufficient matches in popular content will resort to querying the older stuff. Cheers Mark Vladimir Olenin wrote: Hi. I couldn't find the answer to this question in the mailing list archive. [...]
Re: cache persistent Hits
Glad I could help. I don't read a word of German, but even I could see the 227 milliseconds at the bottom <g>. Glad things are working for you. Erick On 9/26/06, Gaston [EMAIL PROTECTED] wrote: Hi Erick, the problem was this piece of code, which I don't need anymore: for (int i = 0; i < hits.length(); i++) { Document doc = hits.doc(i); } Now it is very fast. Thank you very much for your detailed email. Here is my application, which is still in the development phase: http://www.suchste.de Greetings Gaston P.S. The search for 'web' delivers over 5000 hits... [...]
RE: how to get results without getting total number of found documents?
Thanks, Mark, that clears things up a bit. No need to apologise - I am quite a novice with Lucene. To explain my concern a bit, assume that your inverted index is queried with an 'or' query for the most 'common' terms (ie, after excluding such denominators as 'a', 'the', etc). Let's say you have the following terms: - 'work': occurs in 200M documents - 'java': occurs in 100M documents - '.net': occurs in 100M documents Now, if I'm doing the query 'work OR java OR .net', the total result set should be somewhere between 200M and 400M, right? But to get the exact number you'll actually need to compute the union of ALL the document IDs, which means you'd have to loop through 400M ids at least. For more complex queries the cost should be higher. The sorting step should be quite expensive for the 'whole' dataset as well. The intersect should be cheaper because each step eliminates some number of documents. In the implementations I saw/did in the past, the ability (or, to be more correct, inability) to create this kind of 'approximation' and chunk out the 'most significant' results was the main limiting factor of all the algorithms. Thanks! Vlad PS: by 'computationally expensive', I mean 'scalability' as well - all these operations are very CPU intensive, so if one query returns within 1-2 seconds, during which time it takes up 100% of CPU time, it means that for 20 concurrent users the response time for the 'most unlucky' one would be somewhere between 20 and 40 seconds. -Original Message- From: markharw00d Sent: Tuesday, September 26, 2006 6:35 PM To: java-user@lucene.apache.org Subject: Re: how to get results without getting total number of found documents? [...]
Re: how to get results without getting total number of found documents?
Vlad, Please check the published papers on sampling inverted indexes and multi-level caching - this is most probably what Google and other major search engines use. You can see a simple implementation of this principle in Nutch - the index is sorted in decreasing order by a PageRank-like score (the logic for this is in IndexSorter.java), and then when running a query we only collect the top-N results and extrapolate the total number over the whole collection, assuming a certain model of term distributions (LuceneQueryOptimizer.LimitedCollector). -- Best regards, Andrzej Bialecki http://www.sigram.com Contact: info at sigram dot com
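The early-termination half of that idea can be sketched with a plain HitCollector in Lucene 2.0. This is only an illustration of what Nutch's LimitedCollector does, not its actual code, and it assumes the index has been pre-sorted by a quality score (as IndexSorter arranges) so that the first documents collected are also the best ones:

    // Hypothetical collector: stop after 'max' hits by throwing an
    // unchecked exception out of collect(), which aborts the search.
    public class CappedCollector extends HitCollector {
        public static class LimitReached extends RuntimeException {}
        private final int max;
        private int count = 0;
        public CappedCollector(int max) { this.max = max; }
        public void collect(int doc, float score) {
            // a real implementation would record (doc, score) here,
            // e.g. in a priority queue of the top hits
            if (++count >= max) throw new LimitReached();
        }
        public int getCount() { return count; }
    }

    // Usage: collect at most 1000 hits, then stop scoring entirely.
    CappedCollector collector = new CappedCollector(1000);
    try {
        searcher.search(query, collector);
    } catch (CappedCollector.LimitReached e) {
        // capped early; an approximate grand total can then be
        // extrapolated from how far into the (quality-sorted) doc-id
        // space the collector got before hitting the cap
    }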
Re: spell checker
I've added a FAQ entry that may help you with this: 'How do I get code written for Lucene 1.4.x to work with Lucene 2.x?' http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-d09fdfc8a6335eab4e3f3dc8ac41a40a3666318e
: Date: Tue, 26 Sep 2006 20:56:57 - : From: Chris Salem [EMAIL PROTECTED] : Reply-To: java-user@lucene.apache.org, Chris Salem [EMAIL PROTECTED] : To: java-user@lucene.apache.org : Subject: spell checker : : Does anyone have sample code on how to build a dictionary? : : I found this article online, but it uses version 1.4.3 and doesn't seem to work on 2.0.0: http://today.java.net/pub/a/today/2005/08/09/didyoumean.html?page=1 : : Here's the code I have: : : indexReader = IndexReader.open(originalIndexDirectory); : Dictionary dictionary = new LuceneDictionary(indexReader, "experience_desired"); : SpellChecker spellChckr = new SpellChecker(spellIndexDirectory); : spellChckr.indexDictionary(dictionary); : I'm getting a NullPointerException when I call indexDictionary(). : Here's how I index the field experience_desired: : doc.add(new Field("experience_desired", value, Field.Store.NO, Field.Index.TOKENIZED)); : Is there another way I should do it so that a dictionary can be built on that field? : : Thanks : : Chris Salem : 440.946.5214 x5458 : [EMAIL PROTECTED] : : -Hoss
Multiple Terms, Delete From Index
Hi All, I need to delete from the index where 2 terms match, rather than just one term. For example:

IndexReader reader = IndexReader.open(dir);
Term[] terms = new Term[2];
terms[0] = new Term("city", "city1");
terms[1] = new Term("state", "state1");
reader.delete(terms);
reader.close();

Any suggestions? Thanks in advance, Josh
Re: Very high fieldNorm for a field resulting in bad results
Thanks a lot Chris for the detailed, patient response. "The value of the field norm for any field named A is typically the lengthNorm of the field, times the document boost, times the field boost for *each* Field instance added to the document with the name A. (lengthNorm is by default 1/sqrt(number of terms))" That explains the very high value for the fieldNorm. The boost value became boost_value ^ (number of values in the field). A couple more questions: 1. Can I do away with index-time boosting for fields and tweak query-time boosting for them instead? I understand that doc-level boosting is very useful while indexing. But for fields, both the index-time boost and the query-time boost are factors multiplied into the score, so would it be safe to say that I can replace the index-time boost with query-time boosting? That would give me a lot of freedom to test different values without re-indexing, which takes me about 6 hours. 2. When searching through the archive I read a post by you saying it's possible to give exact matches much higher weight by indexing START and END tokens. From http://www.nabble.com/What-are-norms--tf1919250.html#a5335856 : "it is possible to score exact matches on (tokenized) fields very high without using lengthNorm by indexing START and END tokens for the field as well, and then including them in your sloppy phrase queries -- the tighter match will score highest." Can you please elaborate on this? Thanks a ton for the response, mekin
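For what it's worth, the START/END trick looks roughly like this in Lucene 2.0. The sentinel tokens 'xstartx'/'xendx', the field name 'title', and the slop value are made-up illustration choices, and the field is assumed to be analyzed so the sentinels survive as tokens (e.g. with WhitespaceAnalyzer):

    // Indexing: wrap the field value in sentinel tokens.
    doc.add(new Field("title", "xstartx " + value + " xendx",
                      Field.Store.NO, Field.Index.TOKENIZED));

    // Searching: a sloppy phrase that includes both sentinels. Lucene
    // scores sloppy phrase matches higher the tighter they are, and an
    // exact match is the tightest possible phrase, so a field holding
    // only the query terms outranks a longer field that merely
    // contains them somewhere.
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("title", "xstartx"));
    pq.add(new Term("title", "java"));
    pq.add(new Term("title", "developer"));
    pq.add(new Term("title", "xendx"));
    pq.setSlop(50); // large enough that long fields still match, just lower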
Re: Multiple Terms, Delete From Index
Heh, I have to try the obvious - two reader.delete(term) calls? Otis
- Original Message From: Josh Joy [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Tuesday, September 26, 2006 10:04:13 PM Subject: Multiple Terms, Delete From Index Hi All, I need to delete from the index where 2 terms match, rather than just one term. For example: IndexReader reader = IndexReader.open(dir); Term[] terms = new Term[2]; terms[0] = new Term("city", "city1"); terms[1] = new Term("state", "state1"); reader.delete(terms); reader.close(); Any suggestions? Thanks in advance, Josh
Re: spell checker
The code works with Lucene 2.0 - I've used it. However, it did change slightly. If you look in JIRA you'll find some comments about it. If I recall correctly, some changes I made to the LuceneDictionary(?) class now require the index directory to exist, I think. Otis
- Original Message From: Chris Salem [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Tuesday, September 26, 2006 4:56:57 PM Subject: spell checker Does anyone have sample code on how to build a dictionary? I found this article online, but it uses version 1.4.3 and doesn't seem to work on 2.0.0: http://today.java.net/pub/a/today/2005/08/09/didyoumean.html?page=1 Here's the code I have: indexReader = IndexReader.open(originalIndexDirectory); Dictionary dictionary = new LuceneDictionary(indexReader, "experience_desired"); SpellChecker spellChckr = new SpellChecker(spellIndexDirectory); spellChckr.indexDictionary(dictionary); I'm getting a NullPointerException when I call indexDictionary(). Here's how I index the field experience_desired: doc.add(new Field("experience_desired", value, Field.Store.NO, Field.Index.TOKENIZED)); Is there another way I should do it so that a dictionary can be built on that field? Thanks Chris Salem 440.946.5214 x5458 [EMAIL PROTECTED]
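If the culprit really is a missing or empty spell index, as Otis suggests, one way to rule that out is to create an empty index in the spell directory before handing it to SpellChecker. A sketch against the Lucene 2.0 contrib API - the paths are placeholders, and the up-front IndexWriter is just one way to guarantee the index exists:

    // Placeholders - point these at your real locations.
    Directory originalIndexDirectory = FSDirectory.getDirectory("/path/to/index", false);
    Directory spellIndexDirectory = FSDirectory.getDirectory("/path/to/spellindex", true);

    // Create an empty index so SpellChecker finds something to open.
    new IndexWriter(spellIndexDirectory, new StandardAnalyzer(), true).close();

    IndexReader indexReader = IndexReader.open(originalIndexDirectory);
    Dictionary dictionary = new LuceneDictionary(indexReader, "experience_desired");
    SpellChecker spellChecker = new SpellChecker(spellIndexDirectory);
    spellChecker.indexDictionary(dictionary);
    indexReader.close();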
Re: Lucene In Action Book vs Lucene 2.0
Hi, I think you'll find most of the book is still useful (but then again, I'm the co-author, so maybe I'm not 100% objective). One area where the API changed is Fields: they are now constructed differently, so the code in the book won't match the current API. We have the LIA code working under Lucene 2.0, but we'll have to wait and publish it along with LIA2. It IS coming! :) Otis
- Original Message From: KEGan [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Tuesday, September 26, 2006 3:32:06 PM Subject: Lucene In Action Book vs Lucene 2.0 Hi, I have owned the Lucene in Action book for more than a year now and was using Lucene 1.x during that time. Now I have a new project with Lucene, and Lucene is now 2.0. Many APIs seem to have changed. I would like to ask the experts here: what are the important or substantial changes from Lucene 1.x to 2.0? Which parts of the LIA book are still usable, and which are not? Are there any particular things that a new Lucene 2.0 user who has only used the 1.x version should pay attention to? Thanks.
Re: Lucene In Action Book vs Lucene 2.0
Otis, What about the internals of Lucene? Are there any major changes there? LIA is such a great book. Any date when LIA2 is coming? I definitely must get it :) On 9/27/06, Otis Gospodnetic [EMAIL PROTECTED] wrote:
Hi, I think you'll find most of the book is still useful (but then again, I'm the co-author, so maybe I'm not 100% objective). One area where the API changed is Fields: they are now constructed differently, so the code in the book won't match the current API. We have the LIA code working under Lucene 2.0, but we'll have to wait and publish it along with LIA2. It IS coming! :) Otis
- Original Message From: KEGan [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Tuesday, September 26, 2006 3:32:06 PM Subject: Lucene In Action Book vs Lucene 2.0 Hi, I have owned the Lucene in Action book for more than a year now and was using Lucene 1.x during that time. Now I have a new project with Lucene, and Lucene is now 2.0. Many APIs seem to have changed. I would like to ask the experts here: what are the important or substantial changes from Lucene 1.x to 2.0? Which parts of the LIA book are still usable, and which are not? Are there any particular things that a new Lucene 2.0 user who has only used the 1.x version should pay attention to? Thanks.
Re: Multiple Terms, Delete From Index
Hi Otis, Won't that delete all documents with term1, then all documents with term2... rather than deleting only the documents that contain both term1 and term2? Or am I missing the obvious and doing something wrong? Thanks, Josh
Otis Gospodnetic wrote: Heh, I have to try the obvious - two reader.delete(term) calls? Otis
- Original Message From: Josh Joy [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Tuesday, September 26, 2006 10:04:13 PM Subject: Multiple Terms, Delete From Index Hi All, I need to delete from the index where 2 terms match, rather than just one term. For example: IndexReader reader = IndexReader.open(dir); Term[] terms = new Term[2]; terms[0] = new Term("city", "city1"); terms[1] = new Term("state", "state1"); reader.delete(terms); reader.close(); Any suggestions? Thanks in advance, Josh
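Josh is right: two delete(term) calls OR the deletions together rather than ANDing them, and there is no delete(Term[]) in the 2.0 API. One way to get AND semantics is to search for the conjunction and delete the matching doc ids. A sketch using Josh's field names; the ids are collected before deleting so the index isn't modified underneath the Hits iterator:

    IndexSearcher searcher = new IndexSearcher(dir);
    BooleanQuery query = new BooleanQuery();
    query.add(new TermQuery(new Term("city", "city1")), BooleanClause.Occur.MUST);
    query.add(new TermQuery(new Term("state", "state1")), BooleanClause.Occur.MUST);

    // Collect the matching doc ids first...
    Hits hits = searcher.search(query);
    int[] ids = new int[hits.length()];
    for (int i = 0; i < ids.length; i++) {
        ids[i] = hits.id(i);
    }

    // ...then delete them through the searcher's underlying reader.
    IndexReader reader = searcher.getIndexReader();
    for (int i = 0; i < ids.length; i++) {
        reader.deleteDocument(ids[i]);
    }

    // Closing the searcher also closes the reader it opened, which
    // commits the deletions.
    searcher.close();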