RE: HITCOLLECTOR+SCORE+DELIMA
On Dec 10, 2004, at 7:39 AM, Karthik N S wrote:

I am still in delima on how to use the HitCollector for returning Hits with scores between 0.2f and 1.0f. There is no simple example for this, yet there is lots of talk about its usage on the forum.

1) I am not 100% sure about this, but it might work. Add the extra score check to search() in IndexSearcher.java:

  // inherit javadoc
  public TopDocs search(Query query, Filter filter, final int nDocs)
      throws IOException {
    Scorer scorer = query.weight(this).scorer(reader);
    if (scorer == null)
      return new TopDocs(0, new ScoreDoc[0]);

    final BitSet bits = filter != null ? filter.bits(reader) : null;
    final HitQueue hq = new HitQueue(nDocs);
    final int[] totalHits = new int[1];
    scorer.score(new HitCollector() {
        public final void collect(int doc, float score) {
          if (score > 0.0f &&                     // ignore zeroed buckets
              score > 0.2f && score < 1.0f &&     // added: keep only scores in the wanted range
              (bits == null || bits.get(doc))) {  // skip docs not in bits
            totalHits[0]++;
            hq.insert(new ScoreDoc(doc, score));
          }
        }
      });

2) Filter examples are in the Lucene in Action book, Chapter 5. I wrote an example as well:

  String query = "odyssey";
  BooleanQuery bq = new BooleanQuery();
  bq.add(new TermQuery(new Term("content", query)), true, false);
  BooleanQuery bqf = new BooleanQuery();
  bqf.add(new TermQuery(new Term("H2", query)), true, false);
  Filter f = new QueryFilter(bqf);
  IndexReader reader = IndexReader.open(new File(dir, "index").getCanonicalPath());
  Searcher luceneSearcher = new org.apache.lucene.search.IndexSearcher(reader);
  luceneSearcher.setSimilarity(new NutchSimilarity());
  // Logically the following is executed as follows: find all
  // the docs matching bq, then select the ones which match bqf.
  hits = luceneSearcher.search(bq, f);
  System.out.print("query: " + query);
  System.out.println("Total hits: " + hits.length());

3) "delima" is spelled "dilemma".

-Vikas Gupta

- To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]
RE: HITCOLLECTOR+SCORE+DELIMMA
Hi Vikas Gupta,

Erik replied to my last mail, and I understood that a FILTER can be built to fetch scores between 0.2f and 1.0f. Can you please spare me some code for the same?

[Sorry for the spelling mistake; my mail client does not have a spell checker.]

With regards,
Karthik

-----Original Message-----
From: Vikas Gupta [mailto:[EMAIL PROTECTED]
Sent: Monday, December 13, 2004 3:17 PM
To: Lucene Users List
Subject: RE: HITCOLLECTOR+SCORE+DELIMA

On Dec 10, 2004, at 7:39 AM, Karthik N S wrote:

I am still in delima on how to use the HitCollector for returning Hits with scores between 0.2f and 1.0f. There is no simple example for this, yet there is lots of talk about its usage on the forum.

1) I am not 100% sure about this, but it might work. Add the extra score check to search() in IndexSearcher.java:

  // inherit javadoc
  public TopDocs search(Query query, Filter filter, final int nDocs)
      throws IOException {
    Scorer scorer = query.weight(this).scorer(reader);
    if (scorer == null)
      return new TopDocs(0, new ScoreDoc[0]);

    final BitSet bits = filter != null ? filter.bits(reader) : null;
    final HitQueue hq = new HitQueue(nDocs);
    final int[] totalHits = new int[1];
    scorer.score(new HitCollector() {
        public final void collect(int doc, float score) {
          if (score > 0.0f &&                     // ignore zeroed buckets
              score > 0.2f && score < 1.0f &&     // added: keep only scores in the wanted range
              (bits == null || bits.get(doc))) {  // skip docs not in bits
            totalHits[0]++;
            hq.insert(new ScoreDoc(doc, score));
          }
        }
      });

2) Filter examples are in the Lucene in Action book, Chapter 5. I wrote an example as well:

  String query = "odyssey";
  BooleanQuery bq = new BooleanQuery();
  bq.add(new TermQuery(new Term("content", query)), true, false);
  BooleanQuery bqf = new BooleanQuery();
  bqf.add(new TermQuery(new Term("H2", query)), true, false);
  Filter f = new QueryFilter(bqf);
  IndexReader reader = IndexReader.open(new File(dir, "index").getCanonicalPath());
  Searcher luceneSearcher = new org.apache.lucene.search.IndexSearcher(reader);
  luceneSearcher.setSimilarity(new NutchSimilarity());
  // Logically the following is executed as follows: find all
  // the docs matching bq, then select the ones which match bqf.
  hits = luceneSearcher.search(bq, f);
  System.out.print("query: " + query);
  System.out.println("Total hits: " + hits.length());

3) "delima" is spelled "dilemma".

-Vikas Gupta
Re: HITCOLLECTOR+SCORE+DELIMMA
Dude, and I say this with love: it's open source, you've got the code. Take the initiative, do it yourself, be creative, and share your findings with the rest of us. Personally, I would be interested to see how you do this; keep your changes documented and share them.

Nader Henein

Karthik N S wrote:

Hi Erik,

Apologies, I got confused by the last mail.

Iterating over Hits returns a large number of hits, and iterating over Hits for scores consumes time, so how do I limit my search to scores between [X.xf and Y.yf] prior to getting the Hits?

Note: the search is done on a field of type 'Text', which consists of 'Contents' from various HTML documents.

Please advise me.

Karthik

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, December 13, 2004 5:05 PM
To: Lucene Users List
Subject: Re: HITCOLLECTOR+SCORE+DELIMA

On Dec 13, 2004, at 1:16 AM, Karthik N S wrote:

So you say I have to build a Filter to collect all the scores between the two bounds [0.2f to 1.0f]

My message is being misinterpreted. I said "filter" as a verb, not a noun. :) In other words, I was not intending to mean "write a Filter" - a Filter would not be able to filter on score.

so the API for the same would be Hits hit = search(Query query, Filter filtertoGetScore). But while writing the Filter, the score again depends on Hits: score = hits.score(x);

Again, you cannot write a Filter (capital 'F') to deal with score. Please re-read what I said below...

Hits are in descending score order, so you may just want to use Hits and filter based on the score provided by hits.score(i).

Iterate over Hits... when you encounter scores below your desired range, stop iterating. Why is this simple procedure not good enough for what you are trying to achieve?

Erik
Re: HITCOLLECTOR+SCORE+DELIMMA
On Dec 13, 2004, at 6:58 AM, Karthik N S wrote:

Iterating over Hits returns a large number of hits, and iterating over Hits for scores consumes time, so how do I limit my search to scores between [X.xf and Y.yf] prior to getting the Hits?

Why do you need to do this *prior* to getting Hits? You have yet to justify what you're asking. I almost guarantee you that navigating Hits in the way I said will be as fast as you need it to be.

Erik
Re: LIMO problems
Hi, I would like to know which library you use to search PPT files. Does POI support this? Thanks
Re: sorting tokenized field
If it has not been added to the release code already, is there a reason why not? It seems many people agree that this is important sorting functionality. The problem is that I can't get permission to use customized libraries in our company: either we use the library as is, or we implement our own. We don't want the pain of maintaining patched third-party library code whenever we migrate from one version to another, and I would assume everyone has the same problem. Is there any possibility that this patch contributed by Aviran can be added to the actual release branch?

Thanks,
Praveen

----- Original Message -----
From: Aviran [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, December 13, 2004 11:30 AM
Subject: RE: sorting tokenized field

The patch is very simple. It checks whether the field you want to sort on is tokenized; if it is, it loads the values from the documents into the sorting table. The only drawback of this approach is that loading the values this way is much slower than if the values were Keywords, but other than that it should work just fine.

Aviran
http://www.aviransplace.com

-----Original Message-----
From: Praveen Peddi [mailto:[EMAIL PROTECTED]
Sent: Monday, December 13, 2004 10:48 AM
To: lucenelist
Subject: Fw: sorting tokenized field

Hi all, I am forwarding the same email I sent before. Just wanted to try my luck again :). Thanks in advance.

Praveen

----- Original Message -----
From: Praveen Peddi [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, December 10, 2004 3:33 PM
Subject: Re: sorting tokenized field

Since I am not very familiar with the Lucene code, I couldn't make much of your patch. But has this patch already been tested and proven to be efficient? If so, why can't it be merged into the Lucene code and made part of the release? I think the bug is valid; it's very likely that people want to sort on tokenized fields. If I apply this patch to the Lucene code and use it myself, I will have a hard time managing it in future (when upgrading the Lucene library). If the patch is applied to the Lucene release code, it would be much easier for Lucene users. If possible, can someone explain what the patch does? I am trying to understand what exactly changed but could not figure it out.

Praveen

----- Original Message -----
From: Aviran [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Friday, December 10, 2004 2:30 PM
Subject: RE: sorting tokenized field

I have suggested a solution for this problem ( http://issues.apache.org/bugzilla/show_bug.cgi?id=30382 ). You can use the patch suggested there and recompile Lucene.

Aviran
http://www.aviransplace.com

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Friday, December 10, 2004 13:53 PM
To: Lucene Users List
Subject: Re: sorting tokenized field

On Dec 10, 2004, at 1:40 PM, Praveen Peddi wrote:

I read that tokenized fields cannot be sorted. In order to sort on a tokenized field, the application either has to duplicate the field under a different name without tokenizing it, or come up with something else. But shouldn't the search engine take care of this? Are there any plans to build this functionality into Lucene?

It would be wasteful for Lucene to assume any field you add should be available for sorting. Adding one more line to your indexing code to accommodate your sorting needs seems a pretty small price to pay.

Do you have suggestions to improve how this works? Or how it is documented?

Erik
Re: sorting tokenized field
On Dec 13, 2004, at 2:22 PM, Praveen Peddi wrote:

If it has not been added to the release code already, is there a reason why not?

As noted, there is a performance issue with sorting by tokenized fields. It would seem far more advisable for you to simply add another field used for sorting which is untokenized.

Why has it not been added? There have been several committers quite active in the codebase (myself excluded). If you wish for changes to be committed, perseverance and patience are key. Keep lobbying, but do so kindly. When there are viable alternatives (such as adding an untokenized field for sorting) then certainly there is less incentive to commit changes. Lucene's codebase is pretty clean and tight - it is wise for us to be very selective about changes to it.

It seems many people agree that this is important sorting functionality.

Many do, but not all. I'm -0 on this change, meaning I'm not vetoing it, but I'm not actually for it given the performance issue.

The problem is that I can't get permission to use customized libraries in our company.

No custom library is needed for you to add an untokenized field for sorting purposes.

Also, sorting is extensible. Check out the Lucene in Action code, specifically the lia.extsearch.sorting.DistanceSortingTest class. Maybe you could add your own custom sorting code that could do what you want without patching Lucene.

Is there any possibility that this patch contributed by Aviran can be added to the actual release branch?

Keep lobbying - other committers may feel differently than I do about it and add it.

Erik
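To make the "extra untokenized field" suggestion concrete, here is a minimal sketch in plain Java. It does not use Lucene at all; the class and method names (SortFieldSketch, Entry, searchSorted) are made up for illustration. The idea it tries to show is that each document carries tokens for matching plus one whole, untokenized value for ordering, which is roughly what adding a second field at indexing time gives you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Illustration only: models a tokenized search field plus an untokenized sort field. */
final class SortFieldSketch {

    /** One indexed record: tokens for searching, one whole value for sorting. */
    static final class Entry {
        final List<String> searchTokens;  // what a tokenized field contributes
        final String sortKey;             // what an untokenized field contributes

        Entry(String title) {
            this.searchTokens = Arrays.asList(title.toLowerCase(Locale.ROOT).split("\\s+"));
            this.sortKey = title;
        }
    }

    /** Finds entries containing the term, ordered by the whole-title sort key. */
    static List<String> searchSorted(List<Entry> index, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        List<Entry> matches = new ArrayList<Entry>();
        for (Entry e : index) {
            if (e.searchTokens.contains(wanted)) {
                matches.add(e);
            }
        }
        // A sort needs exactly one value per document; the tokens alone cannot provide that.
        Collections.sort(matches, new Comparator<Entry>() {
            public int compare(Entry a, Entry b) {
                return a.sortKey.compareTo(b.sortKey);
            }
        });
        List<String> titles = new ArrayList<String>();
        for (Entry e : matches) {
            titles.add(e.sortKey);
        }
        return titles;
    }
}
```

The cost is one more stored value per document, which matches the "one more line in your indexing code" trade-off described in this thread.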
Hit 2 has score less than Hit 3
I have come across a scenario where the hits returned are not sorted. Or maybe they are sorted, but the explanation is not correct.

Take a look at http://cofferdam.cs.utexas.edu:8080/search.jsp?query=space+odyssey&hitsPerPage=10&hitsPerSite=0

Look at the top 3 results:

Score of Hit 1 is 1.0188559
Score of Hit 2 is 0.9934416
Score of Hit 3 is 1.0188559

I can't explain how the score of hit 2 can be lower than that of hit 3. I thought the hits that were returned were sorted. One possibility is that the explanation for hit 2 is not computed correctly. Has anyone encountered this before?

FYI, the docs corresponding to hits 1, 2 and 3 have exactly the same scoring fields (by scoring fields, I mean the fields used in the query).

Thanks for your time.

-Vikas
Re: HITCOLLECTOR+SCORE+DELIMMA
On Dec 13, 2004, at 11:16 PM, Karthik N S wrote:

time [A simple search for 'handbags' returned 1,60,000 hits and the time taken was 440 secs, in the production environment. Maybe our coding is poor, but we are constantly improving the process.]

If your searches are taking 440 seconds, you have something more fundamentally wrong. You are either doing some large wildcard/range/fuzzy expansions or you're accessing every document from all your hits.

Is the searcher.search() method taking that long? I bet not. Or rather is it the iteration over the Hits that is killing the search time, which is what I suspect?

We've emphasized numerous times that calling hits.doc(i) is a resource hit. Don't do it for documents you aren't going to show. To filter by score, use hits.score(i) first.

{ O/S Linux Gentoo, RAM 1GB, Lucene 1.4.1, Appserver = Tomcat5, and BlackDawn Java 1.4.2 with args -XX:+UseParallelGC for garbage collection }

Please narrow your code down to a clean, succinct example that you can post. It is difficult to help you without details of your code (but let me emphasize again - it needs to be clean and succinct so it is quick for us to get a handle on).

To be one step ahead, we also have adjacent fields 'Vendor' and 'Price', on which we have to compare best/poor/least results accordingly. So we have to limit the hits accordingly, since the Lucene API does not provide any way to inject this limiting facility *prior* to getting the hits.

Ah, so you are accessing every document to get this field information.

It is incorrect that you cannot filter prior to getting hits. You have a couple of options in filtering by a field value - use a QueryFilter or simply AND a RangeQuery to the original query.

Erik
RE: HITCOLLECTOR+SCORE+DELIMMA
Hi Erik,

What exactly do you mean by this:

We've emphasized numerous times that calling hits.doc(i) is a resource hit. Don't do it for documents you aren't going to show. To filter by score, use hits.score(i) first.

I am a bit confused. Do you mean to say: replace hits.doc(i) by hits.score(i)?

Also:

Ah, so you are accessing every document to get this field information. It is incorrect that you cannot filter prior to getting hits. You have a couple of options in filtering by a field value - use a QueryFilter or simply AND a RangeQuery to the original query.

Since the portal we are building is an eCommerce one, we have to search for the search word across (7) x 1000 x 15000 documents, get the most relevant hits (wherever the score is between 0.5 and 1.0), and then sort on the adjacent fields 'Vendors' and 'Price' in ascending order.

In such a case we cannot use a RangeQuery without knowing beforehand what exactly the consumer wants.

Is it not possible to have a generalized Filter in future versions of the API, to inject some minor factors prior to getting the Hits returned?

Thanks in advance,
Karthik

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 3:44 PM
To: Lucene Users List
Subject: Re: HITCOLLECTOR+SCORE+DELIMMA

On Dec 13, 2004, at 11:16 PM, Karthik N S wrote:

time [A simple search for 'handbags' returned 1,60,000 hits and the time taken was 440 secs, in the production environment. Maybe our coding is poor, but we are constantly improving the process.]

If your searches are taking 440 seconds, you have something more fundamentally wrong. You are either doing some large wildcard/range/fuzzy expansions or you're accessing every document from all your hits.

Is the searcher.search() method taking that long? I bet not. Or rather is it the iteration over the Hits that is killing the search time, which is what I suspect?

We've emphasized numerous times that calling hits.doc(i) is a resource hit. Don't do it for documents you aren't going to show. To filter by score, use hits.score(i) first.

{ O/S Linux Gentoo, RAM 1GB, Lucene 1.4.1, Appserver = Tomcat5, and BlackDawn Java 1.4.2 with args -XX:+UseParallelGC for garbage collection }

Please narrow your code down to a clean, succinct example that you can post. It is difficult to help you without details of your code (but let me emphasize again - it needs to be clean and succinct so it is quick for us to get a handle on).

To be one step ahead, we also have adjacent fields 'Vendor' and 'Price', on which we have to compare best/poor/least results accordingly. So we have to limit the hits accordingly, since the Lucene API does not provide any way to inject this limiting facility *prior* to getting the hits.

Ah, so you are accessing every document to get this field information.

It is incorrect that you cannot filter prior to getting hits. You have a couple of options in filtering by a field value - use a QueryFilter or simply AND a RangeQuery to the original query.

Erik
Re: HITCOLLECTOR+SCORE+DELIMMA
On Dec 14, 2004, at 5:42 AM, Karthik N S wrote:

What exactly do you mean by this:

We've emphasized numerous times that calling hits.doc(i) is a resource hit. Don't do it for documents you aren't going to show. To filter by score, use hits.score(i) first.

I am a bit confused. Do you mean to say: replace hits.doc(i) by hits.score(i)?

Here is some pseudo-code:

  start = 0 or the starting index for the page you want to display
  finish = last hits index you want to display

  for i = start; i < finish; i++
    if hits.score(i) within tolerance
      grab hits.doc(i)

I'm working hard to be helpful here. I'm running out of answers for you though. You are ignoring my requests to actually post code. If you want further assistance, show us *exactly* what you're doing.

(7) x 1000 x 15000 documents, get the most relevant hits (wherever the score is between 0.5 and 1.0), and then sort on the adjacent fields 'Vendors' and 'Price' in ascending order. In such a case we cannot use a RangeQuery without knowing beforehand what exactly the consumer wants.

See above. I cannot help with this without actual code (succinct clear code!). Lucene can sort and filter if you leverage it appropriately. Please grab a copy of Lucene in Action for lots of details on sorting and filtering.

Is it not possible to have a generalized Filter in future versions of the API, to inject some minor factors prior to getting the Hits returned?

This already exists. Please try it out. There have been numerous posts about this topic. Lucene in Action covers it. Our source code download has examples.

Erik
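The pseudo-code above could be fleshed out roughly as follows. This is only a sketch: HitsLike is a hypothetical stand-in interface with the same three calls discussed in this thread (length, score, doc), not the real Lucene Hits class, and the page/min/max parameters are assumptions. The point is that the cheap score check happens first and the expensive document fetch happens only for entries on the page being displayed.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: check the cheap score first, fetch the document only if it will be shown. */
final class PageByScore {

    /** Hypothetical stand-in for the Hits calls discussed in this thread. */
    interface HitsLike {
        int length();
        float score(int i);   // cheap
        String doc(int i);    // expensive: loads the stored document
    }

    /** Returns the documents on one page whose score lies within [min, max]. */
    static List<String> page(HitsLike hits, int start, int pageSize, float min, float max) {
        List<String> shown = new ArrayList<String>();
        int finish = Math.min(start + pageSize, hits.length());
        for (int i = start; i < finish; i++) {
            float score = hits.score(i);
            if (score >= min && score <= max) {
                shown.add(hits.doc(i));   // only reached for hits that pass the score check
            }
        }
        return shown;
    }
}
```

With a page size of 10 or 20, the number of document loads stays bounded by the page size no matter how many hits the query produced.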
Opinions: Using Lucene as a thin database
I use Lucene as a legitimate search engine, which is cool. But I am also using it as a simple database. I build an index with a couple of keyword fields that allows me to retrieve values based on exact matches in those fields. This is all I need to do, so it works just fine for my needs. I also love the speed. The index is small enough that it is wicked fast. I was wondering if anyone out there is doing the same, or if there are any dissenting opinions on using Lucene for this purpose.
Re: sorting tokenized field
Hi Erik,

Thanks a lot for your kind response. I appreciate the details.

What I meant by a custom library is applying Aviran's patch to Lucene and maintaining it, not adding an extra field. Adding an extra field was my last option if I can't use the patch.

I did look at the extensible sorting, and in fact I wrote my own comparators (IgnoreCaseStringComparator and another custom comparator), and they work just fine. But I am not sure this extensible sorting feature helps me sort on a tokenized field without adding the extra field.

For now, I will just go for the extra field option, and later, if a more optimized solution is built into Lucene, I can use that.

Praveen

----- Original Message -----
From: Erik Hatcher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, December 13, 2004 3:01 PM
Subject: Re: sorting tokenized field

On Dec 13, 2004, at 2:22 PM, Praveen Peddi wrote:

If it has not been added to the release code already, is there a reason why not?

As noted, there is a performance issue with sorting by tokenized fields. It would seem far more advisable for you to simply add another field used for sorting which is untokenized.

Why has it not been added? There have been several committers quite active in the codebase (myself excluded). If you wish for changes to be committed, perseverance and patience are key. Keep lobbying, but do so kindly. When there are viable alternatives (such as adding an untokenized field for sorting) then certainly there is less incentive to commit changes. Lucene's codebase is pretty clean and tight - it is wise for us to be very selective about changes to it.

It seems many people agree that this is important sorting functionality.

Many do, but not all. I'm -0 on this change, meaning I'm not vetoing it, but I'm not actually for it given the performance issue.

The problem is that I can't get permission to use customized libraries in our company.

No custom library is needed for you to add an untokenized field for sorting purposes.

Also, sorting is extensible. Check out the Lucene in Action code, specifically the lia.extsearch.sorting.DistanceSortingTest class. Maybe you could add your own custom sorting code that could do what you want without patching Lucene.

Is there any possibility that this patch contributed by Aviran can be added to the actual release branch?

Keep lobbying - other committers may feel differently than I do about it and add it.

Erik
RE: Opinions: Using Lucene as a thin database
I don't have the requirement to do range-type selects, i.e. the only operator I would need is equals:

  Select * from MY_TABLE where MY_NUMERIC_FIELD = 80

The searchable fields in my model are always of type KEYWORD. I believe this forces the match to be exact, so thinking about it in anything other than "equals" terms would, I believe, be a mistake.

In any case, I believe that deciding to use Lucene as a thin DB means that your requirements for database selects are fairly simple and straightforward.

KLCobb

-----Original Message-----
From: Akmal Sarhan [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 10:24 AM
To: Lucene Users List
Subject: Re: Opinions: Using Lucene as a thin database

That sounds very interesting, but how do you handle queries like

  select * from MY_TABLE where MY_NUMERIC_FIELD > 80

As far as I know you only have the range query, so you would have to say

  my_numeric_filed:[80 TO ??]

but this would not work in the above-mentioned example, or am I missing something?

regards
Akmal

On Tue, 14.12.2004 at 16:07, Praveen Peddi wrote:

We also use Lucene for a similar purpose, except that we index and store quite a few fields. In fact I also update partial documents, as people suggested. I store all the indexed fields so I don't have to build the whole document again while updating a partial document. The reason we do this is speed. I found that a Lucene search over millions of objects is 4 to 5 times faster than our Oracle queries (of course this might be due to our pitiful database design :) ). It works great so far. The only caveat we have had until now was incremental updates. But now I am implementing real-time updates so that the data in the Lucene index is almost always in sync with the data in the database. So now our search does not go to the database at all.

Praveen

----- Original Message -----
From: Kevin L. Cobb [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 9:40 AM
Subject: Opinions: Using Lucene as a thin database

I use Lucene as a legitimate search engine, which is cool. But I am also using it as a simple database. I build an index with a couple of keyword fields that allows me to retrieve values based on exact matches in those fields. This is all I need to do, so it works just fine for my needs. I also love the speed. The index is small enough that it is wicked fast. I was wondering if anyone out there is doing the same, or if there are any dissenting opinions on using Lucene for this purpose.
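On Akmal's question about numeric ranges: a commonly used workaround, offered here only as a sketch and not something anyone in this thread posted, is to index numbers as zero-padded strings, on the assumption that range queries compare terms as plain strings. The helper below is plain Java with made-up names (PaddedNumbers, pad) and only demonstrates why padding makes string order agree with numeric order.

```java
/** Illustration only: zero-padding makes lexicographic order agree with numeric order. */
final class PaddedNumbers {

    /** Left-pads a non-negative number with zeros to a fixed width. */
    static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        String digits = Integer.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException("value is wider than " + width + " digits");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "80" as a string, which breaks a "> 80" range.
        System.out.println("9".compareTo("80") > 0);              // prints true
        // Padded: "00009" sorts before "00080", matching numeric order.
        System.out.println(pad(9, 5).compareTo(pad(80, 5)) < 0);  // prints true
    }
}
```

An open-ended "greater than 80" would then be expressed as a range from the padded lower bound up to the largest padded value the field can hold. For pure equality lookups, as Kevin describes, none of this is needed.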
Re: Opinions: Using Lucene as a thin database
On Dec 14, 2004, at 9:40 AM, Kevin L. Cobb wrote:

I use Lucene as a legitimate search engine, which is cool. But I am also using it as a simple database. I build an index with a couple of keyword fields that allows me to retrieve values based on exact matches in those fields. This is all I need to do, so it works just fine for my needs. I also love the speed. The index is small enough that it is wicked fast. I was wondering if anyone out there is doing the same, or if there are any dissenting opinions on using Lucene for this purpose.

I use Lucene as the complete data storage for my blog at http://www.blogscene.org/erik - all HTTP requests map to a Lucene query (based on the path and an optional query parameter). I've been lame and have never put any caching in there.

I'm about to start a new project that really needs a relational database under the covers, but I'm cringing at the headaches involved compared to the joys of using Lucene.

Erik
Search HTML Files
I've been trying the Lucene demo applications for searching HTML files. I would like to know which problems exist, or which options are not implemented, in this web application. Thanks
Re: HITCOLLECTOR+SCORE+DELIMA
On Dec 13, 2004, at 1:16 AM, Karthik N S wrote:

So you say I have to build a Filter to collect all the scores between the two bounds [0.2f to 1.0f]

My message is being misinterpreted. I said "filter" as a verb, not a noun. :) In other words, I was not intending to mean "write a Filter" - a Filter would not be able to filter on score.

so the API for the same would be Hits hit = search(Query query, Filter filtertoGetScore). But while writing the Filter, the score again depends on Hits: score = hits.score(x);

Again, you cannot write a Filter (capital 'F') to deal with score. Please re-read what I said below...

Hits are in descending score order, so you may just want to use Hits and filter based on the score provided by hits.score(i).

Iterate over Hits... when you encounter scores below your desired range, stop iterating. Why is this simple procedure not good enough for what you are trying to achieve?

Erik
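The "iterate and stop" procedure can be sketched in a few lines of plain Java. The float array below merely stands in for successive hits.score(i) values and is assumed to be in descending order, as described above; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: scan descending scores and stop at the first one below the range. */
final class ScoreRangeScan {

    /** Returns the indexes of hits whose score lies within [min, max]. */
    static List<Integer> indexesInRange(float[] descendingScores, float min, float max) {
        List<Integer> kept = new ArrayList<Integer>();
        for (int i = 0; i < descendingScores.length; i++) {
            float score = descendingScores[i];  // plays the role of hits.score(i)
            if (score < min) {
                break;                          // later scores can only be lower, so stop here
            }
            if (score <= max) {
                kept.add(i);                    // only these would be worth fetching as documents
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        float[] scores = {1.0f, 0.8f, 0.5f, 0.19f, 0.1f};
        System.out.println(indexesInRange(scores, 0.2f, 1.0f));  // prints [0, 1, 2]
    }
}
```

Because the loop breaks at the first score below the lower bound, the work done is proportional to the number of hits in range rather than to the total number of hits.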
Re: LIMO problems
Daniel Cortes wrote:

Hi, I would like to know which library you use to search PPT files.

I use this (native code): http://chicago.sourceforge.net/xlhtml

Does POI support this? Thanks
RE: HITCOLLECTOR+SCORE+DELIMMA
Hi Erik,

Apologies...

In this mail, http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED] che.orgmsgNo=11254, I already told you that doc.get() was coming back in batches for a mere '4000' hits, and this is happening in real time. [A simple search for 'handbags' returned 1,60,000 hits and the time taken was 440 secs, in the production environment. Maybe our coding is poor, but we are constantly improving the process.]

{ O/S Linux Gentoo, RAM 1GB, Lucene 1.4.1, Appserver = Tomcat5, and BlackDawn Java 1.4.2 with args -XX:+UseParallelGC for garbage collection }

To be one step ahead, we also have adjacent fields 'Vendor' and 'Price', on which we have to compare best/poor/least results accordingly. So we have to limit the hits accordingly, since the Lucene API does not provide any way to inject this limiting facility *prior* to getting the hits.

[Excuse me, Nader Henein: I am on the Lucene-Users forum, NOT the Lucene-Developers forum, so we expect at least some help.]

With warm regards,
Karthik

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, December 13, 2004 6:39 PM
To: Lucene Users List
Subject: Re: HITCOLLECTOR+SCORE+DELIMMA

On Dec 13, 2004, at 6:58 AM, Karthik N S wrote:

Iterating over Hits returns a large number of hits, and iterating over Hits for scores consumes time, so how do I limit my search to scores between [X.xf and Y.yf] prior to getting the Hits?

Why do you need to do this *prior* to getting Hits? You have yet to justify what you're asking. I almost guarantee you that navigating Hits in the way I said will be as fast as you need it to be.

Erik
RE: Sorting based on calculations at search time
Ah! It makes sense now... Thanks for the clarification, Hoss. I think it'll work in my case, as I need to perform this calculation for every search...

-Guru.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Chris Hostetter
Sent: Friday, December 10, 2004 10:21 PM
To: Lucene Users List
Subject: RE: Sorting based on calculations at search time

: I believe you are talking about the boost factor for fields or documents
: while searching. That does not apply in my case - maybe I am missing a
: point here.
: The weight field I was talking about is only for the calculation

Otis is suggesting that you set the boost of the document to be your weight value. That way Lucene will automatically do your multiplication calculation when determining the score.

The downside of this is that I don't think there's any way to keep it from influencing the score on every search, so it's not something you could use only on some queries.

-Hoss
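The effect Hoss describes can be illustrated with a simplified model in plain Java: each document's raw score is multiplied by a per-document weight, and the ranking follows the product. This is only a sketch with made-up names (BoostSketch, rank); real Lucene scoring folds the boost in with other factors, so treat the arithmetic as an approximation of the idea rather than the actual formula.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Illustration only: ranks documents by rawScore * weight, highest first. */
final class BoostSketch {

    /** Returns document indexes ordered by rawScore[i] * weight[i], descending. */
    static Integer[] rank(final float[] rawScore, final float[] weight) {
        if (rawScore.length != weight.length) {
            throw new IllegalArgumentException("one weight per document is required");
        }
        Integer[] order = new Integer[rawScore.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                // Arguments swapped so that larger products sort first.
                return Float.compare(rawScore[b] * weight[b], rawScore[a] * weight[a]);
            }
        });
        return order;
    }
}
```

Since the weight is applied to every ranking, this also shows the downside mentioned above: there is no per-query switch in this model to turn the multiplication off.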