RE: Relevance percentage
Thanks much for the reply.

Chuck Williams <[EMAIL PROTECTED]> wrote:
> The coord() value is not saved anywhere so you would need to recompute it.
> You could either call explain() and parse the result string, or better,
> look at explain() and implement what it does more efficiently just for
> coord(). [...]
Re: Relevance percentage
Thanks much for the reply.

Paul Elschot <[EMAIL PROTECTED]> wrote:
> In case you only want the coordination factor to have more influence in
> the order of your search results you can use a Similarity with a coord()
> function that has a power higher than 1. [...] I'd first try values
> between 3.0f and 5.0f for SOME_POWER.
RE: Relevance and ranking ...
Hi Chuck Williams, Paul Elschot,

Thanks so much for the reply. By overriding coord() as follows, I was able
to get the right order for the example that I gave in this thread:

    public float coord(int overlap, int maxOverlap) {
        return (float) Math.pow((overlap / (float)maxOverlap), SOME_POWER);
    }

using 2.0f for SOME_POWER. As Chuck Williams suggested, I am trying more
example cases.

Thanks again,
Gururaja

Chuck Williams <[EMAIL PROTECTED]> wrote:
> I believe your sole problem is that you need to tone down your lengthNorm.
> Because doc4 is 10 times longer than doc2, its lengthNorm is less than 1/3
> of that of doc2 (1/sqrt(10) to be precise). This is a larger effect than
> the higher coord factor (1/.8) and the extra matching term in doc4. [...]
index size doubled?
I'm testing the rebuilding of the index. I add several hundred documents,
optimize, add another few hundred, and so on. Right now I have around 7000
files. I observed that after the index gets to a certain size, every time
after optimize there are two files of roughly the same size, like below:

12/20/2004  01:57p              13 deletable
12/20/2004  01:57p              29 segments
12/20/2004  01:53p      14,460,367 _5qf.cfs
12/20/2004  01:57p      15,069,013 _5zr.cfs

The total index is double what I expect. This is not always reproducible
(I'm constantly tuning my program and the set of documents); sometimes I
get a decent single file after optimize. What is happening?
RE: determination of matching hits
This is not the official recommendation, but I'd suggest you at least
consider: http://issues.apache.org/bugzilla/show_bug.cgi?id=32674

If you're not using Java 1.5 and you decide you want to use it, you'd need
to take out those dependencies. If you improve it, please share.

Chuck

> -----Original Message-----
> From: Christiaan Fluit [mailto:[EMAIL PROTECTED]
> Sent: Monday, December 20, 2004 2:51 PM
> To: Lucene Users List
> Subject: Re: determination of matching hits
>
> ok, I feel a bit stupid now ;) Turns out this issue has been discussed a
> while ago on both mailing lists and I even participated in one of
> them... shame on me. [...]
Re: determination of matching hits
ok, I feel a bit stupid now ;) Turns out this issue has been discussed a
while ago on both mailing lists and I even participated in one of them...
shame on me.

The problem is indeed in how MFQP parses my query: the query A -B becomes:

(text:A -text:B) (title:A -title:B) (path:A -path:B) (summary:A -summary:B)
(agent:A -agent:B)

whereas I intuitively expected it to be evaluated as "A in any field and
not B in any field". When I use a normal QueryParser and let it use a
single field only, everything works as expected.

Browsing the list archives I see that there were some efforts from
different people to solve this issue, but I'm a bit confused about the
final outcome. Was this solved in the MFQP in 1.4.3? If not, what
alternative implementation of MFQP can I currently use best?

Kind regards,

Chris
--

Erik Hatcher wrote:
> Christian,
>
> Please simplify your situation. Use a plain TermQuery for "B" and see
> what is returned. Then use a simple BooleanQuery for "A -B". I suspect
> MultiFieldQueryParser is the culprit. What does the toString of the
> generated Query return? MFQP is known to be trouble, and an overhaul to
> it has been contributed recently.
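One way to get the intended "A in any field and not B in any field"
semantics is to build the query by hand instead of using MFQP. The
following is only a sketch against the Lucene 1.4 API (where
BooleanQuery.add takes required/prohibited booleans); the five field names
come from the message above, and the terms "a"/"b" are placeholders for
already-analyzed terms:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    String[] fields = {"text", "title", "path", "summary", "agent"};

    // "A in any field": a disjunction of per-field TermQuerys
    BooleanQuery anyFieldA = new BooleanQuery();
    // "B in any field": same shape, negated below as a single unit
    BooleanQuery anyFieldB = new BooleanQuery();
    for (int i = 0; i < fields.length; i++) {
        anyFieldA.add(new TermQuery(new Term(fields[i], "a")), false, false);
        anyFieldB.add(new TermQuery(new Term(fields[i], "b")), false, false);
    }

    BooleanQuery query = new BooleanQuery();
    query.add(anyFieldA, true, false);   // required: A somewhere
    query.add(anyFieldB, false, true);   // prohibited: B nowhere

The key difference from the MFQP expansion is that the prohibition applies
to the whole "B in any field" disjunction, not per field.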
Re: analyzer effecting phrases?
On Dec 20, 2004, at 12:43 PM, Peter Posselt Vestergaard wrote:
> Therefore I turned back to the standard analyzer and now do some
> replacing of the underscores in my ID string to avoid my original
> problem. This solved my phrase problem so that I can now search for
> phrases. However I still have the problem with ",.:" described above.
> As far as I can see the StandardAnalyzer (the StandardTokenizer that is)
> should tokenize words without the ",.:" characters. Am I mistaken? Is
> there a tokenizer that will do this?

StandardAnalyzer does tokenize without ",.:", though it will keep domain
names together. Here's an example:

$ ant -emacs AnalyzerDemo
Buildfile: build.xml

AnalyzerDemo:
Demonstrates analysis of sample text. Refer to the "Analysis" chapter for
much more on this extremely crucial topic.
Press return to continue...
String to analyze: [This string will be analyzed.]
Example with commas, colons, and dots. You can get this code from
http://www.lucenebook.com
Running lia.analysis.AnalyzerDemo...
Analyzing "Example with commas, colons, and dots. You can get this code
from http://www.lucenebook.com"

WhitespaceAnalyzer:
  [Example] [with] [commas,] [colons,] [and] [dots.] [You] [can] [get]
  [this] [code] [from] [http://www.lucenebook.com]

SimpleAnalyzer:
  [example] [with] [commas] [colons] [and] [dots] [you] [can] [get]
  [this] [code] [from] [http] [www] [lucenebook] [com]

StopAnalyzer:
  [example] [commas] [colons] [dots] [you] [can] [get] [code] [from]
  [http] [www] [lucenebook] [com]

StandardAnalyzer:
  [example] [commas] [colons] [dots] [you] [can] [get] [code] [from]
  [http] [www.lucenebook.com]

BUILD SUCCESSFUL
Total time: 7 seconds
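For checking an analyzer's output directly in your own code, a minimal
sketch along these lines works against the Lucene 1.4 analysis API; the
field name "contents" and the sample string are arbitrary choices:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class ShowTokens {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();
            TokenStream stream = analyzer.tokenStream("contents",
                    new StringReader("Example with commas, colons, and dots."));
            // In Lucene 1.4, TokenStream.next() returns null at end of input
            for (Token t = stream.next(); t != null; t = stream.next()) {
                System.out.print("[" + t.termText() + "] ");
            }
            stream.close();
            System.out.println();
        }
    }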
Re: Relevance percentage
On Monday 20 December 2004 15:09, Gururaja H wrote:
> Hi,
>
> But, how to calculate the coord() fraction? I know by default, in
> DefaultSimilarity the coord() fraction is defined as below:
>
>     /** Implemented as overlap / maxOverlap. */
>     public float coord(int overlap, int maxOverlap) {
>         return overlap / (float)maxOverlap;
>     }
>
> How to get the overlap and maxOverlap value in each of the matched
> document(s)?

In case you only want the coordination factor to have more influence in
the order of your search results you can use a Similarity with a coord()
function that has a power higher than 1:

    public float coord(int overlap, int maxOverlap) {
        return (float) Math.pow((overlap / (float)maxOverlap), SOME_POWER);
    }

I'd first try values between 3.0f and 5.0f for SOME_POWER. The searching
code precomputes all coord values once per query per search, so there is
no need to worry about the computing efficiency.

This has the advantage that the other scoring factors are still used for
ranking. Since the other factors can vary quite a bit, it is difficult to
guarantee that any coord() implementation will provide a score that sorts
by the number of matching clauses. Higher powers as above can come a long
way, though.

Regards,
Paul Elschot
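Wired into a complete Similarity, Paul's suggestion might look like the
sketch below. The class name and the concrete SOME_POWER value are
illustrative choices, not part of the thread:

    import org.apache.lucene.search.DefaultSimilarity;

    public class CoordEmphasisSimilarity extends DefaultSimilarity {
        // Paul suggests trying values between 3.0f and 5.0f
        private static final float SOME_POWER = 4.0f;

        public float coord(int overlap, int maxOverlap) {
            // Raising the fraction to a power > 1 widens the gap between
            // documents matching many clauses and those matching few
            return (float) Math.pow(overlap / (float) maxOverlap, SOME_POWER);
        }
    }

    // usage:
    //   IndexSearcher searcher = new IndexSearcher("index");
    //   searcher.setSimilarity(new CoordEmphasisSimilarity());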
RE: Relevance and ranking ...
I believe your sole problem is that you need to tone down your lengthNorm.
Because doc4 is 10 times longer than doc2, its lengthNorm is less than 1/3
of that of doc2 (1/sqrt(10) to be precise). This is a larger effect than
the higher coord factor (1/.8) and the extra matching term in doc4.

In your original description, it sounds like you want coord() to dominate
lengthNorm(), with lengthNorm() just being used as a tie-breaker among
queries with the same coord(). To achieve this, you need to reduce the
impact of the lengthNorm() differences, by changing the sqrt() function in
the computation of lengthNorm to something much flatter. E.g., you might
use:

    public float lengthNorm(String fieldName, int numTerms) {
        return (float)(1.0 / Math.log10(1000+numTerms));
    }

I'm not sure whether that specific formula will work, but you can find one
that will by adjusting the base of the logarithm and the additive constant
(1000 in the example).

Some general things:
1. You need to reindex when you change the Similarity (it is used for
   indexing and searching -- e.g., the lengthNorm's are computed at index
   time).
2. Be careful not to overtune your scoring for just one example. Try many
   examples. You won't be able to get it perfect -- the idea is to get
   close to your subjective judgments as frequently as possible.
3. The idea here is to find a value of lengthNorm() that doesn't override
   coord, but still provides the tie-breaking you are looking for (doc2
   ahead of doc3).

Chuck

> -----Original Message-----
> From: Gururaja H [mailto:[EMAIL PROTECTED]
> Sent: Sunday, December 19, 2004 10:10 PM
> To: Lucene Users List
> Subject: RE: Relevance and ranking ...
>
> Chuck Williams,
>
> Thanks for the reply. Source code and output are below. Please give me
> your inputs.
>
> Default document order I am getting is: Doc#2, Doc#4, Doc#3, Doc#1.
> Document order needed is: Doc#4, Doc#2, Doc#3, Doc#1.
>
> Let me know if you need more information.
>
> NOTE: Using the Lucene "Query" object, not BooleanQuery.
>
> Here is the source code:
>
>     Searcher searcher = new IndexSearcher("index");
>     Analyzer analyzer = new StandardAnalyzer();
>     BufferedReader in = new BufferedReader(new
>         InputStreamReader(System.in));
>     System.out.print("Query: ");
>     String line = in.readLine();
>     Query query = QueryParser.parse(line, "contents", analyzer);
>     System.out.println("Searching for: " + query.toString("contents"));
>     Hits hits = searcher.search(query);
>     System.out.println(hits.length() + " total matching documents");
>     for (int i = start; i < hits.length(); i++) {
>         Document doc = hits.doc(i);
>         System.out.print("Score is: " + hits.score(i));
>         // Use whatever your fields are here:
>         System.out.print(" title:");
>         System.out.print(doc.get("title"));
>         System.out.print(" description:");
>         System.out.println(doc.get("description"));
>         // End of fields
>         System.out.println(searcher.explain(query, hits.id(i)));
>         //System.out.println("Score of the document is: "+hits.score(i));
>         String path = doc.get("path");
>         if (path != null) {
>             System.out.println(i + ". " + path);
>             System.out.println("--------");
>         }
>     }
>
> Here is the output from the program:
>
> Query: ibm risc tape drive manual
> Searching for: ibm risc tape drive manual
> 4 total matching documents
>
> Score is: 0.16266039 title:null description:null
> 0.16266039 = product of:
>   0.20332548 = sum of:
>     0.03826245 = weight(contents:ibm in 1), product of:
>       0.31521872 = queryWeight(contents:ibm), product of:
>         0.7768564 = idf(docFreq=4)
>         0.40576187 = queryNorm
>       0.121383816 = fieldWeight(contents:ibm in 1), product of:
>         1.0 = tf(termFreq(contents:ibm)=1)
>         0.7768564 = idf(docFreq=4)
>         0.15625 = fieldNorm(field=contents, doc=1)
>     0.06340029 = weight(contents:risc in 1), product of:
>       0.40576187 = queryWeight(contents:risc), product of:
>         1.0 = idf(docFreq=3)
>         0.40576187 = queryNorm
>       0.15625 = fieldWeight(contents:risc in 1), product of:
>         1.0 = tf(termFreq(contents:risc)=1)
>         1.0 = idf(docFreq=3)
>         0.15625 = fieldNorm(field=contents, doc=1)
>     0.06340029 = weight(contents:tape in 1), product of:
>       0.40576187 = queryWeight(contents:tape), product of:
>         1.0 = idf(docFreq=3)
>         0.40576187 = queryNorm
>       0.15625 = fieldWeight(contents:tape in 1), product of:
>         1.0 = tf(termFreq(contents:tape)=1)
>         1.0 = idf(docFreq=3)
>         0.15625 = fieldNorm(field=contents, doc=1)
>     0.03826245 = weight(contents:drive in 1), product of:
>       0.31521872 = queryWeight(contents:drive), product of:
>         0.7768564 = idf(docFreq=4)
>         0.4
RE: analyzer effecting phrases?
Hi again,

Thanks for your answer, Otis. My analyzer did nothing more than combine
the WhitespaceTokenizer and the LowerCaseFilter. However, I found out that
this simple analyzer gave me problems with characters such as ",.:" when
searching (e.g. I would not be able to search for "world" in the string
"Hello world." as the "." became part of the last word). Therefore I
turned back to the standard analyzer and now do some replacing of the
underscores in my ID string to avoid my original problem. This solved my
phrase problem so that I can now search for phrases. However, I still have
the problem with ",.:" described above. As far as I can see the
StandardAnalyzer (the StandardTokenizer, that is) should tokenize words
without the ",.:" characters. Am I mistaken? Is there a tokenizer that
will do this?

Thanks for the help!

Regards,
Peter

> Date: Mon, 20 Dec 2004 08:19:42 -0800 (PST)
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> Subject: analyzer effecting phrases?
>
> When searching for phrases, what's important is the position of each
> token/word extracted by the Analyzer.
> WhitespaceAnalyzer/LowerCaseFilter don't do anything with the
> positional information. There is nothing else in your Analyzer? [...]
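One way to get both behaviors at once -- underscores kept inside ID
tokens, but ",.:" treated as separators -- is a small CharTokenizer
subclass. This is a sketch of that idea, not something from the thread;
the class name IdAnalyzer is made up:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharTokenizer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;

    public class IdAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Token characters: letters, digits, and underscore. Anything
            // else (whitespace, ",.:", quotes) ends the current token.
            Tokenizer tokenizer = new CharTokenizer(reader) {
                protected boolean isTokenChar(char c) {
                    return Character.isLetterOrDigit(c) || c == '_';
                }
            };
            return new LowerCaseFilter(tokenizer);
        }
    }

Like the Whitespace/LowerCase combination, this does no stop-word removal,
which suits the Italian texts; unlike it, punctuation no longer sticks to
the preceding word.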
RE: Relevance percentage
The coord() value is not saved anywhere so you would need to recompute it.
You could either call explain() and parse the result string, or better,
look at explain() and implement what it does more efficiently just for
coord().

If your queries are all BooleanQuery's of TermQuery's, then this is very
simple. Iterate down the list of BooleanClause's and count the number
whose score is > 0, then divide this by the total number of clauses. Take
a look at BooleanQuery.BooleanWeight.explain() as it does this (along with
generating the rest of the explanation).

If you support the full Lucene query language, then you need to look at
all the query types and decide what exactly you want to compute (as coord
is not always well-defined).

I'm on the West Coast of the U.S. so evidently on a very different time
zone from you -- will look at your other message next.

Chuck

> -----Original Message-----
> From: Gururaja H [mailto:[EMAIL PROTECTED]
> Sent: Monday, December 20, 2004 6:10 AM
> To: Lucene Users List; Mike Snare
> Subject: Re: Relevance percentage
>
> Hi,
>
> But, how to calculate the coord() fraction? I know by default, in
> DefaultSimilarity the coord() fraction is defined as overlap /
> (float)maxOverlap. How to get the overlap and maxOverlap value in each
> of the matched document(s)? [...]
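A sketch of the per-hit computation Chuck describes, under his stated
assumption that the query is a flat BooleanQuery of TermQuerys. Rather
than re-running explain(), it checks each term's postings directly; the
class and method names are illustrative:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class MatchPercent {
        /** Fraction of the query's terms present in document docId. */
        public static float computeMatchPercent(IndexReader reader,
                                                BooleanQuery query,
                                                int docId) throws IOException {
            BooleanClause[] clauses = query.getClauses();
            int matched = 0;
            for (int i = 0; i < clauses.length; i++) {
                Term term = ((TermQuery) clauses[i].query).getTerm();
                TermDocs termDocs = reader.termDocs(term);
                try {
                    // skipTo positions at the first doc >= docId
                    if (termDocs.skipTo(docId) && termDocs.doc() == docId) {
                        matched++;
                    }
                } finally {
                    termDocs.close();
                }
            }
            return matched / (float) clauses.length;
        }
    }

Multiplying the result by 100 gives the percentages from the original
example, e.g. 4 of 5 terms = 80%.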
sorting on a field that can have null values
Hi all,

I am getting a NullPointerException when I sort on a field that has a null
value for some documents. "Order by" in SQL does work on such fields, and
I think it puts all results with null values at the end of the list.
Shouldn't Lucene also do the same thing instead of throwing a
NullPointerException? Is this expected behaviour? Is Lucene always
expecting some value on the sortable fields?

I thought of putting empty strings instead of null values, but I think
empty strings are put first in the list while sorting, which is the
reverse of what anyone would want.

Following is the exception I saw in the error log:

java.lang.NullPointerException
  at org.apache.lucene.search.SortComparator$1.compare(Lorg.apache.lucene.search.ScoreDoc;Lorg.apache.lucene.search.ScoreDoc;)I(SortComparator.java:36)
  at org.apache.lucene.search.FieldSortedHitQueue.lessThan(Ljava.lang.Object;Ljava.lang.Object;)Z(FieldSortedHitQueue.java:95)
  at org.apache.lucene.util.PriorityQueue.upHeap()V(PriorityQueue.java:120)
  at org.apache.lucene.util.PriorityQueue.put(Ljava.lang.Object;)V(PriorityQueue.java:47)
  at org.apache.lucene.util.PriorityQueue.insert(Ljava.lang.Object;)Z(PriorityQueue.java:58)
  at org.apache.lucene.search.IndexSearcher$2.collect(IF)V(IndexSearcher.java:130)
  at org.apache.lucene.search.Scorer.score(Lorg.apache.lucene.search.HitCollector;)V(Scorer.java:38)
  at org.apache.lucene.search.IndexSearcher.search(Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Filter;ILorg.apache.lucene.search.Sort;)Lorg.apache.lucene.search.TopFieldDocs;(IndexSearcher.java:125)
  at org.apache.lucene.search.Hits.getMoreDocs(I)V(Hits.java:64)
  at org.apache.lucene.search.Hits.<init>(Lorg.apache.lucene.search.Searcher;Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Filter;Lorg.apache.lucene.search.Sort;)V(Hits.java:51)
  at org.apache.lucene.search.Searcher.search(Lorg.apache.lucene.search.Query;Lorg.apache.lucene.search.Sort;)Lorg.apache.lucene.search.Hits;(Searcher.java:41)

If it's a bug in Lucene, will it be fixed in the next release? Any
suggestions would be appreciated.

Praveen

**************************
Praveen Peddi
Sr Software Engg, Context Media, Inc.
email: [EMAIL PROTECTED]
Tel: 401.854.3475
Fax: 401.861.3596
web: http://www.contextmedia.com
**************************
Context Media - "The Leader in Enterprise Content Integration"
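A common workaround at index time is to make sure the sort field is never
absent, substituting a sentinel that collates after all real values. This
is only a sketch of that idea, not an official fix; the helper name is
made up:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class NullSafeIndexing {
        /** Adds a sortable keyword field, substituting a sentinel for null. */
        public static void addSortField(Document doc, String name, String value) {
            if (value == null) {
                // \uFFFF is the highest BMP character, so documents missing
                // a real value sort after all others instead of breaking
                // the sort with a NullPointerException
                value = "\uFFFF";
            }
            doc.add(Field.Keyword(name, value));
        }
    }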
Re: determination of matching hits
Christian,

Please simplify your situation. Use a plain TermQuery for "B" and see what
is returned. Then use a simple BooleanQuery for "A -B". I suspect
MultiFieldQueryParser is the culprit. What does the toString of the
generated Query return? MFQP is known to be trouble, and an overhaul to it
has been contributed recently.

Erik

On Dec 20, 2004, at 10:32 AM, Christiaan Fluit wrote:
> Hello all,
>
> I have a question regarding the determination of the set of matching
> documents, in particular (I guess) related to the NOT operator.
>
> In my case I have a document containing the terms A and B. When I query
> for either A or for B, I get this document back, just as expected. Now
> when I query for A -B, I once again get this document back. In other
> words: this document matches both B and a query containing the clause
> -B, which theoretically should never happen. [...]
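Erik's two checks, spelled out as a sketch against the 1.4 API; the field
name "text", the index path, and the lowercased terms are assumptions:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class QueryCheck {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("index");

            // 1. Plain TermQuery for B: the suspect document should match
            Query b = new TermQuery(new Term("text", "b"));
            System.out.println("B query: " + b.toString("text"));
            Hits bHits = searcher.search(b);
            System.out.println(bHits.length() + " hits for B");

            // 2. Hand-built BooleanQuery for "A -B": the same document
            //    must NOT appear here
            BooleanQuery aNotB = new BooleanQuery();
            aNotB.add(new TermQuery(new Term("text", "a")), true, false);
            aNotB.add(new TermQuery(new Term("text", "b")), false, true);
            System.out.println("A -B query: " + aNotB.toString("text"));
            Hits aNotBHits = searcher.search(aNotB);
            System.out.println(aNotBHits.length() + " hits for A -B");
        }
    }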
RE: Queries difference
Alex,

I think you want this:

+city:London +city:Amsterdam +address:1_street +address:2_street

Otis

--- Alex Kiselevski <[EMAIL PROTECTED]> wrote:
> Thanks Morus
> So if I understand right, if the second query is:
> +city(London) +city(Amsterdam) +address(1_street) +address(2_street)
> both queries have the same value? [...]
RE: Queries difference
Thanks Morus,

So if I understand right, if the second query is:

+city(London) +city(Amsterdam) +address(1_street) +address(2_street)

both queries have the same value?

-----Original Message-----
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: Monday, December 20, 2004 6:11 PM
To: Lucene Users List
Subject: Re: Queries difference

> I guess you mean city:(... and so on.
>
> The first query searches documents containing 'London' in city, scoring
> results also containing Amsterdam higher, and containing 1_street or
> 2_street in address. The second query searches for documents containing
> both London and Amsterdam in city and 1_street and 2_street in address.
> [...]
Re: analyzer effecting phrases?
When searching for phrases, what's important is the position of each
token/word extracted by the Analyzer. WhitespaceAnalyzer/LowerCaseFilter
don't do anything with the positional information. There is nothing else
in your Analyzer?

In any case, the following should help you see what your Analyzer is
doing: http://wiki.apache.org/jakarta-lucene/AnalysisParalysis and you can
augment the code there to provide positional information, too.

Otis

--- Peter Posselt Vestergaard <[EMAIL PROTECTED]> wrote:
> Hi
> I am building an index of texts, each related to a unique id. The
> unique ids might contain a number of underscores which will make the
> standardanalyzer shorten them after it sees the second underscore in a
> row. [...]
Re: Queries difference
Alex Kiselevski writes:
> Hello, I want to know is there a difference between queries:
>
> +city(+London Amsterdam) +address(1_street 2_street)
>
> And
>
> +city(+London) +city(Amsterdam) +address(1_street) +address(2_street)

I guess you mean city:(... and so on.

The first query searches documents containing 'London' in city, scoring
results also containing Amsterdam higher, and containing 1_street or
2_street in address. The second query searches for documents containing
both London and Amsterdam in city and 1_street and 2_street in address.
Note that the + before London in the second query doesn't mean anything.

HTH
Morus
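To see the difference concretely, the two (colon-corrected) queries can be
parsed and printed back. A sketch using Lucene 1.4's static
QueryParser.parse, with WhitespaceAnalyzer so terms like 1_street survive
intact; the default field "contents" is an arbitrary choice:

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class QueryCompare {
        public static void main(String[] args) throws Exception {
            WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
            Query q1 = QueryParser.parse(
                    "+city:(+London Amsterdam) +address:(1_street 2_street)",
                    "contents", analyzer);
            Query q2 = QueryParser.parse(
                    "+city:London +city:Amsterdam "
                    + "+address:1_street +address:2_street",
                    "contents", analyzer);
            // Printing the parsed structure shows which clauses are
            // required (+) and which are merely optional
            System.out.println(q1.toString("contents"));
            System.out.println(q2.toString("contents"));
        }
    }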
Queries difference
Hello, I want to know is there a difference between queries:

+city(+London Amsterdam) +address(1_street 2_street)

And

+city(+London) +city(Amsterdam) +address(1_street) +address(2_street)

Thanks in advance

Alex Kiselevsky
Speech Technology        Tel: 972-9-776-43-46
R&D, Amdocs - Israel     Mobile: 972-53-63 50 38
mailto:[EMAIL PROTECTED]
determination of matching hits
Hello all,

I have a question regarding the determination of the set of matching
documents, in particular (I guess) related to the NOT operator.

In my case I have a document containing the terms A and B. When I query
for either A or for B, I get this document back, just as expected. Now
when I query for A -B, I once again get this document back. In other
words: this document matches both B and a query containing the clause -B,
which theoretically should never happen.

I've seen this happen with various keywords, sometimes with multiple
"conflicting" documents. In each case, the B query returned the document
with a very low relevance (e.g. 0.007...).

Based on these low relevancies and a quick peek in the Lucene code, I
strongly suspect that this is caused by rounding errors, as it seems to me
that floating point numbers are used to both express the membership of a
set as well as its score. Can somebody confirm this?

And if this is the case, is there a workaround to eliminate or at least
significantly suppress this problem? A colleague mentioned boosting every
term in a query, would this solve anything?

For most search engine-like applications, which order documents on
relevance, I think this problem is not a real issue since such conflicting
documents appear at the end of the result list and are not likely to be
seen by the user. However, in our case we have an application which
displays overlaps of entire result sets and these documents show up very
prominently (I can show screenshots if desired). We have already been
asked by customers to explain these results :)

FYI, in case it may be relevant: I'm still using Lucene 1.4.2. Every
document has the same set of five fields. The above queries are parsed by
MultiFieldQueryParser, using all five fields. I haven't touched the
default operator, but the queries A AND -B and A AND NOT B give the same
conflicting overlap in the result set.

Thanks in advance,

Christiaan Fluit
Aduna.biz
analyzer effecting phrases?
Hi

I am building an index of texts, each related to a unique id. The unique
ids might contain a number of underscores, which will make the
StandardAnalyzer shorten them after it sees the second underscore in a
row. Furthermore, many of the texts I am indexing are in Italian, so the
removal of 'trivial' words done by the standard analyzer is not
necessarily meaningful for these texts. Therefore I am instead using an
analyzer made from the WhitespaceTokenizer and the LowerCaseFilter.

This works fine for me until I try searching for a phrase. I am searching
for a simple phrase containing two words and with double-quotes around it.
I have found the phrase in one of the texts so I know it should return at
least one result, but none is found. If I remove the double-quotes and
search for the 2 words with AND between them, I do find the story.

Can anyone tell me if this is an obvious (side-)effect of not using the
standard analyzer? And is there a better solution to my problem than using
the very simple analyzer?

Best regards
Peter Vestergaard

PS: I use the same analyzer for both searching and indexing (of course).
Re: Relevance percentage
Hi,

But, how to calculate the coord() fraction? I know by default, in
DefaultSimilarity the coord() fraction is defined as below:

    /** Implemented as overlap / maxOverlap. */
    public float coord(int overlap, int maxOverlap) {
        return overlap / (float)maxOverlap;
    }

How to get the overlap and maxOverlap value in each of the matched
document(s)?

Thanks,
Gururaja

Mike Snare <[EMAIL PROTECTED]> wrote:
> I'm still new to Lucene, but wouldn't that be the coord()? My
> understanding is that the coord() is the fraction of the boolean query
> that matched a given document.
>
> Again, I'm new, so somebody else will have to confirm or deny...
>
> -Mike [...]
Re: Relevance percentage
I'm still new to Lucene, but wouldn't that be the coord()? My
understanding is that the coord() is the fraction of the boolean query
that matched a given document.

Again, I'm new, so somebody else will have to confirm or deny...

-Mike

On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H
<[EMAIL PROTECTED]> wrote:
> How to find out the percentages of matched terms in the document(s)
> using Lucene? [...]
Re: Number of documents
On Dec 20, 2004, at 4:08 AM, Daniel Cortes wrote:
> I've to show to my boss if Lucene is the best option for creating a
> search engine for a new portal. I want to know how many documents you
> have in your index, and how big your DB is.

I highly recommend you use Luke to examine the index. It is a great tool
to have handy. It shows these statistics and many others.

> the types of formats the portal has to support are html jsp txt doc
> pdf ppt

HTML, TXT, DOC, and PDF are all quite straightforward to do. PPT is
possible, perhaps POI will do the trick. JSP depends on how you want to
analyze it. If any text in the file should be indexed (including JSP
directives, taglibs, and HTML) then you can treat it as a text file. If
you need to eliminate the tags then you'll need to parse the JSP somehow,
however I strongly recommend that content not reside in JSP pages but
rather in a content management system, database, or such.

> another question that I have is: I'm playing with the files of the book
> Lucene in Action and I try to use the example of handling types. The
> folder data contains 5 files, and the created index contains five
> documents, but the only one that contains any word in the index is the
> .html file. Does everybody have the same result?

Perhaps you are taking the output you see from "ant ExtensionFileHandler"
as an indication of what words were indexed. This output, however, is
showing Document.toString() which only shows the text in stored fields.
This particular example does not actually index the documents - it shows
the generalized handling framework and the parsing of the files into a
Lucene Document. Most of the file handlers use unstored fields. The
output I get is shown below. The handlers have successfully extracted the
text from the files.

Maybe you're referring to the FileIndexer example? We did not expose this
one to the Ant launcher. If FileIndexer is the code you're trying, let me
know what you've tried and how you're looking for the words that you
expect to see. Again, most of the fields are unstored (meaning the
original content is not stored in the index, only the terms extracted
through analysis).

Erik

# to make the output cleaner for e-mailing I set ANT_ARGS like this:
% echo $ANT_ARGS
-logger org.apache.tools.ant.NoBannerLogger -emacs -Dnopause=true

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/addressbook-entry.xml
Buildfile: build.xml
ExtensionFileHandler:
This example demonstrates the file extension document handler. Documents
with extensions .xml, .rtf, .doc, .pdf, .html, and .txt are all handled by
the framework. The contents of the Lucene Document built for the specified
file is displayed.
skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
log4j:WARN No appenders could be found for logger (org.apache.commons.digester.Digester.sax).
log4j:WARN Please initialize the log4j system properly.
Document Keyword Keyword Keyword Keyword Keyword Keyword Keyword>

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/HTML.html
Buildfile: build.xml
ExtensionFileHandler:
[same banner and property messages as above]
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document Text>

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/PlainText.txt
Buildfile: build.xml
ExtensionFileHandler:
[same banner and property messages as above]
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document>

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/PDF.pdf
Buildfile: build.xml
ExtensionFileHandler:
[same banner and property messages as above]
Running lia.handlingtypes.framework.ExtensionFileHandler...
log4j:WARN No appenders could be found for logger (org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.
Document>

% ant Extensio
Re: Optimising A Security Filter
Paul already replied, but I'll add my thoughts below to the thread also...

On Dec 19, 2004, at 5:05 PM, Steve Skillcorn wrote:
> I bought the Lucene in Action ebook, which is excellent and I can
> strongly recommend.

Thank you

> Does the IndexReader that is passed to the bits method of the filter
> represent the entire index, or just the results that match the query?

It represents the entire index at the time it was instantiated. This is
important to know in case documents are later added to the index.

> Is not worrying about filters and simply checking the returned Hit List
> before presenting a sensible approach?

It depends. Is the performance of checking a relational database for the
results being shown to the user acceptable? Is the security risk of a new
piece of code forgetting to check the results of a search worth it?

> I can see the point to filters as presented in the Lucene in Action
> ISBN example, but are they a good approach where they could end up
> laboriously marking the entire index as True?

Iterating through every document in the index certainly is time consuming
and not something you should do for every search. However, filters are
designed to be long-lived. Write your filter to simply do the logic of
checking each document against the database, then wrap your filter with
the caching wrapper. Be sure to use the same IndexReader for each search.
When the index changes, rebuild the filter. There is no clear best way to
do this type of filtering of results, I don't believe. There are details
to consider for either of these approaches.

> All help greatly appreciated. Thanks to the authors for Lucene in
> Action, it's given me the high level best practices I was needing.
>
> Steve

I really appreciate hearing this. Putting this work to public scrutiny
opens the possibilities of opinion. Your comments hearten me.

Erik
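A sketch of the pattern Erik describes, against the Lucene 1.4 Filter API.
The "id" field, the userId parameter, and the isPermitted() stub stand in
for whatever key your RDBMS permission check actually needs:

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    public class PermissionFilter extends Filter {
        private final String userId;

        public PermissionFilter(String userId) {
            this.userId = userId;
        }

        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) {
                    continue;  // skip deleted document slots
                }
                String docKey = reader.document(i).get("id");
                if (isPermitted(userId, docKey)) {
                    bits.set(i);
                }
            }
            return bits;
        }

        private boolean isPermitted(String user, String docKey) {
            return true;  // placeholder: query your permissions table here
        }
    }

Wrapped once as new CachingWrapperFilter(new PermissionFilter(user)) and
reused with the same IndexReader, the expensive bits() pass runs only the
first time; pass the wrapped filter to searcher.search(query, filter), and
rebuild it when the index changes.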
Number of documents
I have to show my boss whether Lucene is the best option for creating a
search engine for a new portal. I want to know: how many documents do you
have in your index? And how big is your DB? The types of formats the
portal has to support are html, jsp, txt, doc, pdf, and ppt.

Another question that I have: I'm playing with the files of the book
Lucene in Action and I try to use the example of handling types. The
folder data contains 5 files, and the created index contains five
documents, but the only one that contains any word in the index is the
.html file. Does everybody have the same result?
Relevance percentage
How to find out the percentages of matched terms in the document(s) using
Lucene?

Here is an example of what I am trying to do. The search query has 5 terms
(ibm, risc, tape, drive, manual) and there are 4 matching documents with
the following attributes:

Doc#1: contains terms (ibm, drive)
Doc#2: contains terms (ibm, risc, tape, drive)
Doc#3: contains terms (ibm, risc, tape, drive)
Doc#4: contains terms (ibm, risc, tape, drive, manual)

The percentages displayed would be 100% (Doc#4), 80% (Doc#2), 80% (Doc#3)
and 40% (Doc#1).

Any help on how to go about doing this?

Thanks,
Gururaja
Re: Optimising A Security Filter
On Sunday 19 December 2004 23:05, Steve Skillcorn wrote:
> Hello All;
>
> I bought the Lucene in Action ebook, which is excellent and I can
> strongly recommend. One question that has arisen from the book though
> is custom filters.
>
> I have the situation where the text of my docs is in Lucene, but the
> permissions are in my RDBMS. I can write a filter (in fact have done
> so) that loops through the documents in the passed IndexReader and
> queries the DB to detect if the user is permissioned for them, setting
> the relevant BitSet. My results are then paged (< last | next >) to a
> web page.
>
> Does the IndexReader that is passed to the bits method of the filter
> represent the entire index, or just the results that match the query?

The IndexReader represents the entire index.

> Is not worrying about filters and simply checking the returned Hit List
> before presenting a sensible approach?

That's done by the IndexSearcher.search() methods that take a filter
argument.

> I can see the point to filters as presented in the Lucene in Action
> ISBN example, but are they a good approach where they could end up
> laboriously marking the entire index as True?

The filter is checked only for search results on the query over the whole
index. The bit filters generally work well, except when you need a lot of
very sparse filters and memory is a concern.

Regards,
Paul Elschot
Re: Aramorph Analyzer
Hi,

Sorry, I (the aramorph maintainer ;-) was absent from the office...

Daniel Naber wrote:
> Analyzers that provide ambiguous terms (i.e. a token with more than one
> term at the same position) don't work in Lucene 1.4.

That is the correct answer. I've filed a bug about this:
http://issues.apache.org/bugzilla/show_bug.cgi?id=23307

> This feature has only recently been added to CVS.

... and I thank you very much for this commit.

Notice however that you may experience some problems with the query
parser, because Buckwalter's Arabic transliteration uses the standard "*"
wildcard character as a representation for dhal.

Notice also that aramorph has a mailing list for such questions:
http://lists.nongnu.org/mailman/listinfo/aramorph-users

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]
+33 (0)2 99 29 67 78