File locking using java.nio.channels.FileLock
Hi: When is Lucene planning on moving toward Java 1.4+? I see there are some problems caused by the current lock file implementation, e.g. Bug# 32171. These problems could easily be fixed by using the java.nio.channels.FileLock object. Thanks -John - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
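For reference, a minimal sketch of what a java.nio.channels.FileLock-based lock could look like (this only illustrates the 1.4 API; it is not Lucene's actual lock implementation, and the class and method names are invented):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

public class NioLockSketch {
    // Attempt a non-blocking exclusive lock on the given lock file.
    // Returns null if another process already holds the lock.
    // Note: the underlying file/channel must stay open while the lock is held.
    public static FileLock tryLock(File lockFile) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
        return raf.getChannel().tryLock();
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("lucene-sketch", ".lock");
        FileLock lock = tryLock(f);
        System.out.println(lock != null && lock.isValid());
        lock.release();
        f.delete();
    }
}
```

Unlike a lock file whose mere existence signals a lock, an OS-level FileLock is released automatically when the JVM dies, which is exactly the stale-lock problem described in the bug.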
RE: A question about scoring function in Lucene
I'll try to address all the comments here. The normalization I proposed a while back on lucene-dev is specified. Its properties can be analyzed, so there is no reason to guess about them. Re. Hoss's example and analysis, yes, I believe it can be demonstrated that the proposed normalization would make certain absolute statements like x and y meaningful. However, it is not a panacea -- there would be some limitations in these statements. To see what could be said meaningfully, it is necessary to recall a couple of detailed aspects of the proposal:

1. The normalization would not change the ranking order or the ratios among scores in a single result set from what they are now. Only two things change: the query normalization constant, and the ad hoc final normalization in Hits, which is eliminated because the scores are intrinsically between 0 and 1. Another way to look at this is that the sole purpose of the normalization is to set the score of the highest-scoring result. Once this score is set, all the other scores are determined, since the ratios of their scores to that of the top-scoring result do not change from today. Put simply, Hoss's explanation is correct.

2. There are multiple ways to normalize and achieve property 1. One simple approach is to set the top score based on the boost-weighted percentage of query terms it matches (assuming, for simplicity, the query is an OR-type BooleanQuery). So if all boosts are the same, the top score is the percentage of query terms matched. If there are boosts, then these give the terms a corresponding relative importance in the determination of this percentage. More complex normalization schemes would go further and allow the tf's and/or idf's to play a role in the determination of the top score -- I didn't specify details here and am not sure how good a thing that would be to do.

So, for now, let's just consider the properties of the simple boost-weighted-query-term-percentage normalization.
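As a rough illustration of the simple scheme in point 2, here is a hypothetical sketch (not the actual proposed patch; the class and method names are invented): the top score is set to the boost-weighted fraction of query terms the top document matches, and all other scores are scaled by the same factor so the ratios among results are preserved.

```java
public class NormalizationSketch {
    // rawScores: intrinsic scores in ranking order (rawScores[0] is the top hit);
    // boosts: per-query-term boosts; matchedTop: which terms the top doc matched.
    public static double[] normalize(double[] rawScores, double[] boosts, boolean[] matchedTop) {
        double matched = 0, total = 0;
        for (int i = 0; i < boosts.length; i++) {
            total += boosts[i];
            if (matchedTop[i]) matched += boosts[i];
        }
        double topTarget = matched / total;       // top score = boost-weighted % of terms matched
        double scale = topTarget / rawScores[0];  // same factor for all hits: ratios preserved
        double[] out = new double[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) out[i] = rawScores[i] * scale;
        return out;
    }
}
```

With equal boosts and both terms matched by the top document, the top score becomes 1.0 and the rest keep their relative ratios, matching the behavior described in points 1 and 2 above.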
Hoss's example could be interpreted as single-term phrases "Doug Cutting" and "Chris Hostetter", or as two-term BooleanQuery's. Considering both of these cases illustrates the absolute-statement properties and limitations of the proposed normalization. If they are single-term PhraseQuery's, then the top score will always be 1.0 assuming the phrase matches (while the other results have arbitrary fractional scores based on the tf*idf ratios, as today). If the queries are BooleanQuery's with no boosts, then the top score would be 1.0 or 0.5 depending on whether two terms or one were matched. This is meaningful.

In Lucene today, the top score is not meaningful. It will always be 1.0 if the highest intrinsic score is >= 1.0. I believe this could happen, for example, in a two-term BooleanQuery that matches only one term (if the tf on the matched document for that term is high enough). So, to be concrete, a score of 1.0 with the proposed normalization scheme would mean that all query terms are matched, while today a score of 1.0 doesn't really tell you anything. Certain absolute statements can therefore be made with the new scheme. This makes the absolute-threshold monitored search application possible, along with the segregating and filtering applications I've previously mentioned (call out good results and filter out bad results by using absolute thresholds). These analyses are simplified by using only BooleanQuery's, but I believe the properties carry over generally.

Doug also asked about research results. I don't know of published research on this topic, but I can again repeat an experience from InQuira. We found that end users benefited from a search experience where good results were called out and bad results were downplayed or filtered out. And we managed to achieve this with absolute thresholding through careful normalization (of a much more complex scoring mechanism).
To get a better intuitive feel for this, think about how you react to a search where all the results suck, but there is no visual indication of this that is any different from a search that returns great results.

Otis raised the patch I submitted for MultiSearcher. This addresses a related problem, in that the current MultiSearcher does not rank results equivalently to a single unified index -- specifically, it fails Daniel Naber's test case. However, this is just a simple bug whose fix doesn't require the new normalization. I submitted a patch to fix that bug, along with a caveat that I'm not sure the patch is complete, or even consistent with the intentions of the author of this mechanism.

I'm glad to see this topic is generating some interest, and apologize if anything I've said comes across as overly abrasive. I use and really like Lucene. I put a lot of focus on creating a great experience for the end user, and so am perhaps more concerned about quality of results and certain UI aspects than most other users.

Chuck
Re: Why does the StandardTokenizer split hyphenated words?
On Wednesday 15 December 2004 21:14, Mike Snare wrote:
> Also, the phrase query would place the same value on a doc that simply
> had the two words as a doc that had the hyphenated version, wouldn't it?
> This seems odd.

Not if these words are spelling variations of the same concept, which doesn't seem unlikely.

> In addition, why do we assume that a-1 is a "typical product name" but
> a-b isn't?

Maybe for "a-b", but what about English words like "half-baked"?

Regards
Daniel

--
http://www.danielnaber.de
Re: A question about scoring function in Lucene
Chris Hostetter wrote: For example, using the current scoring equation, if i do a search for "Doug Cutting" and the results/scores i get back are... 1: 0.9 2: 0.3 3: 0.21 4: 0.21 5: 0.1 ...then there are at least two meaningful pieces of data I can glean: a) document #1 is significantly better than the other results b) document #3 and #4 are both equally relevant to "Doug Cutting" If I then do a search for "Chris Hostetter" and get back the following results/scores... 9: 0.9 8: 0.3 7: 0.21 6: 0.21 5: 0.1 ...then I can assume the same corresponding information is true about my new search term (#9 is significantly better, and #7/#8 are equally as good) However, I *cannot* say either of the following: x) document #9 is as relevant for "Chris Hostetter" as document #1 is relevant to "Doug Cutting" y) document #5 is equally relevant to both "Chris Hostetter" and "Doug Cutting"

That's right. Thanks for the nice description of the issue.

I think the OP is arguing that if the scoring algorithm was modified in the way they suggested, then you would be able to make statements x & y. And I am not convinced that, with the changes Chuck describes, one can be any more confident of x and y.

Doug
Re: Why does the StandardTokenizer split hyphenated words?
: a-1 is considered a typical product name that needs to be unchanged : (there's a comment in the source that mentions this). Indexing : "hyphen-word" as two tokens has the advantage that it can then be found : with the following queries: : hyphen-word (will be turned into a phrase query internally) : "hyphen word" (phrase query) : (it cannot be found searching for hyphenword, however).

This isn't an area of Lucene that I've had a chance to investigate much yet, but if I recall from my reading, Lucene allows you to place multiple token sequences at the same position, generating something more easily described as a "token graph" than a "token stream" .. correct?

So given an input "the quick-brown fox jumped over the a-1 sauce" the tokenizer could generate a token stream that looks like: "the" "quick" "brown" OR "quick-brown" OR "quickbrown" "fox" "jumped" "over" "the" "a" "1" OR "a-1" OR "a1" "sauce" ...at which point, a minimum 2 character word length filter, and stop words filter could (if you wanted to use them) reduce that to... "quick" "brown" OR "quick-brown" OR "quickbrown" "fox" "jumped" "over" "a-1" OR "a1" "sauce" allowing all of these future (phrase) searches to match... the quick brown fox jumped over the a1 sauce the quickbrown fox jumped over the a1 sauce the quick-brown fox jumped over the a1 sauce the quick brown fox jumped over the a 1 sauce the quickbrown fox jumped over the a 1 sauce the quick-brown fox jumped over the a 1 sauce the quick brown fox jumped over the a-1 sauce the quickbrown fox jumped over the a-1 sauce the quick-brown fox jumped over the a-1 sauce ...correct? or am I misunderstanding?

-Hoss
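The stacking Hoss describes can be modeled with position increments: a token whose increment is 0 occupies the same position as the previous token, which is how variants like "quickbrown" can be stacked on "brown". This toy model only illustrates the idea; it is not Lucene's TokenStream API, and the names are invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TokenGraphSketch {
    // Each entry is {term, positionIncrement}; increment 0 stacks the term at
    // the same position as the previous token, forming a token "graph".
    public static Map<Integer, List<String>> positions(Object[][] tokens) {
        Map<Integer, List<String>> graph = new TreeMap<Integer, List<String>>();
        int pos = -1;
        for (Object[] t : tokens) {
            pos += ((Integer) t[1]).intValue();
            List<String> atPos = graph.get(pos);
            if (atPos == null) { atPos = new ArrayList<String>(); graph.put(pos, atPos); }
            atPos.add((String) t[0]);
        }
        return graph;
    }

    public static void main(String[] args) {
        Object[][] tokens = {
            {"quick", 1}, {"brown", 1}, {"quick-brown", 0}, {"quickbrown", 0}, {"fox", 1}
        };
        System.out.println(positions(tokens));
    }
}
```

A phrase matcher that walks positions would then accept "quick brown fox", "quick-brown fox", and "quickbrown fox" against the same indexed stream, which is the effect Hoss's list of matching searches relies on.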
Re: Why does the StandardTokenizer split hyphenated words?
On Dec 15, 2004, at 3:14 PM, Mike Snare wrote: [...] In addition, why do we assume that a-1 is a "typical product name" but a-b isn't? I am in no way second-guessing or suggesting a change, It just doesn't make sense to me, and I'm trying to understand. It is very likely, as is oft the case, that this is just one of those things one has to accept.

It is one of those things we have to accept... or in this case write our own analyzer. An Analyzer is a very special and custom choice. StandardAnalyzer is a general-purpose one, but quite insufficient in many cases. Like QueryParser. We're lucky to have these kitchen-sink pieces in Lucene to get us going quickly, but digging deeper we often need custom solutions.

I'm working on indexing the e-book of Lucene in Action. I'll blog up the details of this in the near future as case-study material, but here's the short version... I got the PDF file and ran pdftotext on it. Many words are split across lines with a hyphen. Often these pieces should be combined with the hyphen removed. Sometimes, though, these words are to be split. The scenario is different than yours, because I want the hyphens gone - though sometimes they are a separator and sometimes they should be removed. It depends. I wrote a custom analyzer with several custom filters in the pipeline... dashes are originally kept in the stream, and a later filter joins the two tokens, looks the result up in an exception list, and either combines them or leaves them separate. StandardAnalyzer would have wreaked havoc. The results of my work will soon be available to all to poke at, but for now a screenshot is all I have public: http://www.lucenebook.com Erik
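The recombining idea Erik describes can be sketched roughly like this (an invented illustration, not the actual Lucene in Action analyzer): when a token ends with a hyphen, an exception-list lookup decides whether the pieces are a genuinely hyphenated word, kept split, or a line-break artifact, rejoined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class HyphenJoinSketch {
    // keepSplit lists joined forms (e.g. "half-baked") that are real
    // hyphenations and should stay as two tokens; everything else ending
    // in "-" is treated as a line-break split and rejoined.
    public static List<String> filter(List<String> tokens, Set<String> keepSplit) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (t.endsWith("-") && i + 1 < tokens.size()) {
                String head = t.substring(0, t.length() - 1);
                String next = tokens.get(i + 1);
                if (keepSplit.contains(head + "-" + next)) {
                    out.add(head);        // genuine hyphenation: keep both parts
                    out.add(next);
                } else {
                    out.add(head + next); // line-break hyphen: rejoin
                }
                i++; // consume the following token in either case
            } else {
                out.add(t);
            }
        }
        return out;
    }
}
```

So ["analy-", "zer"] becomes ["analyzer"], while ["half-", "baked"] stays split if "half-baked" is on the exception list.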
Re: A question about scoring function in Lucene
Otis Gospodnetic wrote: There is one case that I can think of where this 'constant' scoring would be useful, and I think Chuck already mentioned this 1-2 months ago. For instance, having such scores would allow one to create alert applications where queries run by some scheduler would trigger an alert whenever the score is > X. So that is where the absolute value of the score would be useful.

Right, but the question is, would a single score threshold be effective for all queries, or would one need a separate score threshold for each query? My hunch is that the latter is better, regardless of the scoring algorithm. Also, just because Lucene's default scoring does not guarantee scores between zero and one does not necessarily mean that these scores are less "meaningful".

Doug
Re: A question about scoring function in Lucene
There is one case that I can think of where this 'constant' scoring would be useful, and I think Chuck already mentioned this 1-2 months ago. For instance, having such scores would allow one to create alert applications where queries run by some scheduler would trigger an alert whenever the score is > X. So that is where the absolute value of the score would be useful. I believe Chuck submitted some code that fixes this, which also helps with MultiSearcher, where you have to have this constant score in order to properly order hits from different Searchers, but I didn't dare to touch that code without further studying, for which I didn't have time. Otis --- Doug Cutting <[EMAIL PROTECTED]> wrote: > Chuck Williams wrote: > > I believe the biggest problem with Lucene's approach relative to > the pure vector space model is that Lucene does not properly > normalize. The pure vector space model implements a cosine in the > strictly positive sector of the coordinate space. This is guaranteed > intrinsically to be between 0 and 1, and produces scores that can be > compared across distinct queries (i.e., "0.8" means something about > the result quality independent of the query). > > I question whether such scores are more meaningful. Yes, such scores > > would be guaranteed to be between zero and one, but would 0.8 really > be > meaningful? I don't think so. Do you have pointers to research > which > demonstrates this? E.g., when such a scoring method is used, that > thresholding by score is useful across queries? > > Doug
Re: Why does the StandardTokenizer split hyphenated words?
> a-1 is considered a typical product name that needs to be unchanged > (there's a comment in the source that mentions this). Indexing > "hyphen-word" as two tokens has the advantage that it can then be found > with the following queries: > hyphen-word (will be turned into a phrase query internally) > "hyphen word" (phrase query) > (it cannot be found searching for hyphenword, however).

Sure. But phrase queries are slower than a single word query. In my case, using the standard analyzer prior to my modification caused a single (hyphenated) word query to take upwards of 10 seconds (1M+ documents with ~400K terms). The exact same search with the new Analyzer takes <.5 seconds (granted, the new tokenization caused a significant reduction in the number of terms).

Also, the phrase query would place the same value on a doc that simply had the two words as a doc that had the hyphenated version, wouldn't it? This seems odd. In addition, why do we assume that a-1 is a "typical product name" but a-b isn't? I am in no way second-guessing or suggesting a change; it just doesn't make sense to me, and I'm trying to understand. It is very likely, as is oft the case, that this is just one of those things one has to accept.
Re: A question about scoring function in Lucene
: I question whether such scores are more meaningful. Yes, such scores : would be guaranteed to be between zero and one, but would 0.8 really be : meaningful? I don't think so. Do you have pointers to research which : demonstrates this? E.g., when such a scoring method is used, that : thresholding by score is useful across queries?

I freely admit that I'm way out of my league on these scoring discussions, but I believe what the OP was referring to was not any intrinsic benefit in having a score between 0 and 1, but of having a uniform normalization of scores regardless of search terms.

For example, using the current scoring equation, if i do a search for "Doug Cutting" and the results/scores i get back are... 1: 0.9 2: 0.3 3: 0.21 4: 0.21 5: 0.1 ...then there are at least two meaningful pieces of data I can glean: a) document #1 is significantly better than the other results b) document #3 and #4 are both equally relevant to "Doug Cutting" If I then do a search for "Chris Hostetter" and get back the following results/scores... 9: 0.9 8: 0.3 7: 0.21 6: 0.21 5: 0.1 ...then I can assume the same corresponding information is true about my new search term (#9 is significantly better, and #7/#8 are equally as good) However, I *cannot* say either of the following: x) document #9 is as relevant for "Chris Hostetter" as document #1 is relevant to "Doug Cutting" y) document #5 is equally relevant to both "Chris Hostetter" and "Doug Cutting"

I think the OP is arguing that if the scoring algorithm was modified in the way they suggested, then you would be able to make statements x & y. If they are correct, then I for one can see a definite benefit in that. If for no other reason than in making minimum score thresholds more meaningful.

-Hoss
Re: Why does the StandardTokenizer split hyphenated words?
On Wednesday 15 December 2004 19:29, Mike Snare wrote:
> In my case, the words are keywords that must remain as is, searchable
> with the hyphen in place. It was easy enough to modify the tokenizer
> to do what I need, so I'm not really asking for help there. I'm
> really just curious as to why it is that "a-1" is considered a single
> token, but "a-b" is split.

a-1 is considered a typical product name that needs to be unchanged (there's a comment in the source that mentions this). Indexing "hyphen-word" as two tokens has the advantage that it can then be found with the following queries: hyphen-word (will be turned into a phrase query internally) "hyphen word" (phrase query) (it cannot be found searching for hyphenword, however).

Regards
Daniel

--
http://www.danielnaber.de
Re: TFIDF Implementation
Christoph Kiefer wrote: David, Bruce, Otis, Thank you all for the quick replies. I looked through the BooksLikeThis example. I also agree, it's a very good and effective way to find similar docs in the index. Nevertheless, what I need is really a similarity matrix holding all TF*IDF values. For illustration I wrote a quick-and-dirty class to perform that task. It uses the Jama.Matrix class to represent the similarity matrix at the moment. For show and tell I attached it to this email. Unfortunately it doesn't perform very well. My index stores about 25000 docs with a total of 75000 terms. The similarity matrix is very sparse but nevertheless needs about 1'875'000'000 entries!!! I think this current implementation will not be usable in this way. I also think I'll switch to JMP (http://www.math.uib.no/~bjornoh/mtj/) for that reason. What do you think?

I don't have any deep thoughts, just a few questions/ideas...

[1] TFIDFMatrix, FeatureVectorSimilarityMeasure, and CosineMeasure are your classes, right? They are not in the mail, but presumably the source isn't needed.

[2] Does the problem boil down to this line and the memory usage? double[][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments]; Thus using a sparse matrix would be a win, and so would using floats instead of doubles?

[3] Probably minor, but in getTFIDFMatrix() you might be able to ignore stop words, as you do later in getSimilarity().

[4] You can also consider using Colt, or possibly even JUNG: http://www-itg.lbl.gov/~hoschek/colt/api/cern/colt/matrix/impl/SparseDoubleMatrix2D.html http://jung.sourceforge.net/doc/api/index.html

[5] Related to #2, can you precalc the matrix and store it on disk, or is your index too dynamic?

[6] Also, in similar kinds of calculations I've seen code that filters out low-frequency terms, e.g. ignore all terms that don't occur in at least 5 docs.
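To make point [2] concrete: a dense 75,000 x 25,000 matrix of doubles needs the 1,875,000,000 entries mentioned above, roughly 15 GB at 8 bytes each, while a sparse structure only pays for nonzero entries. A minimal sketch of the idea (illustrative only, not a substitute for a real sparse-matrix library like Colt):

```java
import java.util.HashMap;
import java.util.Map;

public class SparseTfIdfSketch {
    // Stores only nonzero tf*idf values, keyed by (term, doc) packed into a long.
    private final Map<Long, Float> entries = new HashMap<Long, Float>();

    private static long key(int term, int doc) {
        return ((long) term << 32) | (doc & 0xffffffffL);
    }

    public void set(int term, int doc, float value) {
        if (value != 0f) entries.put(key(term, doc), value);
    }

    public float get(int term, int doc) {
        Float v = entries.get(key(term, doc));
        return v == null ? 0f : v.floatValue();
    }

    public int nonZeroCount() {
        return entries.size();
    }
}
```

Using floats instead of doubles (point [2]) halves the per-entry payload, but the real win is that memory scales with the number of nonzero term/doc pairs rather than terms x docs.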
--
Dave

Best,
Christoph

/*
 * Created on Dec 14, 2004
 */
package ch.unizh.ifi.ddis.simpack.measure.featurevectors;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import Jama.Matrix;

/**
 * @author Christoph Kiefer
 */
public class TFIDF_Lucene extends FeatureVectorSimilarityMeasure {

    private File indexDir = null;
    private File dataDir = null;
    private String target = "";
    private String query = "";
    private int targetDocumentNumber = -1;
    private final String ME = this.getClass().getName();
    private int fileCounter = 0;

    public TFIDF_Lucene(String indexDir, String dataDir, String target, String query) {
        this.indexDir = new File(indexDir);
        this.dataDir = new File(dataDir);
        this.target = target;
        this.query = query;
    }

    public String getName() {
        return "TFIDF_Lucene_Similarity_Measure";
    }

    private void makeIndex() {
        try {
            IndexWriter writer = new IndexWriter(indexDir,
                new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS), false);
            indexDirectory(writer, dataDir);
            writer.optimize();
            writer.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    private void indexDirectory(IndexWriter writer, File dir) {
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (f.isDirectory()) {
                indexDirectory(writer, f); // recurse
            } else if (f.getName().endsWith(".txt")) {
                indexFile(writer, f);
            }
        }
    }

    private void indexFile(IndexWriter writer, File f) {
        try {
            System.out.println("Indexing " + f.getName() + ", " + (fileCounter++));
            String name = f.getCanonicalPath();
            //System.out.println(name);
            Document doc = new Document();
            doc.add(Field.Text("contents", new FileReader(f), true));
            writer.addDocument(doc);
Re: A question about scoring function in Lucene
Chuck Williams wrote: I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize. The pure vector space model implements a cosine in the strictly positive sector of the coordinate space. This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., "0.8" means something about the result quality independent of the query). I question whether such scores are more meaningful. Yes, such scores would be guaranteed to be between zero and one, but would 0.8 really be meaningful? I don't think so. Do you have pointers to research which demonstrates this? E.g., when such a scoring method is used, that thresholding by score is useful across queries? Doug
Re: C# Ports
I have created a DLL from the Lucene jars for use in the PDFBox project. It uses IKVM (http://www.ikvm.net) to create a DLL from a jar. The binary version can be found here: http://www.csh.rit.edu/~ben/projects/pdfbox/nightly-release/PDFBox-.NET-0.7.0-dev.zip This includes the ant script used to create the DLL files. This method is by far the easiest way to port it; see previous posts about advantages and disadvantages. Ben On Wed, 15 Dec 2004, Garrett Heaver wrote: > I was just wondering what tools (JLCA?) people are using to port Lucene to > c# as I'd be well interested in converting things like snowball stemmers, > wordnet etc. > > > > Thanks > > Garrett
RE: Indexing a large number of DB records
Note that this really includes some extra steps. You don't need a temp index. Add everything to a single index using a single IndexWriter instance. No need to call addIndexes nor optimize until the end. Adding Documents to an index takes a constant amount of time, regardless of the index size, because new segments are created as documents are added, and existing segments don't need to be updated (except when merges happen). Again, I'd run your app under a profiler to see where the time and memory are going. Otis --- Garrett Heaver <[EMAIL PROTECTED]> wrote: > Hi Homan > > I had a similar problem as you in that I was indexing A LOT of data > > Essentially how I got round it was to batch the index. > > What I was doing was to add 10,000 documents to a temporary index, > use > addIndexes() to merge the temporary index into the live index (which > also > optimizes the live index) then delete the temporary index. On the > next loop > I'd only query rows from the db above the id in the maxdoc of the > live index > and set the max rows of the query to 10,000 > i.e > > SELECT TOP 1 [fields] FROM [tables] WHERE [id_field] > {ID from > Index.MaxDoc()} ORDER BY [id_field] ASC > > Ensuring that the documents go into the index sequentially your > problem is > solved and memory usage on mine (dotlucene 1.3) is low > > Regards > Garrett > > -Original Message- > From: Homam S.A. [mailto:[EMAIL PROTECTED] > Sent: 15 December 2004 02:43 > To: Lucene Users List > Subject: Indexing a large number of DB records > > I'm trying to index a large number of records from the > DB (a few millions). Each record will be stored as a > document with about 30 fields, most of them are > UnStored and represent small strings or numbers. No > huge DB Text fields. > > But I'm running out of memory very fast, and the > indexing is slowing down to a crawl once I hit around > 1500 records.
The problem is each document is holding > references to the string objects returned from > ToString() on the DB field, and the IndexWriter is > holding references to all these document objects in > memory, so the garbage collector isn't getting a chance > to clean these up. > > How do you guys go about indexing a large DB table? > Here's a snippet of my code (this method is called for > each record in the DB): > > private void IndexRow(SqlDataReader rdr, IndexWriter > iw) { > Document doc = new Document(); > for (int i = 0; i < BrowseFieldNames.Length; i++) { > doc.Add(Field.UnStored(BrowseFieldNames[i], > rdr.GetValue(i).ToString())); > } > iw.AddDocument(doc); > }
Why does the StandardTokenizer split hyphenated words?
I am writing a tool that uses Lucene, and I immediately ran into a problem searching for words that contain internal hyphens (dashes). After looking at the StandardTokenizer, I saw that it was because there is no rule that will match such words. Based on what I can tell from the source, every other term in a word containing any of the following (.,/-_) must contain at least one digit. I was wondering if someone could shed some light on why it was deemed necessary to prevent indexing a word like 'word-with-hyphen' without first splitting it into its constituent parts. The only reason I can think of (and the only one I've found) is to handle hyphenated words at line breaks, although my first thought would be that this would be undesired behavior, since a word that was broken due to a line break should actually be reconstructed, and not split. In my case, the words are keywords that must remain as is, searchable with the hyphen in place. It was easy enough to modify the tokenizer to do what I need, so I'm not really asking for help there. I'm really just curious as to why it is that "a-1" is considered a single token, but "a-b" is split. Anyone care to elaborate? Thanks, -Mike
RE: A question about scoring function in Lucene
Nhan,

You are correct that dropping the document norm does cause Lucene's scoring model to deviate from the pure vector space model. However, including norm_d would cause other problems -- e.g., with short queries, as are typical in reality, the resulting scores with norm_d would all be extremely small.

You are also correct that since norm_q is invariant, it does not affect relevance ranking. Norm_q is simply part of the normalization of final scores.

There are many different formulas for scoring and relevance ranking in IR. All of these have some intuitive justification, but in the end can only be evaluated empirically. There is no "correct" formula.

I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize. The pure vector space model implements a cosine in the strictly positive sector of the coordinate space. This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., "0.8" means something about the result quality independent of the query). Lucene does not have this property. Its formula produces scores of arbitrary magnitude depending on the query. The results cannot be compared meaningfully across queries; i.e., "0.8" means nothing intrinsically.

To keep final scores between 0 and 1, Lucene introduces an ad hoc query-dependent final normalization in Hits: viz., it divides all scores by the highest score if the highest score happens to be greater than 1. This makes it impossible for an application to properly inform its users about the quality of the results, to cut off bad results, etc. Applications may do that, but in fact what they are doing is random, not what they think they are doing. I've proposed a fix for this -- there was a long thread on Lucene-dev.
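For reference, the pure vector-space cosine being discussed can be written in standard tf*idf notation as follows (this is the textbook form, not Lucene's exact Similarity formula):

```latex
% Cosine similarity in the strictly positive sector: by the
% Cauchy-Schwarz inequality the score lies between 0 and 1.
\mathrm{sim}(q,d) =
  \frac{\sum_{t} (tf_{q,t}\,idf_t)\,(tf_{d,t}\,idf_t)}
       {\underbrace{\sqrt{\textstyle\sum_{t} (tf_{q,t}\,idf_t)^2}}_{norm_q}\;
        \underbrace{\sqrt{\textstyle\sum_{t} (tf_{d,t}\,idf_t)^2}}_{norm_d}}
```

As the thread describes, Lucene keeps norm_q but replaces the expensive norm_d with a cheaper length-based norm, which is why its raw scores are not intrinsically bounded by 1.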
It is possible to revise Lucene's scoring to keep its efficiency, keep its current per-query relevance ranking, and yet intrinsically normalize its scores so that they are meaningful across queries. I posted a fairly detailed spec of how to do this in the Lucene-dev thread. I'm hoping to have time to build it and submit it as a proposed update to Lucene, but it is a large effort that would involve changing just about every scoring class in Lucene. I'm not sure it would be incorporated even if I did it, as that would take considerable work from a developer. There doesn't seem to be much concern about these various scoring and relevance-ranking issues among the general Lucene community. Chuck > -Original Message- > From: Nhan Nguyen Dang [mailto:[EMAIL PROTECTED] > Sent: Wednesday, December 15, 2004 1:18 AM > To: Lucene Users List > Subject: RE: A question about scoring function in Lucene > > Thanks for your answer. > In the Lucene scoring function, they use only norm_q, > but for one query, norm_q is the same for all > documents. > So norm_q actually does not affect the score. > But norm_d is different: each document has a different > norm_d; it affects the score of document d for query q. > If you drop it, the score information is not correct > anymore, or it is not the vector space model anymore. Could > you explain it a little bit? > > I think that it's expensive to compute in incremental > indexing because when one document is added, the idf of > each term changes. But dropping it is not a good choice. > > What is the role of norm_d_t? > Nhan. > > --- Chuck Williams <[EMAIL PROTECTED]> wrote: > > > Nhan, > > > > Re. your two differences: > > > > 1 is not a difference. Norm_d and Norm_q are both > > independent of t, so summing over t has no effect on > > them. I.e., Norm_d * Norm_q is constant wrt the > > summation, so it doesn't matter if the sum is over > > just the numerator or over the entire fraction; the > > result is the same. > > > > 2 is a difference.
Lucene uses Norm_q instead of > > Norm_d because Norm_d is too expensive to compute, > > especially in the presence of incremental indexing. > > E.g., adding or deleting any document changes the > > idf's, so if Norm_d was used it would have to be > > recomputed for ALL documents. This is not feasible. > > > > Another point you did not mention is that the idf > > term is squared (in both of your formulas). Salton, > > the originator of the vector space model, dropped > > one idf factor from his formula as it improved > > results empirically. More recent theoretical > > justifications of tf*idf provide intuitive > > explanations of why idf should only be included > > linearly. tf is best thought of as the real vector > > entry, while idf is a weighting term on the > > components of the inner product. E.g., see the > > excellent paper by Robertson, "Understanding inverse > > document frequency: on theoretical arguments for > > IDF", available here: > > http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl
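The squared-idf observation above, and the square-root fix mentioned later in this thread, reduce to a little arithmetic. Here is a self-contained illustrative sketch; the idf formula used is the familiar 1 + ln(N/(df+1)) shape, chosen only for illustration and not claimed to be Lucene's exact implementation:

```java
// Illustrative sketch of the idf^2 issue: the score sums terms of the form
// (tf_q * idf) * (tf_d * idf), so idf enters quadratically. Weighting each
// side by sqrt(idf) instead restores a linear idf contribution.
public class IdfSketch {
    public static double idf(int docFreq, int numDocs) {
        // a common idf shape, for illustration only
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    public static double squaredIdfTerm(double tfQ, double tfD, double idf) {
        return (tfQ * idf) * (tfD * idf);   // contributes tf_q * tf_d * idf^2
    }

    public static double linearIdfTerm(double tfQ, double tfD, double idf) {
        double root = Math.sqrt(idf);       // a Similarity taking a final square root
        return (tfQ * root) * (tfD * root); // contributes tf_q * tf_d * idf
    }
}
```

Taking the square root on each side of the inner product is what makes the overall idf contribution linear, matching the theoretical justifications Chuck cites.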
RE: Indexing a large number of DB records
Hi Homam, I had a similar problem to yours in that I was indexing A LOT of data. Essentially how I got round it was to batch the index. What I was doing was to add 10,000 documents to a temporary index, use addIndexes() to merge the temporary index into the live index (which also optimizes the live index), then delete the temporary index. On the next loop I'd only query rows from the db above the id in the maxdoc of the live index, and set the max rows of the query to 10,000, i.e. SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from Index.MaxDoc()} ORDER BY [id_field] ASC. Ensuring that the documents go into the index sequentially, your problem is solved, and memory usage on mine (dotLucene 1.3) is low. Regards Garrett -Original Message- From: Homam S.A. [mailto:[EMAIL PROTECTED] Sent: 15 December 2004 02:43 To: Lucene Users List Subject: Indexing a large number of DB records I'm trying to index a large number of records from the DB (a few millions). Each record will be stored as a document with about 30 fields, most of them are UnStored and represent small strings or numbers. No huge DB Text fields. But I'm running out of memory very fast, and the indexing is slowing down to a crawl once I hit around 1500 records. The problem is each document is holding references to the string objects returned from ToString() on the DB field, and the IndexWriter is holding references to all these document objects in memory, so the garbage collector isn't getting a chance to clean these up. How do you guys go about indexing a large DB table? Here's a snippet of my code (this method is called for each record in the DB): private void IndexRow(SqlDataReader rdr, IndexWriter iw) { Document doc = new Document(); for (int i = 0; i < BrowseFieldNames.Length; i++) { doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString())); } iw.AddDocument(doc); }
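Garrett's keyset-style pagination (query only rows above the last indexed id, capped at the batch size) is the heart of the approach. Below is a runnable sketch of just that control flow, with the database replaced by a sorted list of row ids and the Lucene calls left as comments; all names here are illustrative, not from any library.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batching loop described above. The "table" stands in for the
// database; in the real version each batch would go into a temporary index
// that is merged into the live index with addIndexes() and then deleted.
public class BatchSketch {
    // Stand-in for: SELECT TOP <limit> ... WHERE id > ? ORDER BY id ASC
    public static List<Integer> fetchBatch(List<Integer> table, int afterId, int limit) {
        List<Integer> batch = new ArrayList<Integer>();
        for (int id : table) {
            if (id > afterId && batch.size() < limit) batch.add(id);
        }
        return batch;
    }

    // Walks the whole table batch by batch; returns the number of rows seen.
    public static int indexAll(List<Integer> table, int batchSize) {
        int processed = 0;
        int lastId = Integer.MIN_VALUE;
        while (true) {
            List<Integer> batch = fetchBatch(table, lastId, batchSize);
            if (batch.isEmpty()) break;
            // ... build Documents for this batch, merge the temp index, delete it ...
            processed += batch.size();
            lastId = batch.get(batch.size() - 1); // resume above the last id seen
        }
        return processed;
    }
}
```

Because only one batch of rows is materialized at a time, memory use stays flat no matter how many millions of rows the table holds.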
Re: Indexing a large number of DB records
Hello Homam, The batches I was referring to were batches of DB rows. Instead of SELECT * FROM table... do SELECT * FROM table ... LIMIT Y OFFSET X. Don't close the IndexWriter - use a single instance. There is no MakeStable()-like method in Lucene, but you can control the number of in-memory Documents, the frequency of segment merges, and the maximal size of index segments with 3 IndexWriter parameters, described fairly verbosely in the javadocs. Since you are using the .Net version, you should really consult the dotLucene guy(s). Running under the profiler should also tell you where the time and memory go. Otis --- "Homam S.A." <[EMAIL PROTECTED]> wrote: > Thanks Otis! > > What do you mean by building it in batches? Does it > mean I should close the IndexWriter every 1000 rows > and reopen it? Does that release references to the > document objects so that they can be > garbage-collected? > > I'm calling optimize() only at the end. > > I agree that 1500 documents is very small. I'm > building the index on a PC with 512 megs, and the > indexing process is quickly gobbling up around 400 > megs when I index around 1800 documents and the whole > machine is grinding to a virtual halt. I'm using the > latest DotLucene .NET port, so maybe there's a memory > leak in it. > > I have experience with AltaVista search (acquired by > FastSearch), and I used to call MakeStable() every > 20,000 documents to flush memory structures to disk. > There doesn't seem to be an equivalent in Lucene. > > -- Homam > > > > > > > --- Otis Gospodnetic <[EMAIL PROTECTED]> > wrote: > > > Hello, > > > > There are a few things you can do: > > > > 1) Don't just pull all rows from the DB at once. Do > > that in batches.
> > > > 2) If you can get a Reader from your SqlDataReader, > > consider this: > > > http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader) > > > > 3) Give the JVM more memory to play with by using > > -Xms and -Xmx JVM > > parameters > > > > 4) See IndexWriter's minMergeDocs parameter. > > > > 5) Are you calling optimize() at some point by any > > chance? Leave that > > call for the end. > > > > 1500 documents with 30 columns of short > > String/number values is not a > > lot. You may be doing something else not Lucene > > related that's slowing > > things down. > > > > Otis > > > > > > --- "Homam S.A." <[EMAIL PROTECTED]> wrote: > > > > > I'm trying to index a large number of records from > > the > > > DB (a few millions). Each record will be stored as > > a > > > document with about 30 fields, most of them are > > > UnStored and represent small strings or numbers. > > No > > > huge DB Text fields. > > > > > > But I'm running out of memory very fast, and the > > > indexing is slowing down to a crawl once I hit > > around > > > 1500 records. The problem is each document is > > holding > > > references to the string objects returned from > > > ToString() on the DB field, and the IndexWriter is > > > holding references to all these document objects > > in > > > memory, so the garbage collector isn't getting a > > chance > > > to clean these up. > > > > > > How do you guys go about indexing a large DB > > table? > > > Here's a snippet of my code (this method is called > > for > > > each record in the DB): > > > > > > private void IndexRow(SqlDataReader rdr, > > IndexWriter > > > iw) { > > > Document doc = new Document(); > > > for (int i = 0; i < BrowseFieldNames.Length; i++) > > { > > > doc.Add(Field.UnStored(BrowseFieldNames[i], > > > rdr.GetValue(i).ToString())); > > > } > > > iw.AddDocument(doc); > > > }
RE: C# Ports
Hi Garrett, If you are referring to dotLucene (http://sourceforge.net/projects/dotlucene/) then I can tell you how -- not too long ago I posted on this list how I ported 1.4 and 1.4.3 to C#; please search the list for the answer -- you can't just use JLCA. As for the snowball, I have already started work on it. The port is done, but I have to test, etc., and I am too tied up right now with my work. However, I plan to release it before the end of this month, so if you can wait, do wait; otherwise feel free to take the steps that I did to port Lucene to C#. Regards, -- George Aroush -Original Message- From: Garrett Heaver [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 15, 2004 5:58 AM To: [EMAIL PROTECTED] Subject: C# Ports I was just wondering what tools (JLCA?) people are using to port Lucene to C# as I'd be well interested in converting things like snowball stemmers, wordnet etc. Thanks Garrett
C# Ports
I was just wondering what tools (JLCA?) people are using to port Lucene to C# as I'd be well interested in converting things like snowball stemmers, wordnet etc. Thanks Garrett
RE: A question about scoring function in Lucene
Thanks for your answer. In the Lucene scoring function, they use only norm_q, but for one query, norm_q is the same for all documents, so norm_q actually does not affect the score. But norm_d is different: each document has a different norm_d; it affects the score of document d for query q. If you drop it, the score information is not correct anymore, or it is not the vector space model anymore. Could you explain it a little bit? I think that it's expensive to compute in incremental indexing because when one document is added, the idf of each term changes. But dropping it is not a good choice. What is the role of norm_d_t? Nhan. --- Chuck Williams <[EMAIL PROTECTED]> wrote: > Nhan, > > Re. your two differences: > > 1 is not a difference. Norm_d and Norm_q are both > independent of t, so summing over t has no effect on > them. I.e., Norm_d * Norm_q is constant wrt the > summation, so it doesn't matter if the sum is over > just the numerator or over the entire fraction, the > result is the same. > > 2 is a difference. Lucene uses Norm_q instead of > Norm_d because Norm_d is too expensive to compute, > especially in the presence of incremental indexing. > E.g., adding or deleting any document changes the > idf's, so if Norm_d was used it would have to be > recomputed for ALL documents. This is not feasible. > > Another point you did not mention is that the idf > term is squared (in both of your formulas). Salton, > the originator of the vector space model, dropped > one idf factor from his formula as it improved > results empirically. More recent theoretical > justifications of tf*idf provide intuitive > explanations of why idf should only be included > linearly. tf is best thought of as the real vector > entry, while idf is a weighting term on the > components of the inner product.
E.g., see the > excellent paper by Robertson, "Understanding inverse > document frequency: on theoretical arguments for > IDF", available here: > http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl > if you sign up for an eval. > > It's easy to correct for idf^2 by using a custom > Similarity that takes a final square root. > > Chuck > > > -Original Message- > > From: Vikas Gupta [mailto:[EMAIL PROTECTED] > > Sent: Tuesday, December 14, 2004 9:32 PM > > To: Lucene Users List > > Subject: Re: A question about scoring function > in Lucene > > > > Lucene uses the vector space model. To > understand that: > > > > -Read section 2.1 of "Space optimizations for > Total Ranking" paper > > (Linked > here > http://lucene.sourceforge.net/publications.html) > > -Read section 6 to 6.4 of > > > http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf > > -Read section 1 of > > > http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps > > > > Vikas > > > > On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote: > > > > > Hi all, > > > Lucene scores a document based on the correlation > between > > > the query q and document d: > > > (this is the raw function; I don't pay attention > to the > > > boost_t, coord_q_d factor) > > > > > > score_d = sum_t( tf_q * idf_t / norm_q * tf_d > * idf_t > > > / norm_d_t) (*) > > > > > > Could anybody explain it in detail? Or are > there any > > > papers, documents about this function? > Because: > > > > > > I have also read the book: Modern Information > > > Retrieval, author: Ricardo Baeza-Yates and > Berthier > > > Ribeiro-Neto, Addison Wesley (Hope you have > read it > > > too).
> > > On page 27, they also suggest a scoring function for the vector model based on the correlation between query q and document d as follows (I use different symbols): score_d(d, q) = sum_t( weight_t_d * weight_t_q ) / ( norm_d * norm_q ) (**) where weight_t_d = tf_d * idf_t, weight_t_q = tf_q * idf_t, norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) ), norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) ). Expanding (**): score_d(d, q) = sum_t( tf_q*idf_t * tf_d*idf_t ) / ( norm_d * norm_q ) (***) The two functions, (*) and (***), have 2 differences: 1. In (***), the sum_t is just for the numerator, but in (*), the sum_t is over everything. So, with norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is calculated twice. Is this right? Please explain. 2. There is no factor for the norm of the document, norm_d, in function (*). Can you explain this? What is the role of the factor norm_d_t? One more question: could anybody give me documents or papers that explain this function in detail, so when I apply Lucene to my system, I can adapt the document and the field so that I still receive
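The textbook formula (***) quoted in this thread is just a cosine of tf*idf-weighted vectors, and it is short enough to state as runnable code. Here is a plain-Java sketch over dense term vectors; it illustrates the textbook model only, not Lucene's actual implementation, which differs as discussed above.

```java
// score(d, q) = sum_t(tf_q*idf_t * tf_d*idf_t) / (norm_d * norm_q), where
// norm_d = sqrt(sum_t((tf_d*idf_t)^2)) and norm_q is defined the same way.
// This is the cosine of two weighted vectors, so for non-negative tf's the
// result always lies in [0, 1] -- the cross-query comparability property.
public class CosineSketch {
    public static double score(double[] tfQ, double[] tfD, double[] idf) {
        double dot = 0.0, sumQ = 0.0, sumD = 0.0;
        for (int t = 0; t < idf.length; t++) {
            double wq = tfQ[t] * idf[t]; // weight_t_q
            double wd = tfD[t] * idf[t]; // weight_t_d
            dot += wq * wd;
            sumQ += wq * wq;
            sumD += wd * wd;
        }
        return dot / (Math.sqrt(sumD) * Math.sqrt(sumQ));
    }
}
```

A document with the same weighted direction as the query scores 1.0; a document sharing no terms with the query scores 0.0, regardless of which query produced it.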
Re: LUCENE1.4.1 - LUCENE1.4.2 - LUCENE1.4.3 Exception
This is an OS file system error, not a Lucene issue (not for this board). Google it for Gentoo specifically and you get a whole bunch of results, one of which is this thread on the Gentoo Forums: http://forums.gentoo.org/viewtopic.php?t=9620 Good Luck Nader Henein Karthik N S wrote: Hi Guys, could somebody tell me what this exception is that I am getting, please? System specifications: O/S: Linux Gentoo; Appserver: Apache Tomcat/4.1.24; Jdk: build 1.4.2_03-b02; Lucene: 1.4.1, 2, 3. Note: This exception is displayed on every 2nd query after Tomcat is started. java.io.IOException: Stale NFS file handle at java.io.RandomAccessFile.readBytes(Native Method) at java.io.RandomAccessFile.read(RandomAccessFile.java:307) at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:420) at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61) at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:220) at org.apache.lucene.store.InputStream.refill(InputStream.java:158) at org.apache.lucene.store.InputStream.readByte(InputStream.java:43) at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83) at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:142) at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115) at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:143) at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:137) at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:253) at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:69) at org.apache.lucene.search.Similarity.idf(Similarity.java:255) at org.apache.lucene.search.TermQuery$TermWeight.sumOfSquaredWeights(TermQuery.
java:47) at org.apache.lucene.search.Query.weight(Query.java:86) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85) at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:251) WITH WARM REGARDS HAVE A NICE DAY [ N.S.KARTHIK]
Re: LUCENE1.4.1 - LUCENE1.4.2 - LUCENE1.4.3 Exception
Karthik N S wrote: java.io.IOException: Stale NFS file handle You have a file system NFS mounted on this machine, but the machine hosting the real file system has no knowledge of your mount. This often happens after the host machine has had a reboot. Solution: unmount (and possibly re-mount) the failing NFS file system. If you're not sure which one it is, try looking at a file on each NFS file system with, say, "cat" or "wc" and see if you get a stale NFS handle error. You may need "umount -f" to unmount the failing file system. Sometimes, very occasionally, you'll have to resort to a reboot. jch
Re: TFIDF Implementation
David, Bruce, Otis, Thank you all for the quick replies. I looked through the BooksLikeThis example. I also agree, it's a very good and effective way to find similar docs in the index. Nevertheless, what I need is really a similarity matrix holding all TF*IDF values. For illustration I wrote a quick-and-dirty class to perform that task. It uses the Jama.Matrix class to represent the similarity matrix at the moment. For show and tell I attached it to this email. Unfortunately it doesn't perform very well. My index stores about 25000 docs with a total of 75000 terms. The similarity matrix is very sparse but nevertheless needs about 1'875'000'000 entries!!! I think this current implementation will not be usable in this way. I also think I'll switch to JMP (http://www.math.uib.no/~bjornoh/mtj/) for that reason. What do you think? Best, Christoph -- Christoph Kiefer Department of Informatics, University of Zurich Office: Uni Irchel 27-K-32 Phone: +41 (0) 44 / 635 67 26 Email: [EMAIL PROTECTED] Web: http://www.ifi.unizh.ch/ddis/christophkiefer.0.html

/*
 * Created on Dec 14, 2004
 */
package ch.unizh.ifi.ddis.simpack.measure.featurevectors;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import Jama.Matrix;

/**
 * @author Christoph Kiefer
 */
public class TFIDF_Lucene extends FeatureVectorSimilarityMeasure {

    private File indexDir = null;
    private File dataDir = null;
    private String target = "";
    private String query = "";
    private int targetDocumentNumber = -1;
    private final String ME = this.getClass().getName();
    private int fileCounter = 0;

    public TFIDF_Lucene(String indexDir, String dataDir, String target, String query) {
        this.indexDir = new File(indexDir);
        this.dataDir = new File(dataDir);
        this.target = target;
        this.query = query;
    }

    public String getName() {
        return "TFIDF_Lucene_Similarity_Measure";
    }

    private void makeIndex() {
        try {
            IndexWriter writer = new IndexWriter(indexDir,
                new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS), false);
            indexDirectory(writer, dataDir);
            writer.optimize();
            writer.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    private void indexDirectory(IndexWriter writer, File dir) {
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            File f = files[i];
            if (f.isDirectory()) {
                indexDirectory(writer, f); // recurse
            } else if (f.getName().endsWith(".txt")) {
                indexFile(writer, f);
            }
        }
    }

    private void indexFile(IndexWriter writer, File f) {
        try {
            System.out.println("Indexing " + f.getName() + ", " + (fileCounter++));
            String name = f.getCanonicalPath();
            Document doc = new Document();
            doc.add(Field.Text("contents", new FileReader(f), true));
            writer.addDocument(doc);
            if (name.matches(dataDir + "/" + target + ".txt")) {
                targetDocumentNumber = writer.docCount() - 1; // doc numbers are 0-based
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    public Matrix getTFIDFMatrix(File indexDir) throws IOException {
        Directory fsDir = FSDirectory.getDirectory(indexDir, false);
        IndexReader reader = IndexReader.open(fsDir);
        int numberOfTerms = 0;
        int numberOfDocuments = reader.numDocs();

        TermEnum allTerms = reader.terms();
        while (allTerms.next()) {
            numberOfTerms++;
        }
        System.out.println("Total number of terms in index is " + numberOfTerms);
        System.out.println("Total number of documents in index is " + numberOfDocuments);

        // Java zero-initializes the array, so no explicit clearing loop is needed.
        double[][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments];

        // First pass: raw term frequencies.
        allTerms = reader.terms();
        for (int i = 0; allTerms.next(); i++) {
            Term term = allTerms.term();
            TermDocs td = reader.termDocs(term);
            while (td.next()) {
                TFIDFMatrix[i][td.doc()] = td.freq();
            }
        }

        // Second pass: multiply by idf = log10(N / docFreq)
        // (the division by 2.30258509299405 = ln(10) converts ln to log10).
        allTerms = reader.terms();
        for (int i = 0; allTerms.next(); i++) {
            for (int j = 0; j < numberOfDocuments; j++) {
                double tf = TFIDFMatrix[i][j];
                double docFreq = (double) allTerms.docFreq();
                double idf = Math.log((double) numberOfDocuments / docFreq) / 2.30258509299405;
                TFIDFMatrix[i][j] = tf * idf;
            }
        }

        // (The original message was truncated here; a minimal completion:)
        reader.close();
        return new Matrix(TFIDFMatrix);
    }
}
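Since the matrix is very sparse, one cheap alternative to a dense Jama matrix (short of adopting a sparse library like the one linked above) is to store only the nonzero cells in a hash map keyed by the (term, doc) pair. A minimal, illustrative sketch in plain Java; the class and method names here are made up for illustration, not from any library:

```java
import java.util.HashMap;
import java.util.Map;

// Sparse term/document matrix: memory grows with the number of nonzero
// tf*idf entries instead of numberOfTerms * numberOfDocuments.
public class SparseTfIdfMatrix {
    private final Map<Long, Double> cells = new HashMap<Long, Double>();

    // Pack the (term, doc) pair into a single long key.
    private static long key(int term, int doc) {
        return ((long) term << 32) | (doc & 0xffffffffL);
    }

    public void set(int term, int doc, double value) {
        if (value != 0.0) {
            cells.put(key(term, doc), value);
        } else {
            cells.remove(key(term, doc)); // never store explicit zeros
        }
    }

    public double get(int term, int doc) {
        Double v = cells.get(key(term, doc));
        return v == null ? 0.0 : v.doubleValue();
    }

    public int nonZeroCount() {
        return cells.size();
    }
}
```

For 25000 docs and 75000 terms the dense matrix needs 1'875'000'000 cells, while this structure only pays for the entries that are actually nonzero.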