Re: Query Tuning
On Monday 21 February 2005 19:59, Runde, Kevin wrote:

Hi All, How does Lucene handle multi-term queries? Does it use short circuiting? So if a user entered: (a OR b) AND c But my program knew testing for c is cheaper than testing for (a OR b) and I rewrote the query as: c AND (a OR b) Would the query run faster?

Exchanging the operands of AND would not make a noticeable difference in speed. Queries are evaluated by iterating the inverted term index entries for all query terms in parallel, with buffering.

Regards, Paul Elschot
Re: Query Tuning
On Monday 21 February 2005 20:43, Todd VanderVeen wrote:

Runde, Kevin wrote: Hi All, How does Lucene handle multi-term queries? Does it use short circuiting? So if a user entered: (a OR b) AND c But my program knew testing for c is cheaper than testing for (a OR b) and I rewrote the query as: c AND (a OR b) Would the query run faster? Sorry if this has already been answered, but for some reason the Archive search is not working for me today. Thanks, Kevin

Not sure about what is in CVS, but look at BooleanQuery.scorer().

It's in svn nowadays.

If all of the clauses of the BooleanQuery are required and none of the clauses are BooleanQueries, a ConjunctionScorer is returned that offers the optimizations you seek. In the example you gave, there is a clause that is boolean (a OR b) that will have to be evaluated independently with a boolean scorer. This will be performed regardless of the ordering. (BooleanScorer doesn't preserve document order when it returns results and hence it can't utilize the optimal algorithm provided by ConjunctionScorer.) Others have been down this path, as evidenced by the sigh in the javadoc.

In the svn version a ConjunctionScorer is used for all top level AND queries.

If calculating (a or b) is expensive and the docFreq of a is much less than the union of a and b, you might consider rewriting it to (a and c) or (b and c) by distributing the AND over the OR. Expansion like this isn't always beneficial and can't be applied blindly. As far as I can tell there is no query planning/optimization aside from the merging of related clauses and attempts to rewrite to simpler queries.

In the svn version the subquery (a or b) is only evaluated for documents matching c. In the current version the expansion to (a and c) or (b and c) might help: the tradeoff is between evaluating c twice and having less work for the OR operator.

One optimization in the current version is the use of ConjunctionScorer for some cases. One such case, which happens a lot in practice, is a query that has a few required terms. Another optimization in the current version is that some scoring is done ahead for each clause into an unordered buffer. This helps for top level OR queries, but loses for OR queries that are subqueries of AND. The svn version does not score ahead. It relies on the buffering done by TermScorer. Perhaps the buffering for a TermScorer should be made dependent on its expected use: more buffering for top level OR, less buffering when used under AND.

Regards, Paul Elschot
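A minimal sketch of the expansion discussed above, using the 1.4-era BooleanQuery.add(Query, required, prohibited) API. The contents field and the single terms a, b, c are illustrative, and since the two forms score differently the rewrite is a tradeoff rather than a drop-in replacement:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Builds (a AND c) OR (b AND c) as a replacement for (a OR b) AND c.
    Query a = new TermQuery(new Term("contents", "a"));
    Query b = new TermQuery(new Term("contents", "b"));
    Query c = new TermQuery(new Term("contents", "c"));

    BooleanQuery aAndC = new BooleanQuery();
    aAndC.add(a, true, false);   // required
    aAndC.add(c, true, false);   // required

    BooleanQuery bAndC = new BooleanQuery();
    bAndC.add(b, true, false);
    bAndC.add(c, true, false);

    BooleanQuery expanded = new BooleanQuery();
    expanded.add(aAndC, false, false);   // optional clause
    expanded.add(bAndC, false, false);   // optional clause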
Re: Optional Terms in a single query
On Monday 21 February 2005 23:23, Luke Shannon wrote:

Hi; I'm trying to create a query that looks for a field containing type:181 and name doesn't contain tim, bill or harry. type: 181 -(name: tim name:bill name:harry) +(type: 181) +((-name: tim -name:bill -name:harry +oldfaith:stillHere))

stillHere is normally lowercased before searching. Is that ok?

+(type: 181) +((-name: tim OR bill OR harry +oldfaith:stillHere)) +(type: 181) +((-name:*(tim bill harry)* +olfaithfull:stillhere))

typo? olfaithfull

+(type:1 81) +((-name:*(tim OR bill OR harry)* +olfaithfull:stillhere))

typo? (type:1 81)

I would really like to do this all in one Query. Is this even possible?

How would you want to combine the results?

Regards, Paul Elschot
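For the and-not combination itself, prohibited clauses can sit next to a required clause in a single BooleanQuery built with the 1.4 API. A sketch, with the field names taken from the thread:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("type", "181")), true, false);  // required
    q.add(new TermQuery(new Term("name", "tim")), false, true);  // prohibited
    q.add(new TermQuery(new Term("name", "bill")), false, true);
    q.add(new TermQuery(new Term("name", "harry")), false, true);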
Re: Lucene in the Humanities
Erik,

On Saturday 19 February 2005 01:33, Erik Hatcher wrote: On Feb 18, 2005, at 6:37 PM, Paul Elschot wrote: On Friday 18 February 2005 21:55, Erik Hatcher wrote: On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote: Erik, Just curious: it would seem easier to use multiple fields for the original case and lowercase searching. Is there any particular reason you analyzed the documents to multiple indexes instead of multiple fields?

I considered that approach, however to expose QueryParser I'd have to get tricky. If I have title_orig and title_lc fields, how would I allow freeform queries of title:something?

By lowercasing the querytext and searching in title_lc?

Well sure, but how about this query: title:Something AND anotherField:someOtherValue QueryParser, as-is, won't be able to do field-name swapping. I could certainly apply that technique on all the structured queries that I build up with the API, but with QueryParser it is trickier. I'm definitely open for suggestions on improving how case is handled. The only drawback now is that I'm duplicating indexes, but that is only an issue in how long it takes to rebuild the index from scratch (currently about 20 minutes or so on a good day - when the machine isn't swamped).

Overriding this (1.4.3 QueryParser.jj, line 286) might work:

    protected Query getFieldQuery(String field, String queryText)
        throws ParseException { ... }

It will be called by the parser for both parts of the query above, so one could change the field depending on the requested type of search and the field name in the query.

Once the users get the hang of this, you might end up having to quadruple the index, or more.

Regards, Paul Elschot
Re: Lucene in the Humanities
On Saturday 19 February 2005 11:02, Erik Hatcher wrote: On Feb 19, 2005, at 3:52 AM, Paul Elschot wrote: By lowercasing the querytext and searching in title_lc? Well sure, but how about this query: title:Something AND anotherField:someOtherValue QueryParser, as-is, won't be able to do field-name swapping. I could certainly apply that technique on all the structured queries that I build up with the API, but with QueryParser it is trickier. I'm definitely open for suggestions on improving how case is handled. Overriding this (1.4.3 QueryParser.jj, line 286) might work: protected Query getFieldQuery(String field, String queryText) throws ParseException { ... } It will be called by the parser for both parts of the query above, so one could change the field depending on the requested type of search and the field name in the query.

But that wouldn't work for any other type of query: title:somethingFuzzy~

To get that it would be necessary to override all query parser methods that take a field argument.

Though now that I think more about it, a simple s/title:/title_orig:/ before parsing would work, and of course make the default field dynamic. I need to evaluate how many fields would need to be done this way - it'd be several. Thanks for the food for thought!

In the overriding getFieldQuery method something like:

    if (caseSensitiveSearch(field) && originalFieldIndexed(field)) {
        field = field + "_orig";
    } else {
        // the other 3 cases
        ...
    }
    return super.getFieldQuery(field, queryText);

The if statement could be factored out for the other overriding methods.

The only drawback now is that I'm duplicating indexes, but that is only an issue in how long it takes to rebuild the index from scratch (currently about 20 minutes or so on a good day - when the machine isn't swamped). Once the users get the hang of this, you might end up having to quadruple the index, or more. Why would that be? They want a case sensitive/insensitive switch. How would it expand beyond that?

With an index for every combination of fields and case sensitivity for these fields.

Regards, Paul Elschot
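Pulled together, the override might look like the following sketch against the 1.4.3 API. The _orig/_lc field-naming convention follows the thread; the constructor flag standing in for the hypothetical caseSensitiveSearch(field) test is an assumption:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class CaseSwitchingQueryParser extends QueryParser {
        private final boolean caseSensitive;  // the user's sensitivity switch

        public CaseSwitchingQueryParser(String defaultField, Analyzer analyzer,
                                        boolean caseSensitive) {
            super(defaultField, analyzer);
            this.caseSensitive = caseSensitive;
        }

        protected Query getFieldQuery(String field, String queryText)
                throws ParseException {
            // Map the user-visible field name onto the indexed variant.
            if (caseSensitive) {
                field = field + "_orig";       // original-case field
            } else {
                field = field + "_lc";         // lowercased field
                queryText = queryText.toLowerCase();
            }
            return super.getFieldQuery(field, queryText);
        }
    }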
RE: Concurrent searching re-indexing
Ok, I will change my reindex method to delete all documents and then re-add them all, rather than using an IndexWriter to write a completely new index. Thanks for the help on this everyone. Paul

-----Original Message----- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: 17 February 2005 22:26 To: Lucene Users List Subject: Re: Concurrent searching re-indexing

Paul Mellor wrote: I've read from various sources on the Internet that it is perfectly safe to simultaneously search a Lucene index that is being updated from another Thread, as long as all write access to the index is synchronized. But does this apply only to updating the index (i.e. deleting and adding documents), or to a complete re-indexing (i.e. create a new IndexWriter with the 'create' argument true and then re-add all the documents)? [...] java.io.IOException: couldn't delete _a.f1 at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166) [...] This is running on Windows 2000.

On Windows one cannot delete a file while it is still open. So, no, on Windows one cannot remove an index entirely while an IndexReader or Searcher is still open on it, since it is simply impossible to remove all the files in the index. We might attempt to patch this by keeping a list of such files and attempt to delete them later (as is done when updating an index). But this could cause problems, as a new index will eventually try to use these same file names again, and it would then conflict with the open IndexReader. This is not a problem when updating an existing index, since filenames (except for a few which are not kept open, like segments) are never reused in the lifetime of an index. So, in order for such a fix to work we would need to switch to globally unique segment names, e.g., long random strings, rather than increasing integers. In the meantime, the safe way to rebuild an index from scratch while other processes are reading it is simply to delete all of its documents, then start adding new ones.

Doug
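A sketch of that rebuild approach with the 1.4 API; the document source and the analyzer are placeholders:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    public class Reindexer {
        // Deletes every document, then re-adds; segment file names keep
        // increasing, so nothing held open by a reader is ever reused.
        public static void rebuild(Directory dir, Iterator newDocs)
                throws IOException {
            IndexReader reader = IndexReader.open(dir);
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (!reader.isDeleted(i)) {
                    reader.delete(i);   // 1.4 name; later deleteDocument(i)
                }
            }
            reader.close();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
            while (newDocs.hasNext()) {
                writer.addDocument((Document) newDocs.next());
            }
            writer.optimize();
            writer.close();
        }
    }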
Re: Lucene in the Humanities
Erik, Just curious: it would seem easier to use multiple fields for the original case and lowercase searching. Is there any particular reason you analyzed the documents to multiple indexes instead of multiple fields? Regards, Paul Elschot
Re: Lucene in the Humanities
On Friday 18 February 2005 21:55, Erik Hatcher wrote: On Feb 18, 2005, at 3:47 PM, Paul Elschot wrote: Erik, Just curious: it would seem easier to use multiple fields for the original case and lowercase searching. Is there any particular reason you analyzed the documents to multiple indexes instead of multiple fields?

I considered that approach, however to expose QueryParser I'd have to get tricky. If I have title_orig and title_lc fields, how would I allow freeform queries of title:something?

By lowercasing the querytext and searching in title_lc?

Regards, Paul Elschot.
RE: Concurrent searching re-indexing
Otis,

Looking at your reply again, I have a couple of questions - IndexSearcher (IndexReader, really) does take a snapshot of the index state when it is opened, so at that time the index segments listed in segments should be in a complete state. It also reads index files when searching, of course.

1. If IndexReader takes a snapshot of the index state when opened and then reads the files when searching, what would happen if the files it takes a snapshot of are deleted before the search is performed (as would happen with a reindexing in the period between opening an IndexSearcher and using it to search)?

2. Does a similar potential problem exist when optimising an index, if this combines all the segments into a single file?

Many thanks, Paul

-----Original Message----- From: Paul Mellor [mailto:[EMAIL PROTECTED] Sent: 16 February 2005 17:37 To: 'Lucene Users List' Subject: RE: Concurrent searching re-indexing

But all write access to the index is synchronized, so that although multiple threads are creating an IndexWriter for the same directory and using it to totally recreate that index, only one thread is doing this at once. I was concerned about the safety of using an IndexSearcher to perform queries on an index that is in the process of being recreated from scratch, but I guess that if the IndexSearcher takes a snapshot of the index when it is created (and in my code this creation is synchronized with the write operations as well, so that the threads wait for the write operations to finish before instantiating an IndexSearcher, and vice versa) this can't be a problem.

-----Original Message----- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: 16 February 2005 17:30 To: Lucene Users List Subject: Re: Concurrent searching re-indexing

Hi Paul, If I understand your setup correctly, it looks like you are running multiple threads that create IndexWriter for the same directory. That's a no-no. This section (first hit) describes all various concurrency issues with regards to adds, updates, optimization, and searches: http://www.lucenebook.com/search?query=concurrent IndexSearcher (IndexReader, really) does take a snapshot of the index state when it is opened, so at that time the index segments listed in segments should be in a complete state. It also reads index files when searching, of course. Otis

--- Paul Mellor [EMAIL PROTECTED] wrote: Hi, I've read from various sources on the Internet that it is perfectly safe to simultaneously search a Lucene index that is being updated from another Thread, as long as all write access to the index is synchronized. But does this apply only to updating the index (i.e. deleting and adding documents), or to a complete re-indexing (i.e. create a new IndexWriter with the 'create' argument true and then re-add all the documents)? I have a class which encapsulates all access to my index, so that writes can be synchronized. This class also exposes a method to obtain an IndexSearcher for the index. I'm running unit tests to test this which create many threads - each thread does a complete re-indexing and then obtains an IndexSearcher and does a query. I'm finding that with sufficiently high numbers of threads, I'm getting the occasional failure, with the following exception thrown when attempting to construct a new IndexWriter (during the reindexing) -

    java.io.IOException: couldn't delete _a.f1
        at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
        at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:151)
        ...

The exception occurs quite infrequently (usually for somewhere between 1-5% of the Threads). Does the IndexSearcher take a 'snapshot' of the index at creation? Or does it access the filesystem whilst searching? I am also synchronizing creation of the IndexSearcher with the write lock, so that the IndexSearcher is not created whilst the index is being recreated (and vice versa). But do I need to ensure that the IndexSearcher cannot search whilst the index is being recreated as well? Note that a similar unit test where the threads update the index (rather than recreate it from scratch) works fine, as expected. This is running on Windows 2000. Any help would be much appreciated! Paul
Re: Multiple Keywords/Keyphrases fields
On Wednesday 16 February 2005 06:49, Owen Densmore wrote: From: Erik Hatcher Date: February 12, 2005 3:09:15 PM MST To: Lucene Users List lucene-user@jakarta.apache.org Subject: Re: Multiple Keywords/Keyphrases fields

The real question to answer is what types of queries you're planning on making. Rather than look at it from indexing forward, consider it from searching backwards. How will users query using those keyword phrases?

Hi Erik. Good point. There are two uses we are making of the keyphrases:

- Graphical Navigation: A Flash graphical browser will allow users to fly around in a space of documents, choosing what to be viewing: Authors, Keyphrases and Textual terms. In any of these cases, the closeness of any of the fields will govern how close they will appear graphically. In the case of authors, we will weight collaboration .. how often the authors work together. In the case of Keyphrases, we will want to use something like distance vectors like you show in the book using the cosine measure. Thus the keyphrases need to be separate entities within the document .. it would be a bug for us if the terms leaked across the separate keyphrases within the document.

- Textual Search: In this case, we will have two ways to search the keyphrases. The first would be like the graphical navigation above where searching for complex system should require the terms to be in a single keyphrase. The second way will be looser, where we may simply pool the keyphrases with titles and abstract, and allow them all to be searched together within the document.

Does this make sense? So the question from the search standpoint is: do multiple instances of a field act like there are barriers across the instances, or are they somehow treated as a single instance?

Multiple field instances with the same name in a document are concatenated in the index in the order in which they were added to the document. For each instance of a field in the document, even when it has the same name, the analyzer is asked to provide a new tokenstream. This happens in org.apache.lucene.index.DocumentWriter.invertDocument(). The last position offset in the field as indexed is maintained for this purpose.

In terms of the closeness calculation, for example, can we get separate term vectors for each instance of the keyphrase field, or will we get a single vector combining all the keyphrase terms within a single document?

The positions in the TermVectors are treated in the same way. To put a barrier between field instances with the same name one can put a gap in the indexed term positions. This gap needs a larger query proximity to match. AND like queries will match in the indexed field. A gap is implemented by providing a tokenstream from the analyzer that has a position increment that equals the gap for the first token in the stream. For the first field instance with the same name the gap is not needed. A sketch of such an analyzer is given after this message.

Regards, Paul Elschot

I hope this is clear! Kinda hard to articulate. Owen

Erik

On Feb 12, 2005, at 3:08 PM, Owen Densmore wrote: I'm getting a bit more serious about the final form of our lucene index. Each document has DocNumber, Authors, Title, Abstract, and Keywords. By Keywords, I mean a comma separated list, each entry having possibly many terms in a phrase like: temporal infomax, finite state automata, Markov chains, conditional entropy, neural information processing. I presume I should be using a field Keywords which has many entries or instances per document (one per comma separated phrase). But I'm not sure the right way to handle all this. My assumption is that I should analyze them individually, just as we do for free text (the Abstract, for example), thus in the example above having 5 entries of the nature doc.add(Field.Text("Keywords", "finite state automata")); etc, analyzing them because these are author-supplied strings with no canonical form. For guidance, I looked in the archive and found the attached email, but I didn't see the answer. (I'm not concerned about the dups, I presume that is equivalent to a boost of some sort.) Does this seem right? Thanks once again. Owen

From: [EMAIL PROTECTED] Subject: Multiple equal Fields? Date: Tue, 17 Feb 2004 12:47:58 +0100

Hi! What happens if I do this: doc.add(Field.Text("foo", "bar")); doc.add(Field.Text("foo", "blah")); Is there a field foo with value blah or are there two foos (actually not possible) or is there one foo with the values bar and blah? And what does happen in this case: doc.add(Field.Text("foo", "bar")); doc.add(Field.Text("foo", "bar")); doc.add(Field.Text("foo", "bar")); Does lucene store this only once? Timo
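To make the gap idea concrete, here is a minimal sketch of a wrapping analyzer against the 1.4 token API. The delegate analyzer and the gap size are supplied by the caller; applying the gap to the very first field instance too (which the analyzer cannot distinguish) merely shifts all positions by the same amount and is harmless:

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // Pushes the first token of every field instance 'gap' positions
    // further, so phrase and span queries do not match across the
    // boundaries between keyphrase instances.
    public class PositionGapAnalyzer extends Analyzer {
        private final Analyzer delegate;
        private final int gap;

        public PositionGapAnalyzer(Analyzer delegate, int gap) {
            this.delegate = delegate;
            this.gap = gap;
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            final TokenStream in = delegate.tokenStream(fieldName, reader);
            return new TokenStream() {
                private boolean first = true;
                public Token next() throws IOException {
                    Token t = in.next();
                    if (t != null && first) {
                        t.setPositionIncrement(t.getPositionIncrement() + gap);
                        first = false;
                    }
                    return t;
                }
                public void close() throws IOException {
                    in.close();
                }
            };
        }
    }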
RE: Concurrent searching re-indexing
But all write access to the index is synchronized, so that although multiple threads are creating an IndexWriter for the same directory and using it to totally recreate that index, only one thread is doing this at once. I was concerned about the safety of using an IndexSearcher to perform queries on an index that is in the process of being recreated from scratch, but I guess that if the IndexSearcher takes a snapshot of the index when it is created (and in my code this creation is synchronized with the write operations as well, so that the threads wait for the write operations to finish before instantiating an IndexSearcher, and vice versa) this can't be a problem.

-----Original Message----- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: 16 February 2005 17:30 To: Lucene Users List Subject: Re: Concurrent searching re-indexing

Hi Paul, If I understand your setup correctly, it looks like you are running multiple threads that create IndexWriter for the same directory. That's a no-no. This section (first hit) describes all various concurrency issues with regards to adds, updates, optimization, and searches: http://www.lucenebook.com/search?query=concurrent IndexSearcher (IndexReader, really) does take a snapshot of the index state when it is opened, so at that time the index segments listed in segments should be in a complete state. It also reads index files when searching, of course. Otis

--- Paul Mellor [EMAIL PROTECTED] wrote: Hi, I've read from various sources on the Internet that it is perfectly safe to simultaneously search a Lucene index that is being updated from another Thread, as long as all write access to the index is synchronized. But does this apply only to updating the index (i.e. deleting and adding documents), or to a complete re-indexing (i.e. create a new IndexWriter with the 'create' argument true and then re-add all the documents)? I have a class which encapsulates all access to my index, so that writes can be synchronized. This class also exposes a method to obtain an IndexSearcher for the index. I'm running unit tests to test this which create many threads - each thread does a complete re-indexing and then obtains an IndexSearcher and does a query. I'm finding that with sufficiently high numbers of threads, I'm getting the occasional failure, with the following exception thrown when attempting to construct a new IndexWriter (during the reindexing) -

    java.io.IOException: couldn't delete _a.f1
        at org.apache.lucene.store.FSDirectory.create(FSDirectory.java:166)
        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:135)
        at org.apache.lucene.store.FSDirectory.getDirectory(FSDirectory.java:113)
        at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:151)
        ...

The exception occurs quite infrequently (usually for somewhere between 1-5% of the Threads). Does the IndexSearcher take a 'snapshot' of the index at creation? Or does it access the filesystem whilst searching? I am also synchronizing creation of the IndexSearcher with the write lock, so that the IndexSearcher is not created whilst the index is being recreated (and vice versa). But do I need to ensure that the IndexSearcher cannot search whilst the index is being recreated as well? Note that a similar unit test where the threads update the index (rather than recreate it from scratch) works fine, as expected. This is running on Windows 2000. Any help would be much appreciated! Paul
Re: Newbie questions
Hi again,

So is SqlDirectory recommended for use in a cluster to work around the accessibility problem, or are people using NFS or a standalone server instead?

Thanks in advance, PJ

--- Paul Jans [EMAIL PROTECTED] wrote: I've already ordered Lucene in Action :) There is a LuceneRAR project that is still in its infancy here: https://lucenerar.dev.java.net/ I will keep an eye on that for sure. You can also store a Lucene index in Berkeley DB (look at the /contrib/db area of the source code repository) We're already using Oracle, so would it be possible to store the index there, thus giving each cluster node easy access to it. I read about SqlDirectory in the archives but it looks like it didn't make it to the API and I don't see it on the contrib page. I'm more concerned about making the index accessible rather than transactional consistency, so NFS may be another option like you mention. I'm curious to hear about other systems which are clustered and how others are doing this; lessons learnt and best practices etc. Thanks again for the help. Lucene looks like a first class tool. PJ

--- Erik Hatcher [EMAIL PROTECTED] wrote: On Feb 10, 2005, at 5:00 PM, Paul Jans wrote: A couple of newbie questions. I've searched the archives and read the Javadoc but I'm still having trouble figuring these out. Don't forget to get your copy of Lucene in Action too :)

1. What's the best way to index and handle queries like the following: Find me all users with (a CS degree and a GPA > 3.0) or (a Math degree and a GPA > 3.5).

Some suggestions: index degree as a Keyword field. Pad GPA, so that all of them are of the form #.# (or #.## maybe). Numerics need to be lexicographically ordered, and thus padded. With the right analyzer (see the AnalysisParalysis page on the wiki) you could use this type of query with QueryParser: degree:cs AND gpa:[3.0 TO 9.9]

2. What are the best practices for using Lucene in a clustered J2EE environment? A standalone index/search server or storing the index in the database or something else?

There is a LuceneRAR project that is still in its infancy here: https://lucenerar.dev.java.net/ You can also store a Lucene index in Berkeley DB (look at the /contrib/db area of the source code repository) However, most projects do fine with cruder techniques such as sharing the Lucene index on a common drive and ensuring that locking is configured to use the common drive also. Erik
Re: chained restrictive queries
On Monday 14 February 2005 15:14, [EMAIL PROTECTED] wrote:

Hi, I'm currently working on an application using Lucene 1.3, and have to improve the current indexing/search methods with the 1.4.3 version. I was thinking to use the FilteredQuery object to refine my chained queries but, after some tests, performance is worse :(. The chained queries were like: a first boolean query to retrieve a set of doc ids matching some criteria...

A FilteredQuery works best when the filter from the criteria can be reused, e.g. by keeping it in a cache, possibly with CachingWrapperFilter.

...and a second query applying a fuzzy criterion to refine it more deeply. My index contains like 7 million documents at all, and the first query should retrieve, at maximum, like 50,000 documents. I'm currently working with crossed indexes while doing searches, but I want to remove the extra indexes and do all things with only one. So, is it possible to use the FilteredQuery object or another one to chain queries from the most restrictive to the most open one?

It is possible, but whether it helps performance depends on your circumstances. The 1.4.3 filter implementation executes the most open query almost completely. It only applies the filter after the score computations for the query being filtered, just before deciding whether to keep the document in the query results. This is done in IndexSearcher.search(). A profiler might tell you whether that is a bottleneck for your queries. If it is, there is some code in development that might help. In case it turns out that the memory occupied by the BitSet of the filter is a bottleneck, please check the (very) recent archives of lucene-dev on BitSet implementation.

Regards, Paul Elschot
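A sketch of the reuse idea with the 1.4 API; the first-stage filter, the fuzzy query and the searcher are assumed to exist already:

    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.FilteredQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.Query;

    // Cache the expensive first-stage filter bits per IndexReader,
    // then refine with the fuzzy query.
    Filter cached = new CachingWrapperFilter(firstStageFilter);
    Query refined = new FilteredQuery(fuzzyQuery, cached);
    Hits hits = searcher.search(refined);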
Re: Newbie questions
I've already ordered Lucene in Action :)

There is a LuceneRAR project that is still in its infancy here: https://lucenerar.dev.java.net/

I will keep an eye on that for sure.

You can also store a Lucene index in Berkeley DB (look at the /contrib/db area of the source code repository)

We're already using Oracle, so would it be possible to store the index there, thus giving each cluster node easy access to it. I read about SqlDirectory in the archives but it looks like it didn't make it to the API and I don't see it on the contrib page. I'm more concerned about making the index accessible rather than transactional consistency, so NFS may be another option like you mention. I'm curious to hear about other systems which are clustered and how others are doing this; lessons learnt and best practices etc. Thanks again for the help. Lucene looks like a first class tool. PJ

--- Erik Hatcher [EMAIL PROTECTED] wrote: On Feb 10, 2005, at 5:00 PM, Paul Jans wrote: A couple of newbie questions. I've searched the archives and read the Javadoc but I'm still having trouble figuring these out. Don't forget to get your copy of Lucene in Action too :)

1. What's the best way to index and handle queries like the following: Find me all users with (a CS degree and a GPA > 3.0) or (a Math degree and a GPA > 3.5).

Some suggestions: index degree as a Keyword field. Pad GPA, so that all of them are of the form #.# (or #.## maybe). Numerics need to be lexicographically ordered, and thus padded. With the right analyzer (see the AnalysisParalysis page on the wiki) you could use this type of query with QueryParser: degree:cs AND gpa:[3.0 TO 9.9]

2. What are the best practices for using Lucene in a clustered J2EE environment? A standalone index/search server or storing the index in the database or something else?

There is a LuceneRAR project that is still in its infancy here: https://lucenerar.dev.java.net/ You can also store a Lucene index in Berkeley DB (look at the /contrib/db area of the source code repository) However, most projects do fine with cruder techniques such as sharing the Lucene index on a common drive and ensuring that locking is configured to use the common drive also. Erik
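The padding advice, as a small index-time sketch; the field names follow the mail, and the two-decimal format and US locale are assumptions:

    import java.text.DecimalFormat;
    import java.text.DecimalFormatSymbols;
    import java.util.Locale;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Pad GPA to a fixed-width form so lexicographic term order matches
    // numeric order, e.g. for gpa:[3.00 TO 9.99] range queries.
    DecimalFormat gpaFormat =
        new DecimalFormat("0.00", new DecimalFormatSymbols(Locale.US));
    Document doc = new Document();
    doc.add(Field.Keyword("degree", "cs"));
    doc.add(Field.Keyword("gpa", gpaFormat.format(3.5)));  // "3.50"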
Re: Problem searching Field.Keyword field
On Thursday 10 February 2005 18:44, Luke Shannon wrote:

Are there any issues with having a bunch of boolean queries and then adding them to one big boolean query (making them all required)?

The 1.4.3 and earlier BooleanScorer has an out of bounds exception for "More than 32 required/prohibited clauses in query." In the development version this restriction has gone. The limitation of the maximum clause count (default 1024, configurable) is still there.

Regards, Paul Elschot
Newbie questions
Hi, A couple of newbie questions. I've searched the archives and read the Javadoc but I'm still having trouble figuring these out.

1. What's the best way to index and handle queries like the following: Find me all users with (a CS degree and a GPA > 3.0) or (a Math degree and a GPA > 3.5).

2. What are the best practices for using Lucene in a clustered J2EE environment? A standalone index/search server or storing the index in the database or something else?

Thank you in advance, PJ
Re: Searching for doc without a field
On Friday 04 February 2005 17:29, Bill Tschumy wrote: On Feb 4, 2005, at 10:19 AM, Bill Tschumy wrote: On Feb 3, 2005, at 2:04 PM, Paul Elschot wrote: On Thursday 03 February 2005 20:18, Bill Tschumy wrote:

Is there any way to construct a query to locate all documents without a specific field? By this I mean the Document was created without ever having that field added to it.

One way is to add an extra document field containing the field names of all (other) indexed fields in the document. Assuming there is always a primary key field, the query is then: +fieldnames:primarykeyfield -fieldnames:specificfield Regards, Paul Elschot

Paul, Thanks for the suggestion, but I need to do this on an existing database as it is. It just occurred to me that I should try a query on the field with a value of NULL. Don't know if that will work or not.

Nope, using null as a search value just results in a NullPointerException.

It's not impossible, but the problem is that the term index is first sorted by field name, then by term text, then by document number, and then by term position within document. That means that the index path is no good to query for field name and document number: you would have to check all indexed terms in between. Lucene only allows finding the existence of an indexed field, the indexed terms (field name + term text) in sorted order from a given term, and the indexed documents of a term, possibly combined with the term positions within each document. The solution above shortcuts the index path by putting the field name in place of the term text for a special field.

Regards, Paul Elschot.
Re: Rewrite causes BooleanQuery to loose required terms
On Thursday 03 February 2005 11:38, Nick Burch wrote:

Hi All, I'm using lucene from CVS, and I've discovered that rewriting a BooleanQuery created with the old style (Query,boolean,boolean) method will cause the required parameters to get lost.

Using old style (Query,boolean,boolean): query = +contents:test* +(class:1.2 class:1.2.*) rewritten query = (contents:tester contents:testing contents:tests) (class:1.2 (class:1.2.3 class:1.2.4))

Using new style (Query,BooleanClause.Occur.MUST): query = +contents:test* +(class:1.2 class:1.2.*) rewritten query = +(contents:tester contents:testing contents:tests) +(class:1.2 (class:1.2.3 class:1.2.4))

Attached is a simple RAMDirectory test to show this. I know that the (Query,boolean,boolean) method is deprecated, but should it also be broken?

No. Currently, the old constructor for BooleanClause does not carry the old state forward. The new constructor does carry the new state backward. I'll post a fix in bugzilla later.

Thanks, Paul Elschot.
Re: Searching for doc without a field
On Thursday 03 February 2005 20:18, Bill Tschumy wrote: Is there any way to construct a query to locate all documents without a specific field? By this I mean the Document was created without ever having that field added to it.

One way is to add an extra document field containing the field names of all (other) indexed fields in the document. Assuming there is always a primary key field, the query is then: +fieldnames:primarykeyfield -fieldnames:specificfield

Regards, Paul Elschot
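The index-time half of this trick, as a sketch; field names other than fieldnames are illustrative, and an analyzer that splits the list on whitespace is assumed:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    doc.add(Field.Keyword("primarykeyfield", "doc-0042"));
    doc.add(Field.Text("title", "An example title"));
    // Record the names of all (other) indexed fields:
    doc.add(Field.UnStored("fieldnames", "primarykeyfield title"));

    // Later, to find documents lacking a "specificfield":
    //   +fieldnames:primarykeyfield -fieldnames:specificfield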
Re: Compile lucene
Helen,

On Wednesday 02 February 2005 20:26, Helen Butler wrote: Hi, I'm trying to compile Lucene but am encountering the following error on typing ant from the root of Lucene-1.4.3:

    C:\lucene-1.4.3>ant
    Buildfile: build.xml
    init:
    compile-core:
    BUILD FAILED
    C:\lucene-1.4.3\build.xml:140: srcdir C:\lucene-1.4.3\src\java does not exist!

It seems the java source files were not extracted. How did you obtain the build.xml file?

Once the compilation works, you'll notice that the lucene jar being built has a 1.5 version number because of an incorrect version number in the 1.4.3 build.xml. You need to correct the version property in the build.xml file:

    <property name="version" value="1.4.3"/>

Regards, Paul Elschot.
Re: Compile lucene
Helen,

I downloaded lucene-1.4.3.zip myself from one of the mirrors (http://apache.essentkabel.com/jakarta/lucene/binaries/). It contains the lucene demo's, and not the java sources. The lucene-1.4.3.tar.gz there has the same problem. It seems something is wrong with the 1.4.3 distribution. When you need the lucene 1.4.3 jar you can download it from the above mirror; it looks ok to me.

In case you have done something like this before: the following command (on a single line) will check out the source files from cvs into directory lucene-1.4.3 (make sure that directory is empty beforehand):

    cvs -d :pserver:[EMAIL PROTECTED]:/home/cvspublic checkout -r lucene_1_4_3 -d lucene-1.4.3 jakarta_lucene

In there you can correct the build.xml file and do:

    ant compile

to compile the source code.

Regards, Paul Elschot

On Wednesday 02 February 2005 20:55, Helen Butler wrote: Hi Paul, Thanks for your quick response. The build.xml was obtained from the Lucene-1.4.3.zip that I downloaded from the apache website. I changed the version in the xml file as you suggested, however the error persists. Kind Regards, Helen Butler

-----Original Message----- From: Paul Elschot [EMAIL PROTECTED] To: lucene-user@jakarta.apache.org Date: Wed, 2 Feb 2005 20:39:01 +0100 Subject: Re: Compile lucene

Helen, On Wednesday 02 February 2005 20:26, Helen Butler wrote: Hi, I'm trying to compile Lucene but am encountering the following error on typing ant from the root of Lucene-1.4.3: C:\lucene-1.4.3>ant Buildfile: build.xml init: compile-core: BUILD FAILED C:\lucene-1.4.3\build.xml:140: srcdir C:\lucene-1.4.3\src\java does not exist! It seems the java source files were not extracted. How did you obtain the build.xml file? Once the compilation works, you'll notice that the lucene jar being built has a 1.5 version number because of an incorrect version number in the 1.4.3 build.xml. You need to correct the version property in the build.xml file: <property name="version" value="1.4.3"/> Regards, Paul Elschot.
Re: Subversion conversion
On Wednesday 02 February 2005 21:20, Erik Hatcher wrote: The conversion to Subversion is complete. The new repository is available to users read-only at: http://svn.apache.org/repos/asf/lucene/java/trunk

Great. I just checked out the trunk: Checked out revision 151042. So much for the few minutes instead of hours,

Paul Elschot.
Re: Penalty for storing unrelated field?
On Friday 28 January 2005 22:30, Andy Goodell wrote: You should be fine.

For search performance, yes. But the extra field data does slow down optimization of a modified index, because all the field (and index) data is read and written for that. When the extra data gets bulky, it's normally better to store it in the file system or in a database.

On Fri, 28 Jan 2005 15:21:50 -0600, Bill Tschumy [EMAIL PROTECTED] wrote: I just want to make sure that adding the unrelated field to a single doc won't cause all the other documents to increase their storage space.

I have lots of fields that only occur in one document, but it doesn't faze lucene. Actually when choosing an indexing solution, we chose lucene mostly because of its ability to index and store unlimited kinds of metadata. - andy g
Re: Suggestions for documentation or LIA
On Wednesday 26 January 2005 18:40, Ian Soboroff wrote: jian chen [EMAIL PROTECTED] writes: Just to continue this discussion. I think right now Lucene's retrieval algorithm is based purely on the Vector Space Model, which is simple and efficient.

As I understand it, it's indeed a tf-idf vector space approach, except that the queries are structured and as such, the tf-idf weights are totaled as a straight cosine among siblings of a BooleanQuery, but other query nodes may do things differently; for example, I haven't read it but I assume PhraseQueries require all terms present and adjacent to contribute to the score. There is also a document-specific boost factor in the equation which is essentially a hook for document things like recency, PageRank, etc. You can tweak this by defining custom Similarity classes which can say what the tf, idf, norm, and boost mean. You can also affect the term normalization at the query end in BooleanScorer (I think? through the sumOfSquares method?).

We've implemented something kind of like the Similarity class but based on a model which describes a larger family of similarity functions. (For the curious or similarly IR-geeky, it's from Justin Zobel's paper from a few years ago in SIGIR Forum.) Essentially I need more general hooks than the Lucene Similarity provides. I think those hooks might exist, but I'm not sure I know which classes they're in. I'm also interested in things like relevance feedback which can affect term weights as well as adding terms to the query... just how many places in the code do I have to subclass or change?

None. Create your own TermQuery instances, set their boosts, and add them to a BooleanQuery. A sketch is given after this message.

It's clear that if I'm interested in a completely different model like language modeling the IndexReader is the way to go. In which case, what parts of the Lucene class structure should I adapt to maintain the incremental-results-return, inverted list skips, and other features which make the inverted search fast?

To keep the speed, the one thing you should keep is the performance of TermQuery. In case you're interested in changing proximity scores, the same holds for SpanTermQuery. For a variation on TermQuery that scores query terms by their density in a document field you can have a look here: http://issues.apache.org/bugzilla/show_bug.cgi?id=31784 On top of these you can implement your own Scorers, but for Zobel's similarities you probably won't need much more than what BooleanQuery provides. To use the inverted list skips, make sure to implement and use skipTo() on your scorers. In case you need larger queries in conjunctive normal form: +(synA1 synA2 ...) +(synB1 synB2 ...) +(synC1 synC2 ...) the development version of BooleanQuery might be a bit faster than the current one. For an interesting twist in the use of idf please search for "fuzzy scoring changes" on lucene-dev at the end of 2004.

Regards, Paul Elschot
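The term-reweighting sketch referred to above, with the 1.4 API; the field name and the boost values are illustrative:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    // Relevance-feedback style reweighting without subclassing:
    // per-term boosts on TermQuery clauses.
    BooleanQuery expanded = new BooleanQuery();
    TermQuery t1 = new TermQuery(new Term("contents", "entropy"));
    t1.setBoost(2.3f);                  // weight from feedback
    expanded.add(t1, false, false);     // optional clause
    TermQuery t2 = new TermQuery(new Term("contents", "automata"));
    t2.setBoost(0.7f);
    expanded.add(t2, false, false);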
Re: Filtering w/ Multiple Terms
Jerry,

On Monday 24 January 2005 18:26, Jerry Jalenak wrote: I spent some time reading the Lucene in Action book this weekend (great job, btw), and came across the section on using custom filters. Since the data that I need to use to filter my hit set with comes from a database, I thought it would be worth my effort this morning to write a custom filter that would handle the filtering for me. So, using the example from the book (page 210), I've coded an AccountFilter:

    public class AccountFilter extends Filter {
        public AccountFilter() {}

        public BitSet bits(IndexReader indexReader) throws IOException {
            System.out.println("Entering AccountFilter...");
            BitSet bitSet = new BitSet(indexReader.maxDoc());
            String[] reportingAccounts = new String[] {"0011", "4kfs"};
            int[] docs = new int[1];
            int[] freqs = new int[1];
            for (int i = 0; i < reportingAccounts.length; i++) {
                String reportingAccount = reportingAccounts[i];
                if (reportingAccount != null) {
                    TermDocs termDocs =
                        indexReader.termDocs(new Term("account", reportingAccount));
                    int count = termDocs.read(docs, freqs);
                    if (count == 1) {
                        System.out.println("Setting bit on");
                        bitSet.set(docs[0]);
                    }
                }
            }
            System.out.println("Leaving AccountFilter...");
            return bitSet;
        }
    }

Unless account is a primary key field, it's better to loop over the termDocs than to read only a single entry at the if (count == 1); see the sketch after this message.

I see where the AccountFilter is setting the corresponding 'bits', but I end up without any 'hits':

    Entering AccountFilter...
    Entering AccountFilter...
    Entering AccountFilter...
    Setting bit on
    Setting bit on
    Setting bit on
    Setting bit on
    Setting bit on
    Leaving AccountFilter...
    Leaving AccountFilter...
    Leaving AccountFilter...

I don't see any recursion in your code, but this output suggests nesting three deep. Something does not add up here.

    ... Found 0 matching documents in 1000 ms

Can anyone tell me what I've done wrong?

Maybe all query hits were filtered out? Could you compare the docnrs in the bits of the filter with the unfiltered query hits docnrs?

Regards, Paul Elschot
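The loop Paul suggests might look like this, a sketch with the 1.4 TermDocs API replacing the single read at if (count == 1):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Set a bit for every document containing the account term.
    TermDocs termDocs =
        indexReader.termDocs(new Term("account", reportingAccount));
    try {
        while (termDocs.next()) {
            bitSet.set(termDocs.doc());
        }
    } finally {
        termDocs.close();
    }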
Re: Opening up one large index takes 940M or memory?
On Saturday 22 January 2005 01:39, Kevin A. Burton wrote: Kevin A. Burton wrote: We have one large index right now... its about 60G... When I open it the Java VM used 940M of memory. The VM does nothing else besides open this index. After thinking about it I guess 1.5% of memory per index really isn't THAT bad. What would be nice is if there was a way to do this from disk and then use a buffer (either via the filesystem or in-vm memory) to access these variables.

It's even documented. From http://jakarta.apache.org/lucene/docs/fileformats.html : "The term info index, or .tii file. This contains every IndexInterval'th entry from the .tis file, along with its location in the tis file. This is designed to be read entirely into memory and used to provide random access to the tis file." My guess is that this is what you see happening. To see the actual .tii file, you need the non-default file format. Once searching starts you'll also see that the field norms are loaded; these take one byte per searched field per document.

This would be similar to the way the MySQL index cache works...

It would be possible to add another level of indexing to the terms. No one has done this yet, so I guess it's preferred to buy RAM instead...

Regards, Paul Elschot
Re: Document 'Context' Relation to each other
You wouldn't even need the sequence number. You'll certainly be adding the documents to the index in the proper sequence already (right?). It is easy to random access documents if you know Lucene's document ids. Here's the pseudo-code:

- construct an IndexReader
- open an IndexSearcher using the IndexReader
- search, getting Hits back
- for a hit you want to see the context of, get hits.id(hit#)
- subtract the context size from the id, grab documents using reader.document(id)

You don't search for a document by id, but rather jump right to it with IndexReader.

Perfect, that's exactly what I was after! It's going to be easier than I thought.

Thanks, Paul
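That pseudo-code, roughly, in 1.4-era Java. The index path, the message field name and the assumption that insertion order matches log order all come from the thread:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;

    IndexReader reader = IndexReader.open("/path/to/index");
    IndexSearcher searcher = new IndexSearcher(reader);
    Hits hits = searcher.search(query);

    int id = hits.id(0);  // internal doc id of the first hit
    int from = Math.max(0, id - contextSize);
    int to = Math.min(reader.maxDoc() - 1, id + contextSize);
    for (int i = from; i <= to; i++) {
        Document d = reader.document(i);
        System.out.println(d.get("message"));  // hypothetical field name
    }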
Document 'Context' Relation to each other
As a log4j developer, I've been toying with the idea of what Lucene could do for me, maybe as an excuse to play around with Lucene. I've started creating a LoggingEvent-Document converter, and thinking through how I'd like this utility to work when I came across a question I wasn't sure about.

When scanning/searching through logging events, one is usually looking for a particular matching event, which Lucene does excellently, but what a person usually needs is also the context of that matching logging event around it. With grep, one can use the -C <contextSize> argument to provide X # of lines around the matching entry. I'd like to be able to do the same thing with Lucene.

Now, I could provide a Field to the LoggingEvent Document that has a sequence #, and once a user has chosen an appropriate matching event, do another search for the documents with a Sequence # between +/- the context size. My question is, is that going to be an efficient way to do this? The sequence # would be treated as text, wouldn't it? Would the range search on an int be the most efficient way to do this?

I know from the Hits documentation that one can retrieve the Document ID of a matching entry. What is the contract on this Document ID? Is each Document added to the Index given an increasing number? Can one search an index by Document ID? Could one search for Document ID's between a range? (Hope you can see where I'm going here.)

If you have any other recommendations about context searching I would appreciate any thoughts. Many thanks for an excellent API, and kudos to Erik & Otis for a great eBook btw.

regards, Paul Smith
Re: Span Query Performance
On Thursday 06 January 2005 02:17, Andrew Cunningham wrote: Hi all, I'm currently doing a query similar to the following:

    for w in wordset:
        query = w near (word1 V word2 V word3 ... V word1422); perform query

and I am doing this through SpanQuery.getSpans(), iterating through the spans and counting the matches, which can result in 4782282 matches (essentially I am only after the match count). The query works but the performance can be somewhat slow; so I am wondering:

a) Would the query potentially run faster if I used Searcher.search(query) with a custom similarity, or do both methods essentially use the same mechanics?

It would be somewhat slower, because it loops over the getSpans() and computes document scores and constructs a Hits from the scores.

b) Does using a RAMDirectory improve query performance any significant amount?

That depends on your operating system, the size of the index, the amount of RAM you can use, the file buffering efficiency, other loads on the computer...

c) Is there a faster method to what I am doing I should consider?

Preindexing all word combinations that you're interested in.

Regards, Paul Elschot
Re: Span Query Performance
Sorry for the duplicate on lucene-dev, it should have gone to lucene-user directly. A bit more:

On Thursday 06 January 2005 10:22, Paul Elschot wrote: On Thursday 06 January 2005 02:17, Andrew Cunningham wrote: Hi all, I'm currently doing a query similar to the following: for w in wordset: query = w near (word1 V word2 V word3 ... V word1422); perform query. I am doing this through SpanQuery.getSpans(), iterating through the spans and counting the matches, which can result in 4782282 matches (essentially I am only after the match count). The query works but the performance can be somewhat slow; so I am wondering: ... c) Is there a faster method to what I am doing I should consider?

Preindexing all word combinations that you're interested in.

In case you know all the words in advance, you could also index a helper word at the same position as each of those words. This requires a custom analyzer that inserts the helper word in the token stream with a zero position increment. The query then simplifies to:

    query = w near helperword

which would probably speed things up significantly.

Regards, Paul Elschot
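A sketch of such a token filter with the 1.4 Token API; the word set and the helper token text are assumptions:

    import java.io.IOException;
    import java.util.Set;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // Inserts "helperword" at the same position as any token from the
    // known word set, using a zero position increment.
    public class HelperWordFilter extends TokenStream {
        private final TokenStream in;
        private final Set wordSet;     // the known words, supplied by caller
        private Token pending;         // helper token waiting to be emitted

        public HelperWordFilter(TokenStream in, Set wordSet) {
            this.in = in;
            this.wordSet = wordSet;
        }

        public Token next() throws IOException {
            if (pending != null) {
                Token t = pending;
                pending = null;
                return t;
            }
            Token t = in.next();
            if (t != null && wordSet.contains(t.termText())) {
                pending = new Token("helperword", t.startOffset(), t.endOffset());
                pending.setPositionIncrement(0);  // same position as t
            }
            return t;
        }

        public void close() throws IOException {
            in.close();
        }
    }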
Re: Deleting index for DB indexing
Alternative: create a hashed value which is unique within your DB (e.g. use md5). Afterwards you can delete documents from the index with IndexReader.delete(Term). Without that additional field you can use the IndexSearcher to retrieve your documents from the index and then use IndexReader.delete(docNum) to delete these documents.

Paul

On Thu, 30 Dec 2004 07:18:39 -0800 (PST), mahaveer jain [EMAIL PROTECTED] wrote: Hi All, I am using lucene for my DB indexing. I have 2 columns which are Keyword. Now I want to delete my index based on these 2 keywords. Is it possible? If no, what is the other alternative? Thanks, Mahaveer
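For instance, a sketch with the 1.4 API; the uid field holding the hash and the index path are assumptions:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    // Delete by a unique key term, e.g. an md5 hash of the DB row.
    IndexReader reader = IndexReader.open("/path/to/index");
    int deleted = reader.delete(new Term("uid", md5OfRow));  // 1.4 API name
    reader.close();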
QueryParser, default operator
Hi, the following code:

    QueryParser qp = new QueryParser("itemContent", analyzer);
    qp.setOperator(org.apache.lucene.queryParser.QueryParser.DEFAULT_OPERATOR_AND);
    Query query = qp.parse(line, "itemContent", analyzer);

doesn't produce the expected result, because a query foo bar results in: itemContent:foo itemContent:bar whereas foo AND bar results in: +itemContent:foo +itemContent:bar If I understand the default operator correctly, then the first query should have been expanded to the same as the latter one, shouldn't it?

thanks a lot! Paul

P.S. I sent the mail yesterday as well, but I didn't see it in the mailinglist, I hope it doesn't appear twice now.
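A possible explanation, worth verifying against the 1.4 API: parse(String, String, Analyzer) is a static convenience method that builds its own parser internally, so the instance's setOperator(...) call would not apply to it. The one-argument instance method does use the instance's settings; a sketch:

    QueryParser qp = new QueryParser("itemContent", analyzer);
    qp.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
    Query query = qp.parse(line);  // instance method honors the default operator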
Re: document boost not showing up in Explanation
On Tuesday 28 December 2004 08:37, Erik Hatcher wrote: On Dec 27, 2004, at 9:54 PM, Vikas Gupta wrote: I am using lucene-1.4.1.jar (with nutch). For some reason, the effect of document boost is not showing up in the search results. Also, why is it not a part of the Explanation?

It actually is part of it.

Below is the 'explanation' of a sample query solar. I don't see the boost value (1.5514448) being used at all in the calculation of the document score - from the 'explanation' below and also from the quality of the search. How can I see the effect of document boost?

Document boost is not stored in the index as-is. A single normalization factor is stored per field and is computed at indexing time using field and document boosts, as well as the length normalization factor (and perhaps other factors I'm forgetting at the moment?).

This also means that the explanation can only show a field normalisation factor as it is available from the index. One reason that boosting does not necessarily show up in the quality of the search is that the byte encoding allows only 256 different values to be stored. The value stored in the index (called the norm) is the product of the document boost factor, the field boost factor and the lengthNorm() of the field. For the search results to actually change because of the boost factors, it is necessary that this stored factor changes to another one of the 256 possible values. The range of possible values stored in the index is roughly from 7x10^9 to 2x10^-9. See: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#setBoost(float) and http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#encodeNorm(float)

The range of stored values (excluding the zero special case) is about 7x10^9 / 2x10^-9 = 3.5x10^18. The 10 log of that is about 18.5. Per factor 10 there are about 255/18.5 = 13.8 encoded values, so one encoding step is a factor of about 10^(1/13.8), roughly 1.18. A boost factor should change by at least that much to be able to change a document score. Since the default lengthNorm() is an inverse square root of the field length, a field length should change by at least the square of that (roughly a factor 1.4) to change the document score (assuming no hits in the changed field text).

Finally, a change in document score only influences the document ordering in the search results when another document has a score that is within the range of the change.

Regards, Paul Elschot.
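The quantization is easy to observe directly; a sketch using the public Similarity.encodeNorm entry point from 1.4 (the particular values are illustrative):

    import org.apache.lucene.search.Similarity;

    // Nearby boosts can encode to the same norm byte, in which case
    // they cannot change any score.
    byte b1 = Similarity.encodeNorm(1.00f);
    byte b2 = Similarity.encodeNorm(1.10f);
    System.out.println(b1 == b2);  // true when 1.10 falls in the same step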
Re: Word co-occurrences counts
On Thursday 23 December 2004 07:50, [EMAIL PROTECTED] wrote: Hi all, I have a curious problem, and initial poking around with Lucene looks like it may only be able to half-handle the problem. The problem requires two abilities: 1. To be able to return the number of times the word appears in all the documents (which it looks like lucene can do through IndexReader) 2. To be able to return the number of word co-occurrences within the document set (ie. How many times does computer appear within 50 words of dog) Is the second point possible? You can use the standard query parser with a query like this: "dog computer"~50 This query is not completely symmetric in the distance computation: when computer occurs before dog, the allowed distance is 49, iirc. There is also a SpanNearQuery for more generalized and flexible distance queries, but this is not supported by the query parser, so you'll have to construct these queries in your own program code. In case you have non standard retrieval requirements, eg. you only need the number of hits and no further information from the matching documents, you may consider using your own HitCollector on the lower level search methods. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
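A sketch of the equivalent SpanNearQuery built in program code (the field name "contents" is an assumption):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    SpanQuery dog = new SpanTermQuery(new Term("contents", "dog"));
    SpanQuery computer = new SpanTermQuery(new Term("contents", "computer"));
    // slop 50, inOrder false: matches both word orders with a symmetric distance
    SpanQuery near = new SpanNearQuery(new SpanQuery[] {dog, computer}, 50, false);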
Re: Relevance percentage
On Thursday 23 December 2004 08:13, Gururaja H wrote: Hi Chuck Williams, Thanks much for the reply. If your queries are all BooleanQuery's of TermQuery's, then this is very simple. Iterate down the list of BooleanClause's and count the number whose score is 0, then divide this by the total number of clauses. Take a look at BooleanQuery.BooleanWeight.explain() as it does this (along with generating the rest of the explanation). If you support the full Lucene query language, then you need to look at all the query types and decide what exactly you want to compute (as coord is not always well-defined). We are supporting full Lucene query language. My request is, assuming queries are all BooleanQuery please post the implementation source code for the same. ie to calculate the coord() method input parameters overlap and maxOverlap. I don't have the code, but I can give an overview of possible steps: First inherit from BooleanScorer to implement a score() method that returns only the coord() value (preferably a precomputed one). Then inherit from BooleanQuery.BooleanWeight to return the above Scorer. Then inherit from BooleanQuery to use the above Weight in createWeight(). Then inherit from QueryParser to use the above Query in getBooleanQuery(). Finally use such a query in a search: the document scores will be the coord() values. Regards, Paul Elschot. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: index size doubled?
On Tuesday 21 December 2004 05:49, aurora wrote: I'm testing the rebuilding of the index. I add several hundred documents, optimize and add another few hundred and so on. Right now I have around 7000 files. I observed this after the index gets to a certain size. Every time after optimize, there are two files of roughly the same size like below:
    12/20/2004 01:57p 13 deletable
    12/20/2004 01:57p 29 segments
    12/20/2004 01:53p 14,460,367 _5qf.cfs
    12/20/2004 01:57p 15,069,013 _5zr.cfs
The total index size is double what I expect. This is not always reproducible. (I'm constantly tuning my program and the set of documents). Sometimes I get a single file after optimize. What was happening? Lucene tried to delete the older version (_5qf.cfs above), but got an error back from the file system. After that it has put the name of that segment in the deletable file, so it can try later to delete that segment. This is known behaviour on FAT file systems. These randomly take some time for themselves to finish closing a file after it has been correctly closed by a program. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: MergerIndex + Searchables
Karthik, On Tuesday 21 December 2004 09:04, Karthik N S wrote: Hi Guys Apologies... I have several MERGERINDEXES [ MGR1,MGR2,MGR3]. for searching across these MERGERINDEXES I use the following Code
    IndexSearcher[] indexToSearch = new IndexSearcher[CNTINDXDBOOK];
    for (int all = 0; all < CNTINDXDBOOK; all++) {
      indexToSearch[all] = new IndexSearcher(INDEXEDBOOKS[all]);
      System.out.println(all + " ADDED TO SEARCHABLES " + INDEXEDBOOKS[all]);
    }
    MultiSearcher searcher = new MultiSearcher(indexToSearch);
Question : When on Search Process, How to Display that this relevant Document Id Originated from Which MRG??? [ Something like this: - Search word 'ISBN12345' is available from MRGx ] I think you are looking for the methods subSearcher() and subDoc() on MultiSearcher. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
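A sketch using those two methods; the variable names follow the code above:

    import org.apache.lucene.search.Hits;

    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
      int id = hits.id(i);                 // doc number within the MultiSearcher
      int mrg = searcher.subSearcher(id);  // index into indexToSearch[]
      int subId = searcher.subDoc(id);     // doc number within that sub-searcher
      System.out.println("hit " + i + " came from index " + mrg + ", doc " + subId);
    }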
Re: Optimising A Security Filter
On Sunday 19 December 2004 23:05, Steve Skillcorn wrote: Hello All; I bought the Lucene in Action ebook, which is excellent and I can strongly recommend. One question that has arisen from the book though is custom filters. I have the situation where the text of my docs is in Lucene, but the permissions are in my RDBMS. I can write a filter (in fact have done so) that loops through the documents in the passed IndexReader and queries the DB to detect if the user is permissioned for them, setting the relevant BitSet. My results are then paged ( last | next ) to a web page. Does the IndexReader that is passed to the bits method of the filter represent the entire index, or just the results that match the query? The IndexReader represents the entire index. Is not worrying about filters, and simply checking the returned Hit List before presenting, a sensible approach? That is done by the IndexSearcher.search() methods that take a filter argument. I can see the point to filters as presented in the Lucene in Action ISBN example, but are they a good approach where they could end up laboriously marking the entire index as True? The filter is checked only for search results on the query over the whole index. The bit filters generally work well, except when you need a lot of very sparse filters and memory is a concern. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
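A rough sketch of such a filter; the permission check (isPermitted), the user id and the stored key field "docKey" are placeholders for the RDBMS lookup described above:

    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    public class PermissionFilter extends Filter {
      private final String userId;

      public PermissionFilter(String userId) { this.userId = userId; }

      public BitSet bits(IndexReader reader) throws java.io.IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int i = 0; i < reader.maxDoc(); i++) {
          if (reader.isDeleted(i)) continue;
          String docKey = reader.document(i).get("docKey"); // assumed stored key field
          if (isPermitted(userId, docKey)) bits.set(i);     // RDBMS lookup, not shown
        }
        return bits;
      }

      private boolean isPermitted(String userId, String docKey) {
        return true; // placeholder for the database permission check
      }
    }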
Re: Relevance percentage
On Monday 20 December 2004 15:09, Gururaja H wrote: Hi, But, How to calculate the coord() fraction? I know by default, in DefaultSimilarity the coord() fraction is defined as below:
    /** Implemented as overlap / maxOverlap. */
    public float coord(int overlap, int maxOverlap) {
      return overlap / (float)maxOverlap;
    }
How to get the overlap and maxOverlap value in each of the matched document(s)? In case you only want the coordination factor to have more influence in the order of your search results you can use a Similarity with a coord() function that has a power higher than 1:
    public float coord(int overlap, int maxOverlap) {
      return (float) Math.pow((overlap / (float)maxOverlap), SOME_POWER);
    }
I'd first try values between 3.0f and 5.0f for SOME_POWER. The searching code precomputes all coord values once per query per search, so there is no need to worry about the computing efficiency. This has the advantage that the other scoring factors are still used for ranking. Since the other factors can vary quite a bit, it is difficult to guarantee that any coord() implementation will provide a score that sorts by the number of matching clauses. Higher powers as above can come a long way, though. Regards, Paul Elschot Thanks, Gururaja Mike Snare [EMAIL PROTECTED] wrote: I'm still new to Lucene, but wouldn't that be the coord()? My understanding is that the coord() is the fraction of the boolean query that matched a given document. Again, I'm new, so somebody else will have to confirm or deny... -Mike On Mon, 20 Dec 2004 00:33:21 -0800 (PST), Gururaja H wrote: How to find out the percentages of matched terms in the document(s) using Lucene? Here is an example, of what i am trying to do: The search query has 5 terms (ibm, risc, tape, drive, manual) and there are 4 matching documents with the following attributes: Doc#1: contains terms(ibm,drive) Doc#2: contains terms(ibm,risc, tape, drive) Doc#3: contains terms(ibm,risc, tape,drive) Doc#4: contains terms(ibm, risc, tape, drive, manual). The percentages displayed would be 100%(Doc#4), 80%(doc#2), 80%(doc#3) and 40% (doc#1). Any help on how to go about doing this? Thanks, Gururaja - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
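To use such a coord(), the method goes into a Similarity subclass that is then installed on the searcher; a sketch (the class name, power value and index path are arbitrary):

    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.IndexSearcher;

    public class CoordSimilarity extends DefaultSimilarity {
      private static final double SOME_POWER = 4.0;
      public float coord(int overlap, int maxOverlap) {
        // a higher power lets the number of matching clauses dominate the score
        return (float) Math.pow(overlap / (double) maxOverlap, SOME_POWER);
      }
    }

    // at search time:
    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    searcher.setSimilarity(new CoordSimilarity());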
Re: Permissioning Documents
On Friday 10 December 2004 07:10, Steve Skillcorn wrote: Hi; I'm currently using Lucene (which I am extremely impressed with BTW) to index a knowledge base of documents. One issue I have is that only certain documents are available to certain users (or groups). The number of documents is large, into the 100,000s, and the number of users can be into the 1000s. Obviously, the users permissioned to see certain documents can change regularly, so storing the user id's in the Lucene document is undesirable, as a permission change could mean a delete and re-add to potentially 100s of documents. Does anyone have any guidance as to how I should approach this? A typical solution would be to use a Filter for each user group. Each Filter would be built from categories indexed with the documents. The moment to build a group Filter could be the first time a user from a group queries an index after it is opened. Filters can be cached, see the recent discussion on CachingWrapperFilter and friends. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
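A sketch of a cached per-group filter, assuming each document carries an indexed group category field (the field and value names are made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    // one cached filter per user group; reuse the same Filter instance
    // so the cached BitSet is shared across queries on the same IndexReader
    Filter groupFilter = new CachingWrapperFilter(
        new QueryFilter(new TermQuery(new Term("group", "accounting"))));

    Hits hits = searcher.search(query, groupFilter);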
Re: Retrieving all docs in the index
On Thursday 09 December 2004 21:18, Ravi wrote: That was exactly my original question. I was wondering if there are alternatives to this approach. In case you need only a few of the top ranking documents, and the documents are to be sorted by date anyway, you might consider searching each of the dates in sorted order separately until you have enough results. In that way there is no need to use a field with some constant value. Nonetheless, I can recommend having a special field containing all the field names for a document. As all docs normally contain a primary key, the name of the primary key field can serve as the constant value. Regards, Paul Elschot -Original Message- From: Aviran [mailto:[EMAIL PROTECTED] Sent: Thursday, December 09, 2004 2:08 PM To: 'Lucene Users List' Subject: RE: Retrieving all docs in the index In this case you'll have to add another field with a fixed value to all the documents and query on that field Aviran http://www.aviransplace.com -Original Message- From: Ravi [mailto:[EMAIL PROTECTED] Sent: Thursday, December 09, 2004 14:04 PM To: Lucene Users List Subject: RE: Retrieving all docs in the index I'm sorry I don't think I articulated my question well. We use a date filter to sort the search results. This works fine when the user provides some search criteria. But if he gives an empty search criteria, we need to return all the documents in the index in the given date range sorted by date. So I was looking for a query that returns me all documents in the index and then I want to apply the date filter on it. -Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Thursday, December 09, 2004 1:55 PM To: Lucene Users List Subject: Re: Retrieving all docs in the index On Dec 9, 2004, at 1:35 PM, Ravi wrote: Is there any other way to extract all documents from an index apart from adding an additional field with the same value to all documents and then doing a term query on that field with the common value? Of course. Have a look at the IndexReader API. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
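For completeness, a sketch of walking every document through the IndexReader API directly, with no query at all (path and field name assumed):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    IndexReader reader = IndexReader.open("/path/to/index");
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;   // skip deleted slots
      Document doc = reader.document(i);
      // apply the date range test on a stored date field here, e.g. doc.get("date")
    }
    reader.close();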
Re: lucene in action ebook
synchronized(luceneEbook){ luceneEbook.wait(); } Just waiting for the notifyAll() Kevin A. Burton wrote: Erik Hatcher wrote: I have the e-book PDF in my possession. I have been prodding Manning on a daily basis to update the LIA website and get the e-book available. It is ready, and I'm sure that its just a matter of them pushing it out. There may be some administrative loose ends they are tying up before releasing it to the world. It should be available any minute now, really. :) Send off a link to the list when its out... We're all holding our breath ;) (seriously) Kevin -- *Paul Smith *Software Architect *Aconex * 31 Drummond Street, Carlton, VIC 3053, Australia *Tel: +61 3 9661 0200 *Fax: +61 3 9654 9946 Email: [EMAIL PROTECTED] www.aconex.com - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: restricting search result
Paul, On Friday 03 December 2004 23:31, you wrote: Hi, how would you restrict the search results for a certain user? I'm One way to restrict results is by using a Filter. indexing all the existing data in my application but there are certain access levels so some users should see more results than another. Each lucene document has a field with an internal id and I want to restrict on that basis. I tried it with adding a long concatenation of my ids (+locationId:1 +locationId:3 + ...) but this throws a "More than 32 required/prohibited clauses in query." exception. Any suggestions? Using a + before each term requires them all, ie. uses AND, which would normally have an empty result for an Id field. You might prefer this query concatenation: +(locationId:1 locationId:3 ...) It effectively OR's the locationId content query and requires only one of the terms to match. In this case using a Filter would probably be better, though. Regards, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
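The same query built in program code, as a sketch; locationIds and userQuery stand in for the application's own values:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    BooleanQuery ids = new BooleanQuery();
    for (int i = 0; i < locationIds.length; i++) {
      // optional clause: required=false, prohibited=false
      ids.add(new TermQuery(new Term("locationId", locationIds[i])), false, false);
    }
    BooleanQuery full = new BooleanQuery();
    full.add(userQuery, true, false); // the user's query, required
    full.add(ids, true, false);       // at least one matching locationId, required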
Re: restricting search result
The thing with the different indexes sounds too complicated because the users (and their rights) as well as the index itself change quite often. One way to restrict results is by using a Filter. but a filter is applied after the whole search is performed, isn't it? I thought it might be faster to restrict the search space in advance Using a + before each term requires them all, ie. uses AND, which would normally have an empty result for an Id field. d'oh, yes of course.. You might prefer this query concatenation: +(locationId:1 locationId:3 ...) ok, that sounds very nice and works fine. But I will have a closer look at the filter as well. Thank you all Paul P.S. someone without gmail account? mail me - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: restricting search result
On Saturday 04 December 2004 15:44, Erik Hatcher wrote: On Dec 4, 2004, at 6:44 AM, Paul wrote: One way to restrict results is by using a Filter. but a filter is applied after the whole search is performed, isn't it? Incorrect. A filter is applied *before* the search truly occurs - in other words it reduces the search space. Currently a filter is applied during search, after the document score is computed, but before a document is added to the search results. In practice, the score computation is much less work than the I/O, so a filter does reduce the search space. A filter might also be used to reduce the I/O for searching, but Lucene doesn't do that now, probably because there was little to gain. Regards, Paul Elschot. P.S. The code doing the filtering is in IndexSearcher.java, from line 97. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: IndexWriter.optimize and memory usage
On Friday 03 December 2004 08:43, Paul Elschot wrote: On Friday 03 December 2004 07:50, Chris Hostetter wrote: ... So, If I'm understanding you (and the javadocs) correctly, the real key here is maxMergeDocs. It seems like addDocument will never merge a segment until maxMergeDocs have been added? ... meaning that I need a value less than the default (Integer.MAX_VALUE) if I want IndexWriter to do incremental merges as I go ... ...except... ...if that were the case, then what exactly is the meaning of mergeFactor? Oops, a correction to what I wrote earlier: minMergeDocs should be replaced by mergeFactor, giving: maxMergeDocs controls the sizes of the intermediate segments when adding documents. With maxMergeDocs at default, adding a document can take as much time as (and have the same effect as) optimize. Eg. with mergeFactor at 10, the 1000'th added document will create a segment of size 1000. With maxMergeDocs at a lower value than 1000, the last merge (of the 10 segments with 100 docs each) will not be done. Also: optimize() uses mergeFactor for its final merges, but it ignores maxMergeDocs. End of correction. Meanwhile these fields have been deprecated in the development version in favour of set... methods. Setting minMergeDocs is deprecated and to be replaced by setMaxBufferedDocs(). The javadoc for this reads: Determines the minimal number of documents required before the buffered in-memory documents are merged and a new Segment is created. Since Documents are merged in a RAMDirectory, a large value gives faster indexing. At the same time, mergeFactor limits the number of files open in a FSDirectory. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
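In the 1.4 API these are public fields on IndexWriter; a sketch with example values only (the path is made up, and the right numbers depend on the collection):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
    writer.mergeFactor = 10;      // how many segments are merged at a time
    writer.minMergeDocs = 100;    // docs buffered in RAM before a disk segment is written
    writer.maxMergeDocs = 10000;  // cap on intermediate segment size while adding
    // ... add documents ...
    writer.optimize();            // the final merge ignores maxMergeDocs
    writer.close();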
restricting search result
Hi, how would you restrict the search results for a certain user? I'm indexing all the existing data in my application but there are certain access levels so some users should see more results than another. Each lucene document has a field with an internal id and I want to restrict on that basis. I tried it with adding a long concatenation of my ids (+locationId:1 +locationId:3 + ...) but this throws a "More than 32 required/prohibited clauses in query." exception. Any suggestions? thx! Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Does Lucene perform ranking in the retrieved set?
On Tuesday 30 November 2004 18:46, Xiangyu Jin wrote: This might be a stupid question. When performing retrieval for a query, does Lucene first get a subset of candidate matches and then perform the ranking on the set? That is, similarity calculation is performed only on a subset of the documents to the query. Yes, Lucene uses an inverted index for this. If so, from which module could I get those candidate docs, then I can perform my own similarity calculations (since I might need to rewrite the normalization factor, so only modifying the similarity model seems like it will not work). To change the normalisation you may consider implementing your own Weight: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Weight.html For some example implementations of Weight the Lucene source code in the org.apache.lucene.search package is the best resource. Using your own Weight also requires a subclass of Query that returns this weight in the createWeight() method. Or, is there a document describing the procedure of how Lucene performs search? This describes the scoring: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html See also the DefaultSimilarity. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: URGENT: Help indexing large document set
On Wednesday 24 November 2004 00:37, John Wang wrote: Hi: I am trying to index 1M documents, with batches of 500 documents. Each document has a unique text key, which is added as a Field.Keyword(name,value). For each batch of 500, I need to make sure I am not adding a document with a key that is already in the current index. To do this, I am calling IndexSearcher.docFreq for each document and delete the document currently in the index with the same key:
    while (keyIter.hasNext()) {
      String objectID = (String) keyIter.next();
      term = new Term(key, objectID);
      int count = localSearcher.docFreq(term);
To speed this up a bit make sure that the iterator gives the terms in sorted order. I'd use an index reader instead of a searcher, but that will probably not make a difference. Adding the documents can be done with multiple threads. Last time I checked that, there was a moderate speed up using three threads instead of one on a single CPU machine. Tuning the values of minMergeDocs and maxMergeDocs may also help to increase performance of adding documents. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
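A sketch of the delete-then-add pass using an IndexReader as suggested, assuming keyIter iterates the batch keys in sorted order and the key field is named "key":

    import java.util.Iterator;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    IndexReader reader = IndexReader.open("/path/to/index");
    while (keyIter.hasNext()) {
      String objectID = (String) keyIter.next();
      Term term = new Term("key", objectID);
      if (reader.docFreq(term) > 0) {
        reader.delete(term); // remove the older document with this key
      }
    }
    reader.close(); // commit deletions before opening an IndexWriter to add the batch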
Re: lucene Scorers
On Wednesday 24 November 2004 01:31, Ken McCracken wrote: Hi, Thanks for the pointers in your replies. Would it be possible to include some sort of accrual scorer interface somewhere in the Lucene Query APIs? This could be passed into a query similar to MaxDisjunctionQuery; and combine the sum, max, tieBreaker, etc., according to the implementor's discretion, to compute the overall score for a document. The DisjunctionScorer is currently not part of Lucene. You might try and subclass Similarity to provide what you need and pass that to your Query. I'm using a few subclasses of DisjunctionScorer to provide the actual score value ao. for max and sum. For each of these scorers, I use a separate Query and Weight. This gives a parallel class hierarchy for Query, Weight and Scorer. I guess it's time to have a look at Design Patterns and/or Refactoring on how to get rid of the parallel class hierarchy. That could also involve some sort of accrual scorer and Lucene's Similarity. Regards, Paul Elschot -Ken On Sat, 13 Nov 2004 12:07:05 +0100, Paul Elschot [EMAIL PROTECTED] wrote: On Friday 12 November 2004 22:56, Chuck Williams wrote: I had a similar need and wrote MaxDisjunctionQuery and MaxDisjunctionScorer. Unfortunately these are not available as a patch but I've included the original message below that has the code (modulo line breaks added by simple text email format). This code is functional -- I use it in my app. It is optimized for its stated use, which involves a small number of clauses. You'd want to improve the incremental sorting (e.g., using the bucket technique of BooleanQuery) if you need it for large numbers of clauses. When you're interested, you can also have a look here for yet another DisjunctionScorer: http://issues.apache.org/bugzilla/show_bug.cgi?id=31785 It has the advantage that it implements skipTo() so that it can be used as a subscorer of ConjunctionScorer, ie. it can be faster in situations like this: aa AND (bb OR cc) where bb and cc are treated by the DisjunctionScorer. When aa is a filter this can also be used to implement a filtering query. Re. Paul's suggested steps below, I did not integrate this with query parser as I didn't need that functionality (since I'm generating the multi-field expansions for which max is a much better scoring choice than sum). Chuck Included message: -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Monday, October 11, 2004 9:55 PM To: [EMAIL PROTECTED] Subject: Contribution: better multi-field searching The files included below (MaxDisjunctionQuery.java and MaxDisjunctionScorer.java) provide a new mechanism for searching across multiple fields. The maximum indeed works well, also when the fields differ a lot in length. Regards, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Numeric Range Restrictions: Queries vs Filters
Chris, On Tuesday 23 November 2004 03:25, Hoss wrote: (NOTE: numbers in [] indicate Footnotes) I'm rather new to Lucene (and this list), so if I'm grossly misunderstanding things, forgive me. One of my main needs as I investigate Search technologies is to restrict results based on Ranges of numeric values. Looking over the archives of this list, it seems that lots of people have run into problems dealing with this. In particular, whenever someone asks a question about Numeric Ranges the question seem to always involve one (or more) of the following: (a) Lexical sorting puts 11 in the range 1 TO 5 (b) Dates (or Dates and Times) (c) BooleanQuery$TooManyClauses Exceptions (d) Should I use a filter? FWIW, the javadoc of the development version of BooleanQuery.maxClauseCount reads: The maximum number of clauses permitted. Default value is 1024. Use the org.apache.lucene.maxClauseCount system property to override. TermQuery clauses are generated from for example prefix queries and fuzzy queries. Each TermQuery needs some buffer space during search, so this parameter indirectly controls the maximum buffer requirements for query search. Normally the buffers are allocated by the JVM. When using for example MMapDirectory the buffering is left to the operating system. MMapDirectory uses memory mapped files for the index. It would be useful to also provide a reference to filters (DateFilter) and to LongField in case it is added to the code base. ... The Query API on the other hand ... I freely admit, that I can't make heads or tails out of it. I don't even know where I would begin to try and write a new subclass of Query if I wanted to. In a nutshell: A Query either rewrites to another Query, or it provides a Weight. A Weight first does normalisation and then provides a Scorer to be used during search. RangeQuery is a good example: A RangeQuery rewrites to a BooleanQuery over TermQuery's for the matching terms. A BooleanQuery provides a BooleanScorer via its Weight. A TermQuery provides a TermScorer via its Weight. Regards, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
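For problem (a), the usual workaround is to index numbers left-padded to a fixed width so that lexical order equals numeric order; a sketch (class and field names are made up for illustration):

    public class ZeroPad {
      // pad to a fixed width: pad(11, 10) -> "0000000011", pad(5, 10) -> "0000000005"
      public static String pad(long value, int width) {
        String s = Long.toString(value);
        StringBuffer buf = new StringBuffer(width);
        for (int i = s.length(); i < width; i++) buf.append('0');
        return buf.append(s).toString();
      }
    }

    // indexing: doc.add(Field.Keyword("price", ZeroPad.pad(price, 10)));
    // searching: price:[0000000001 TO 0000000005] now correctly excludes 11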
experiences with PDF files
Hi, I read a lot of mails about the time consuming pdf-parsing and tried some solutions myself. My example PDF file has 181 pages in 1.5 MB (mostly text, nearly no graphics).
- with pdfbox.org's toolkit it took 17m32s to parse and read its content
- after installing ghostscript and ps2text / ps2ascii my parsing failed after page 54 and 2m51s because of irregular fonts
- installing XPDF and using its tool pdftotext, parsing completed after 7-10 seconds
My machine is a Celeron 1700 with VMWare Workstation 3.2 (128 MB assigned) and Linux Suse 7.3. I will parse my pdf files with xpdf and something like Runtime.getRuntime().exec("pdftotext -nopgbrk -raw " + pdfFileName + " " + txtFileName); Paul P.S. look at http://www.jguru.com/faq/view.jsp?EID=1074237 for links and tips - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
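A slightly fuller sketch of shelling out to pdftotext; the argument-array form avoids problems with spaces in file names, and error handling is kept minimal:

    // convert, wait for completion, then read txtFileName for indexing
    String[] cmd = {"pdftotext", "-nopgbrk", "-raw", pdfFileName, txtFileName};
    Process p = Runtime.getRuntime().exec(cmd);
    int exit = p.waitFor();  // 0 means pdftotext finished successfully
    if (exit != 0) {
      System.err.println("pdftotext failed with exit code " + exit);
    }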
retrieving added document
Hi, I'm creating a document and adding it with a writer to the index. For some reason I need to add data to this specific document later on (minutes, not hours or days). Is it possible to retrieve it and add additional data? I found the document(int n) method within the IndexReader (btw: the description makes no sense to me: "Returns the stored fields of the nth Document in this index." - but it returns a Document and not a list of fields..) but where do I get that number from? (and the numbers change, I know..) thanks for any help Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using multiple analysers within a query
On Monday 22 November 2004 05:02, Kauler, Leto S wrote: Hi Lucene list, We have the need for analysed and 'not analysed/not tokenised' clauses within one query. Imagine an unparsed query like: +title:Hello World +path:Resources\Live\1 In the above example we would want the first clause to use StandardAnalyser and the second to use an analyser which returns the term as a single token. So a parsed result might look like: +(title:hello title:world) +path:Resources\Live\1 Would anyone have any suggestions on how this could be done? I was thinking maybe the QueryParser would have to be changed/extended to accept a separator other than colon :, something like = for example to indicate this clause is not to be tokenised. Or perhaps this can all be done using a single analyser? Overriding QueryParser.getFieldQuery() might work for you. It is given the field and the query text so an analyzer can be chosen depending on the field. In case you don't use the latest cvs head, it may be worthwhile to have a look. Some of the getFieldQuery methods have been deprecated, but I don't know when. Regards, Paul. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
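A sketch of that override, choosing the behaviour per field; the two-argument getFieldQuery signature below is the non-deprecated one in recent versions, so check it against the QueryParser actually in use:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class PerFieldQueryParser extends QueryParser {
      public PerFieldQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
      }
      protected Query getFieldQuery(String field, String queryText) throws ParseException {
        if ("path".equals(field)) {
          // untokenized field: use the query text as one single term
          return new TermQuery(new Term(field, queryText));
        }
        return super.getFieldQuery(field, queryText); // analyzed as usual
      }
    }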
Re: Lucene and SVD
On Wednesday 17 November 2004 23:57, DES wrote: Hi I need some kind of implementation of SVD (singular value decomposition) or LSI with Lucene engine. Has anyone any ideas how to create a query table for decomposition? The table must have documents as rows and terms as columns; if a term is present in the document, the corresponding field contains a 1, and a 0 if not. Then the SVD will be applied to this table, From Lucene, with TermVector and field norm, one could use the term density instead of a presence bit. and with the first 2 columns documents will be displayed in a 2D-space. Does anyone work on a project like this? I don't know. Is there a good SVD package for Java? Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: boolean/set operations on lucene queries
On Thursday 18 November 2004 16:57, Rupinder Singh Mazara wrote: hi all I needed some help in solving the following problem: a user executes query1 and query2; both the queries (not result sets) get stored. Over time the user wants to find which documents from query1 are common to documents in query2, basically an intersect of the results of query1 with query2, and similarly the union and difference between the results of query1 and query2, without having to run the queries and storing the results into some kind of data structure. Does lucene provide some capabilities for this? I was reading about QueryFilter. The queries can be added as clauses to a BooleanQuery. Such clauses can be optional, required or prohibited. Regards, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
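A sketch of the three set operations, using the 1.4-style add(query, required, prohibited); query1 and query2 are the stored Query objects from the question:

    import org.apache.lucene.search.BooleanQuery;

    // intersection: both required
    BooleanQuery and = new BooleanQuery();
    and.add(query1, true, false);
    and.add(query2, true, false);

    // union: both optional (at least one must match)
    BooleanQuery or = new BooleanQuery();
    or.add(query1, false, false);
    or.add(query2, false, false);

    // difference: query1 required, query2 prohibited
    BooleanQuery andNot = new BooleanQuery();
    andNot.add(query1, true, false);
    andNot.add(query2, false, true);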
Re: Need help with filtering
On Wednesday 17 November 2004 01:20, Edwin Tang wrote: Hello, I have been using DateFilter to limit my search results to a certain date range. I am now asked to replace this filter with one where my search results have document IDs greater than a given document ID. This document ID is assigned during indexing and is a Keyword field. I've browsed around the FAQs and archives and see that I can either use QueryFilter or BooleanQuery. I've tried both approaches to limit the document ID range, but am getting the BooleanQuery.TooManyClauses exception in both cases. I've also tried bumping max number of clauses via setMaxClauseCount(), but that number has gotten pretty big. Is there another approach to this? ... Recoding DateFilter to a DocumentIdFilter should be straightforward. The trick is to use only one document enumerator at a time for all terms. Document enumerators take buffer space, and that is the reason why BooleanQuery has an exception for too many clauses. Regards, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
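A sketch of such a DocumentIdFilter modeled on DateFilter, assuming a keyword field whose values sort lexically (e.g. zero-padded ids); it walks the terms above the given id with a single TermDocs enumerator, which is the trick mentioned above:

    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.Filter;

    public class DocumentIdFilter extends Filter {
      private final String field;
      private final String lowestId; // exclusive lower bound

      public DocumentIdFilter(String field, String lowestId) {
        this.field = field;
        this.lowestId = lowestId;
      }

      public BitSet bits(IndexReader reader) throws java.io.IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermEnum terms = reader.terms(new Term(field, lowestId));
        TermDocs termDocs = reader.termDocs();
        try {
          for (Term t = terms.term();
               t != null && t.field().equals(field);
               t = terms.next() ? terms.term() : null) {
            if (t.text().compareTo(lowestId) <= 0) continue; // strictly greater only
            termDocs.seek(t);
            while (termDocs.next()) bits.set(termDocs.doc());
          }
        } finally {
          termDocs.close();
          terms.close();
        }
        return bits;
      }
    }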
Re: COUNT SUBINDEX [IN MERGERINDEX]
On Wednesday 17 November 2004 07:10, Karthik N S wrote: Hi guys Apologies. So A Merged Index is again a Single [ addition of subIndexes... ]. In that case, If One of the Field Types is of type 'Field.Keyword' which is Unique across the subIndexes [Before Merging], and If I want to Count this Unique Field in a MergerIndex [After it's been Merged] How do I do this Please. IndexReader.numDocs() will give the number of docs in an index. Lucene has no direct support for unique fields. After merging, if the same unique field value occurs in both source indexes, the merged index will contain two documents with that value. In case one wants to merge into unique field values, the non unique values in one of the source indexes need to be deleted before merging. See IndexReader.termDocs(term) on how to get the document numbers for (unique) terms via a TermDocs, and IndexReader.delete(docNum) for deleting docs. Regards, Paul. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
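A sketch that counts the distinct values of such a field after merging; duplicates show up as terms with a document frequency above 1 (the field name "bookId" and the path are assumptions):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    IndexReader reader = IndexReader.open("/path/to/mergedIndex");
    TermEnum terms = reader.terms(new Term("bookId", ""));
    int distinct = 0, duplicated = 0;
    for (Term t = terms.term();
         t != null && t.field().equals("bookId");
         t = terms.next() ? terms.term() : null) {
      distinct++;
      if (terms.docFreq() > 1) duplicated++; // same value in more than one doc
    }
    terms.close();
    reader.close();
    System.out.println(distinct + " distinct values, " + duplicated + " duplicated");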
Re: BooleanQuery - TooManyClauses Issue
On Tuesday 16 November 2004 21:35, Joe Krause wrote: Hey Folks, I just inherited a deployed Lucene based application that started throwing the following exception: org.apache.lucene.search.BooleanQuery$TooManyClauses ... I did some research regarding this error and found out that the default number of clauses a BooleanQuery can contain is 1024 (a limitation, but one that seems reasonable to work with). I outputted the contents of the org.apache.lucene.search.Query object and the org.apache.lucene.search.Sort objects right before I sent them to the org.apache.lucene.search.IndexSearcher - to see if there are too many clauses being accidentally produced. This is what I get:
    2004-11-16 12:09:40,302 DEBUG com.multivision.util.search.HitIndex - Query = +(affiliate:teeth market:teeth dma_rank:teeth program:teeth station:teeth text:teeth) +air_date:[040101 TO 0411162359]
    2004-11-16 12:09:40,302 DEBUG com.multivision.util.search.HitIndex - Sort = air_date!,dma_rank
So there appears to be far fewer than 1024 clauses. Are there any other reasons why I would be getting this exception? I am new to Lucene, so at this point I am stumped. The range query: +air_date:[040101 TO 0411162359] is almost certainly causing your problems. It expands further to all terms in the range. Several solutions to this have been discussed earlier, ao. splitting dates into day and time components. Once you approach 1000 days, you'll get the same problem again, so you might want to use a filter for the dates. See DateFilter and the archives on MMDD. Regards, Paul Elschot. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: lucene Scorers
On Friday 12 November 2004 22:56, Chuck Williams wrote: I had a similar need and wrote MaxDisjunctionQuery and MaxDisjunctionScorer. Unfortunately these are not available as a patch but I've included the original message below that has the code (modulo line breaks added by simple text email format). This code is functional -- I use it in my app. It is optimized for its stated use, which involves a small number of clauses. You'd want to improve the incremental sorting (e.g., using the bucket technique of BooleanQuery) if you need it for large numbers of clauses. When you're interested, you can also have a look here for yet another DisjunctionScorer: http://issues.apache.org/bugzilla/show_bug.cgi?id=31785 It has the advantage that it implements skipTo() so that it can be used as a subscorer of ConjunctionScorer, ie. it can be faster in situations like this: aa AND (bb OR cc) where bb and cc are treated by the DisjunctionScorer. When aa is a filter this can also be used to implement a filtering query. Re. Paul's suggested steps below, I did not integrate this with query parser as I didn't need that functionality (since I'm generating the multi-field expansions for which max is a much better scoring choice than sum). Chuck Included message: -Original Message- From: Chuck Williams [mailto:[EMAIL PROTECTED] Sent: Monday, October 11, 2004 9:55 PM To: [EMAIL PROTECTED] Subject: Contribution: better multi-field searching The files included below (MaxDisjunctionQuery.java and MaxDisjunctionScorer.java) provide a new mechanism for searching across multiple fields. The maximum indeed works well, also when the fields differ a lot in length. Regards, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
On Saturday 13 November 2004 09:16, Sanyi wrote:
- leave the current implementation, raising an exception;
- handle the exception and limit the boolean query to the first 1024 (or whatever the limit is) terms;
- select, between the possible terms, only the first 1024 (or whatever the limit is) more meaningful ones, leaving out all the others.
I like this idea and I would finalize it to myself like this: I'd also create a default rule for that to avoid handling exceptions for people who're happy with the default behavior: Keep and search for only the longest 1024 fragments, so it'll throw a,an,at,and,add,etc.., but it'll automatically keep 1024 variations like alpha,alfa,advanced,automatical,etc.. Wouldn't it be counterintuitive to only use the longest matches for truncations? To have only longer matches one can also use queries with multiple ? characters, each matching exactly one character. I think it would be better to encourage the users to use longer and maybe also more prefixes. This gives more precise results and is more efficient to execute. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
On Friday 12 November 2004 07:57, Sanyi wrote: That's the point: there is no query optimizer in Lucene. Sorry, I'm not very much into Lucene's internal Classes, I'm just telling you the viewpoint of a user. You know my users aren't technicians, so answers like yours won't make them happy. They will only see that I randomly don't allow them to search (with the 1024 limit). They won't understand why I am displaying "Please restrict your search a bit more.." when they've just searched for dodge AND vip* and there are only a few documents matching this criteria. So, is the only way to make them able to search happily by setting the max. clause limit to MaxInt?! The problem is that there is a lot of freedom in choosing a query, but there is a limited amount of resources available to search each query. It is normally possible to reduce the number of such complaints a lot by imposing a minimum prefix length and eg. doubling or tripling the max. nr. of clauses. This reduces the freedom of the users because their queries must be (a bit) more specific. The actual tradeoff depends on the user requirements and the time and memory available on the server, so the users get what they pay for. Imposing a minimum prefix length can be done by overriding the method in QueryParser that provides a prefix query. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
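A sketch of enforcing a minimum prefix length in a QueryParser subclass; the class name, message text and the length 3 are arbitrary choices:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class MinPrefixQueryParser extends QueryParser {
      public MinPrefixQueryParser(String field, Analyzer analyzer) {
        super(field, analyzer);
      }
      protected Query getPrefixQuery(String field, String termStr) throws ParseException {
        if (termStr.length() < 3) {
          // reject overly broad truncations before they expand to too many terms
          throw new ParseException("Please use at least 3 characters before the *");
        }
        return super.getPrefixQuery(field, termStr);
      }
    }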
Re: lucene Scorers
On Friday 12 November 2004 20:48, Ken McCracken wrote: Hi, I am looking at the Similarity class overview, and wondering if I can replace the SUM operator with a MAX operator, or any other operator (across the terms in a query). For example, if I search for car OR automobile, a BooleanScorer is used to add the values from each subexpression together. In the BooleanScorer from lucene_1_4_final, in the inner class Collector, we have in the collect(...) method, the line
    bucket.score += score; // increment score
that I may want to replace with a MAX operator such as
    if (score > bucket.score) bucket.score = score; // take the max
I may also want to keep track of both the max and the sum, by extending the inner class Bucket. Do you have any suggestions on how to implement such a change? Ideally, I would like to have the ability to define my choice of scoring algorithm at search time (at run time), and use the Lucene SUM scorer for some searches, and the MAX scorer for other searches. Thanks for your help. -Ken PS. The code I'm talking about falls in the following area, for my example search car OR automobile. If I walk the code during search, I see that the BooleanScorer$Collector is created by the Weight that was just created, in BooleanQuery$BooleanWeight.scorer(...), as it adds the subscorers for each of the terms in the BooleanScorer. When that collector is asked to collect(...), its bucketTable is filled in. Since the collectors for each of the terms use the same bucketTable, if the document already appears in the bucketTable, then its score is added to implement a SUM operator. Since you are that far already, you can (in reverse order):
- replace the BooleanScorer by another one that takes the max instead of summing.
- replace the weight to return that scorer.
- replace the BooleanQuery to return that weight.
- override QueryParser.getBooleanQuery() to return that query in the cases you want, that is when all clauses are optional.
"replace" usually means "inherit from" in new code. When you need more info on this, try lucene-dev. Regards, Paul Elschot. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Query#rewrite Question
On Thursday 11 November 2004 03:51, Satoshi Hasegawa wrote: Hello, Our program accepts input in the form of Lucene query syntax from the user, but we wish to perform additional tasks such as thesaurus expansion. So I want to manipulate the Query object that results from parsing. My question is, is the result of the Query#rewrite method guaranteed to be either a TermQuery, a PhraseQuery, or a BooleanQuery, and if it is a BooleanQuery, do all the constituent clauses also reduce to one of the above three classes? If not, what if the original Query object was the one that was obtained from QueryParser#parse method? Can I assume the above in this restricted case? I experimented with the current version, and the above seems to be positive in this version; I'm asking if this could change in the future. Thank you. In general, a Query should either rewrite to another query, or provide a Weight. During search, the Weight then provides a Scorer to score the docs. The only other type of query currently available is SpanQuery, which is a generalization of PhraseQuery. It does not rewrite and provides a Weight. However, the current QueryParser does not have support for SpanQuery. So, as long as the QueryParser does not support more than the current types of queries, and you only use the QueryParser to obtain queries, all the constituent clauses will reduce as you indicate above. SpanQuery could be useful for thesaurus expansion. The generalization it provides is that it allows nested distance queries. For example, in: "word1 word2"~2 word2 can be expanded to: word2 or "word3 word4"~4 leading to a query that is not supported by the current QueryParser: "word1 (word2 or "word3 word4"~4)"~2 SpanQueries can also enforce an order on the matching subqueries, but that is difficult to express in the current query syntax. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
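A sketch of that unsupported query built directly from span queries in program code (the field name "f" is an assumption):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanOrQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    SpanQuery word1 = new SpanTermQuery(new Term("f", "word1"));
    SpanQuery word2 = new SpanTermQuery(new Term("f", "word2"));
    SpanQuery word3 = new SpanTermQuery(new Term("f", "word3"));
    SpanQuery word4 = new SpanTermQuery(new Term("f", "word4"));

    // "word3 word4"~4
    SpanQuery near34 = new SpanNearQuery(new SpanQuery[] {word3, word4}, 4, false);
    // the thesaurus expansion of word2: word2 or "word3 word4"~4
    SpanQuery expanded = new SpanOrQuery(new SpanQuery[] {word2, near34});
    // the nested distance query: word1 within 2 of the expansion
    SpanQuery all = new SpanNearQuery(new SpanQuery[] {word1, expanded}, 2, false);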
Re: What is the difference between these searches?
Luke, On Tuesday 09 November 2004 20:58, you wrote: Hi, I've implemented a converter to translate our system's internal Query objects to Lucene's Query model. I recently realized that my implementation of OR NOT was not working as I would expect and I was wondering if anyone on this list could give me some advice. Could you explain OR NOT ? Lucene has no provision for matching by being prohibited only. This can be achieved by indexing something for each document that can be used in queries to match always, combined with something prohibited in a query. But doing this is bad for performance for querying larger nrs of docs. Lucene's - prefix in queries means AND NOT, ie. the term with the - prefix prohibits the matching of a document. I am converting a query that means foo or not bar into the following: +item_type:xyz +(field_name:foo -field_name:bar) This returns only Documents where field_name contains foo. I would expect it to return all the Documents where field_name contains foo or field_name doesn't contain bar. Fiddling around with the Lucene Index Toolbox, I think that this query does what I want: +item_type:xyz field_name:foo -field_name:bar Can someone explain to me why these queries return different results? A bit dense, but anyway: Anything prefixed with + is required. Anything not having + or - prefix is optional and only influences the score. In case there is nothing required by a + prefix, at least one of the things without prefix is required. Regards, Paul Elschot. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: can lucene be backed to have an update field
Chris, On Tuesday 09 November 2004 22:54, Chris Fraschetti wrote: Is it possible to modify the lucene source to create an updateDocument(doc#, FIELD, value) function ? It's possible, but an implementation would not be efficient when the field is indexed. The current index structure has no room to spare for insertions, and no provision for deleted terms. Some time ago an extra level was added in the index for skipping ahead more efficiently. Perhaps that could be combined with a gap for insertions. But when such a gap would fill up there would again be no choice but to delete and add the changed document. Also adding a document without optimizing is quite efficient already, so there is probably not much interest in adding such gaps. In case the field is stored only and the value would have the same length as the currently stored value it would be possible to replace the value efficiently. The only updates available are on the field norms. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: What is the difference between these searches?
On Tuesday 09 November 2004 23:14, Luke Francl wrote: On Tue, 2004-11-09 at 16:00, Paul Elschot wrote: Lucene has no provision for matching by being prohibited only. This can be achieved by indexing something for each document that can be used in queries to match always, combined with something prohibited in a query. But doing this is bad for performance for querying larger nrs of docs. I'm familiar with Lucene's restrictions on prohibited queries, and I have a required clause for a field that will always be part of the query (it's not a nonsense value, it's the item type of the object in a CMS). That might also be mapped to a filter. My problem is that I have been considering the whole query object that I've generated. Every BooleanQuery that's a part of my finished query must also have a required clause if it has a prohibited clause. I'm thinking of refactoring my code so that instead of joining together Query objects into a large BooleanQuery, it passes around BooleanClauses and assembles them into a single BooleanQuery. It may not be possible to flatten a boolean query to a single level, eg: (+aa +bb) (+cc +dd) +(a1 a2) +(b1 b2) These will generate nested BooleanQuery's iirc. Regards, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Search speed
On Monday 01 November 2004 21:02, Jeff Munson wrote: I'm looking for tips on speeding up searches since I am a relatively new user of Lucene. I've created a single index with 4.5 million documents. The index has about 22 fields and one of those fields is the contents of the body tag which can range from 5K to 35K. When I create the field (named contents) that houses the contents of the body tag, the field is stored, indexed, and tokenized. The term position vectors are not stored. Single word searches return pretty fast, but when I try phrases, searching seems to slow considerably. When constructing the query I am using the standard query object where analyzer is the StandardAnalyzer: Code Example: Query objQuery = QueryParser.parse(sSearchString, "contents", analyzer); For example, the following query, contents:Zanesville, returns over 163,000 hits in 78 milliseconds. However, if I use this query, contents:"all parts including picture tube guaranteed", it returns hits in 2890 milliseconds. Other phrases take longer as well. My question is, are there any indexing tips (storing term vectors?) or query tips that I can use to speed up the searching of phrases? Term vectors should not influence search times for phrases. What you're seeing is this: for each term in your query Lucene has to walk all the documents containing the term. For a single term there is no speed problem because the document set for the term is stored in a compact way on disk. For multiple terms with large document sets the disk head needs to move between the document sets of the terms because all sets need to be walked synchronously over the documents to compute the document scores. For phrases even more disk accesses are needed to access the term positions within the documents. Normally the disk head seeks are degrading the performance. One way to avoid the disk head seeks is to use fewer terms in the phrases. Another way is to avoid using the term positions by querying for words instead of phrases. In case you have hardware/resources there are more options like using faster disks and/or using RAM for critical parts of the index. Lucene can use extra RAM in various ways. To configure that one may have to do some java coding. Profiling can guide you there. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Search speed
On Tuesday 02 November 2004 17:50, Jeff Munson wrote: Thanks for the info Paul. The requirements of my search engine are that I need to search for phrases like "death notice" or "world war ii". You suggested that I break the phrases into words. Is there a way to break the phrases into words, do the search, and just return the documents with the phrase? I'm just looking for a way to speed up the phrase searches. If you know the phrases in advance, ie. before indexing, you can index and search them as terms with a special purpose analyzer. It's an unusual solution, though. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: When do document ids change
Justin, On Friday 29 October 2004 20:48, you wrote: Given an FSDirectory based index A. Documents are added to A with an IndexWriter minMergeDocs = 2 mergeFactor = 3 Documents are never deleted. Once the RAMDirectory merges documents to the index: a) will the documentID values for index A ever change? A document id may change after deleting a document that was added earlier than the document. Adding more docs may then change the id. Optimizing the index will then change the id. b) can a mapping between a term in the document and newly created documentID be made? Yes. See below on how. Why I am asking this question: I have a database with about 10M rows in it. My search engine needs to be able to quickly get all the rows back from the database that match a query. All the rows need to be returned at once, because the entire result set is sorted based on user input. Did you try IndexSearcher.search() or Searcher.search() with a Sort argument? What I want to do: When a documentID gets assigned to a document, I want to update the database row that matches the document field id with the lucene documentID. That way, I can use a HitCollector to gather just the documentID values from the search and insert them into a temporary cache table, then grab the matching rows from the database. This will work assuming the documentID values for the given document never change. It will work on the condition that documents are never (in the absolute sense) deleted from the lucene index, and that one never merges indexes. Currently, running an IndexSearcher.search() and getting all the rows back takes between 5 and 30 seconds for most queries, which is certainly not fast enough. The time it takes to collect the documentIDs however is less than 1 second. All the time is taken by calling hits.doc() for each document to get the id field to insert into the database. One can speed up retrieving data from Lucene indexes by retrieving in the order of docId, via indexReader.document(docId). Make sure no other threads are using the index at the same time. One can also store the Lucene files with the stored fields on another disk, but for that some coding is needed. You may have to implement your own HitCollector. Lucene does not guarantee that the hits are collected in order of docId, but the collecting order is normally not far off. So finally, will what I want to do work, and if so, how can I go It will work, but I would not recommend it. Just retrieve what you need from the Lucene index in the order of the docId's. Try and store as little data per document as possible. about updating the database when the documentID is created? To know the docId use an indexed primary key in lucene and search for it using IndexReader.termDocs(new Term(keyField, keyValue)). Regards, Paul Elschot. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
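A sketch of that docId lookup by primary key; the field name "id", the key value and the index path are assumptions:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    IndexReader reader = IndexReader.open("/path/to/index");
    TermDocs termDocs = reader.termDocs(new Term("id", "row-12345"));
    if (termDocs.next()) {
      int docId = termDocs.doc(); // the current Lucene doc number for that key
      // store docId with the database row here
    }
    termDocs.close();
    reader.close();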
Re: Need advice: what Word/Excel/PowerPoint lib to use?
At 17:05 25/10/2004, you wrote: of course POI, for open source. There are some commercial products based on POI also. for WORD consider textmining.org for XLS, POI does anything you need for powerpoint there is one commercial (it's about 1000$), but you can also find some source code in archives. And what do you think about using Open Office's UNO APIs ? If someone did, does it scale well ? (I just did some unit testing ) Jean-Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Need advice: what Word/Excel/PowerPoint lib to use?
At 19:42 25/10/2004, you wrote: At 17:05 25/10/2004, you wrote: of course POI, for open source. There are some commercial products based on POI also. for WORD consider textmining.org for XLS, POI does anything you need for powerpoint there is one commercial (it's about 1000$), but you can also find some source code in archives. And what do you think about using Open Office's UNO APIs ? I didn't know about them. Are they implemented in Java? Yes Check out http://api.openoffice.org/ , They have good examples, I can also provide you my small test. You can do some amazing things with their API. Do they support all MSOffice formats (97/2000/XP)? Check http://www.openoffice.org/product/docs/OOoFlyer11s.pdf Jean-Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
problems deleting documents / design question
Hi, I'm creating an index from several database tables. Every item within every table has a unique id which is saved in some kind of id-field, and the table name in another one. So together they form a unique identifier within the index. When deleting / updating an item I need to retrieve it. My first idea was indexreader.delete(new Term(id, id-value)); but this could delete several entries as id-value may appear in several tables. My second idea was to combine table name and id to form a kind of unique identifier, but this does not seem to be the right way as the problem may occur again with some sub-ids within a certain table. So my question is: is it possible to determine the item to be deleted by more than one term? thx, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: threading and indexing......
On Saturday 16 October 2004 02:14, Otis Gospodnetic wrote: If all 4 threads use the same instance of IndexWriter everything should be okay, as Lucene synchronizes the vital blocks. And on a single CPU with a single disk, using up to three threads even gives a bit of a speed up over one thread, 10-15% iirc. More threads were of no use for me in that case. Regards, Paul Elschot Otis --- Chris Fraschetti [EMAIL PROTECTED] wrote: if i have four threads all trying to call my index function, will Lucene do what is necessary for each thread to wait until the writer is available, or will the threads get an exception? -- ___ Chris Fraschetti, Student CompSci System Admin University of San Francisco e [EMAIL PROTECTED] | http://meteora.cs.usfca.edu - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
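As a sketch of what "the same instance" means in practice (Lucene 1.4-era Java; the index path and document contents are illustrative):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class SharedWriterDemo {
      public static void main(String[] args) throws Exception {
          // One writer, shared by all threads; addDocument() is safe here
          // because Lucene synchronizes the critical sections internally.
          final IndexWriter writer =
              new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
          Thread[] threads = new Thread[4];
          for (int i = 0; i < threads.length; i++) {
              final int id = i;
              threads[i] = new Thread() {
                  public void run() {
                      try {
                          for (int j = 0; j < 1000; j++) {
                              Document doc = new Document();
                              doc.add(Field.Text("body", "thread " + id + " doc " + j));
                              writer.addDocument(doc);
                          }
                      } catch (Exception e) {
                          e.printStackTrace();
                      }
                  }
              };
              threads[i].start();
          }
          for (int i = 0; i < threads.length; i++) {
              threads[i].join(); // wait for all threads before closing
          }
          writer.close();
      }
  }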
Re: sorting and score ordering
On Wednesday 13 October 2004 19:53, Chris Fraschetti wrote: Is there a way I can (without recompiling) ... make the score have priority and then my sort take effect when two results have the same rank? Along with that, is there a simple way to assign a new scorer to the searcher? So I can use the same Lucene algorithm for my hits, but tweak it a little to fit my needs? There is no one to one relationship between a searcher and a scorer. When a query consists e.g. of two terms, there will be three scorers executing the search for that query: one TermScorer for each term, and one scorer to combine the other two to provide the search results, usually a BooleanScorer or a ConjunctionScorer. For proximity queries, other scorers are used. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
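For the first question, no recompiling is needed in Lucene 1.4: a Sort with SortField.FIELD_SCORE first makes the score the primary criterion and any further fields tie-breakers. A sketch; the index path, query text, and "date" field are illustrative assumptions:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.Sort;
  import org.apache.lucene.search.SortField;

  public class ScoreThenField {
      public static void main(String[] args) throws Exception {
          IndexSearcher searcher = new IndexSearcher("/tmp/index");
          Query query = QueryParser.parse("some words", "body", new StandardAnalyzer());
          Sort sort = new Sort(new SortField[] {
              SortField.FIELD_SCORE,                   // primary: relevance
              new SortField("date", SortField.STRING)  // tie-breaker on equal score
          });
          Hits hits = searcher.search(query, sort);
          System.out.println(hits.length() + " hits");
          searcher.close();
      }
  }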
Re: Special field values
On Tuesday 12 October 2004 15:02, Otis Gospodnetic wrote: Hello Michael, This is something you'd have to code on your own. Otis --- Michael Hartmann [EMAIL PROTECTED] wrote: Hi everybody, I am thinking about extending the Lucene search with metadata in the following way: Field = Title, Value = (n1, n2, n3, ..., nm), where each ni is an element of {0,1} and m is the number of distinct metadata values for the title. Expressed in an informal way, I want to store a tuple of values in a field. The values in the tuple show whether a value is used in the title or not. A Lucene index can easily be used to determine whether or not a term is in a field of a document: IndexReader.open(indexName).termDocs(new Term(field, term)).skipTo(documentNr) returns the boolean indicating that. What do you need the {0,1} values for? Regards, Paul Elschot. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Special field values
On Tuesday 12 October 2004 19:27, Paul Elschot wrote: IndexReader.open(indexName).termDocs(new Term(field, term)).skipTo(documentNr) returns the boolean indicating that. Well, almost. When it returns true one still needs to check the TermDocs for being at the documentNr. Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
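Putting the correction in code form (a sketch, Lucene 1.4 API):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;

  public class FieldContains {
      // True iff the given term occurs in the given field of document documentNr.
      // skipTo() may stop on a later document, hence the extra doc() check.
      public static boolean fieldContains(IndexReader reader, String field,
                                          String term, int documentNr)
              throws IOException {
          TermDocs td = reader.termDocs(new Term(field, term));
          try {
              return td.skipTo(documentNr) && td.doc() == documentNr;
          } finally {
              td.close();
          }
      }
  }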
Re: How to pull document scoring values
Zia, On Tuesday 28 September 2004 21:22, you wrote: Hi, I'm trying to learn the scoring mechanism of Lucene. I want to fetch each parameter value individually, as they are collectively dumped out by Explanation. I've managed to pull out TF and IDF values using DefaultSimilarity and FilterIndexReader, but I'm not sure where to get the fieldNorm and queryNorm from. The norms are here: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#norms(java.lang.String) The resulting array is indexed by the document number for the IndexReader. With the default similarity, each norm is the inverse square root of the number of indexed terms in the document field. However, there are only 8 bits available to encode this value, so it's quite rough. The default queryNorm is here: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/DefaultSimilarity.html#queryNorm(float) There is an explanation of the scoring in the javadocs of Similarity. There has been some discussion on an idf factor that was missing from this documentation; I don't know whether the docs have been adapted for this. Also is there any reference about how normalisation has been implemented? See above: DefaultSimilarity is the default implementation of the Similarity interface. queryNorm() takes a sumOfSquaredWeights, where the weights are the term weights from the query. It returns the inverse square root. It may be that the sum of squared weights should be a sum of square rooted weights, and that queryNorm should return an inverse square then. I posted this on lucene-user on 20 September: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=10023 It's only a normalisation, so it doesn't affect the order of the search results much. Taking the square roots of the query term weights would have the query weights directly applied to the query term density in the document field, whereas now the weights seem to be applied to the square root of the density. The density value is an approximation, see above for the rough field norms. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How to pull document scoring values
On Wednesday 29 September 2004 15:41, Zia Syed wrote: Hi Paul, Thanks for your detailed reply! It really helped a lot. However, I am experiencing some conflicts. For one of the documents in the result set, when I use IndexReader fir = FilterIndexReader.open(index); byte[] fNorm = fir.norms("Body"); System.out.println("FNorm: " + fNorm[306]); Document d = fir.document(306); Field f = d.getField("Body"); System.out.println("Body: " + f.stringValue()); this gives me fNorm 113, whereas the total number of terms (including stop-words) is 42 in this particular field of the selected document. In the Explanation, fieldNorm(field=Body, doc=306) is 0.1562, which corresponds to approx. 41 terms for that field in that document. So the Explanation values make sense with the real data, when including all stop words like to, it, the, etc. So, my question is: am I getting the norm values from the right place? Yes, but the stored norms are encoded/decoded: byte Similarity.encodeNorm(float) float Similarity.decodeNorm(byte) Is there any way to find out the number of indexed terms for each document? By default, the stored norm is the inverse square root of the number of indexed terms of an indexed document field. The encoding/decoding is somewhat rough, though. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
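In code, reading and decoding a norm looks like this (a sketch; the index path and the doc number 306 from the mail are illustrative):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Similarity;

  public class NormDemo {
      public static void main(String[] args) throws Exception {
          IndexReader reader = IndexReader.open("/tmp/index");
          byte[] norms = reader.norms("Body");
          float fieldNorm = Similarity.decodeNorm(norms[306]);
          // With the default similarity, fieldNorm ~= 1/sqrt(#indexed terms),
          // so the term count can be estimated back, roughly:
          double approxTerms = 1.0 / (fieldNorm * fieldNorm);
          System.out.println("fieldNorm=" + fieldNorm + ", ~terms=" + approxTerms);
          reader.close();
      }
  }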
Re: PHP and Lucene
Erik Hatcher wrote: On Sep 15, 2004, at 1:45 PM, Karthik N S wrote: 1) Is there a PHP version of a Lucene implementation available, and if so, where? Using the Java version of Lucene from PHP is my recommendation. There is not a PHP version. I'm not familiar with PHP details, but I suspect you can very easily use the Java version somehow. A bit tardy, but I was in-between versions, hence wanted to wait until I had posted the new ones up. We have developed a Java-based daemon we call Luceneserver, which listens on a port and understands either of two text protocols, one line-based and one XML. This allows people to set up a server box centrally, and then use PHP, Perl, Java, or whatever to index/search a central Lucene repository pretty easily. It has been designed such that you can partition off separate domains (e.g. websites) within the same index, if you wish. In particular we've also developed a family of PHP classes to talk to the above via the XML protocol, included in an open-source web development platform we call Axyl. Taken together, all of this might (or might not) be of some use to the original poster, as a starting point, or just for ideas. Version 2.1.1-1 of Axyl and Axyl-Lucene is available at: http://sourceforge.net/projects/axyl Cheers, Paul. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: WildCardQuery
On Tuesday 21 September 2004 06:50, Raju, Robinson (Cognizant) wrote: Is there a limitation in Lucene when it comes to wildcard search? Is it a problem if we use less than 3 characters along with a wildcard (*)? It gives me an error if I try using 45*, *34, *3, etc.: a Too Many Clauses error. It doesn't happen if '?' is used instead of '*'. The intriguing thing is that it is not consistent: 00* doesn't fail. Am I missing something? The number of clauses added to the query equals the number of indexed terms that match the wildcard. As each clause ends up using some buffer memory internally, a maximum was introduced to avoid running out of memory. You can change the maximum number of added clauses using BooleanQuery.setMaxClauseCount(), but then it is advisable to monitor memory usage, and if needed increase the heap space for the JVM. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
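For example (the value 4096 is an arbitrary illustration, not a recommendation):

  import org.apache.lucene.search.BooleanQuery;

  public class RaiseClauseLimit {
      public static void main(String[] args) {
          // The default maximum is 1024; raising it trades heap memory
          // for wider wildcard expansions, so watch JVM memory usage.
          BooleanQuery.setMaxClauseCount(4096);
          System.out.println("max clauses: " + BooleanQuery.getMaxClauseCount());
      }
  }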
Re: displaying 'pages' of search results...
On Tuesday 21 September 2004 21:33, Chris Fraschetti wrote: I was wondering what the best way was to go about returning say 1,000,000 results, divided up into say 50 element sections, and then accessing them via the first 50, second 50, etc. Is there a way to keep the query around so that Lucene doesn't need to search again, or would the search be cached and no delay arise? Just looking for some ideas and possibly some implementation issues... Lucene's Hits class is designed for paging through search results. In which order would you need the 1,000,000 results? Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
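In code, paging with Hits is just index arithmetic (a sketch; the "title" field is an illustrative assumption). Hits fetches and caches results lazily as you walk further, so keep the searcher and the Hits object around between page requests when possible:

  import java.io.IOException;
  import org.apache.lucene.search.Hits;

  public class HitsPager {
      // Print one page of results, e.g. page 0 = hits 0..49 for pageSize 50.
      public static void printPage(Hits hits, int page, int pageSize)
              throws IOException {
          int start = page * pageSize;
          int end = Math.min(start + pageSize, hits.length());
          for (int i = start; i < end; i++) {
              System.out.println(hits.score(i) + "\t" + hits.doc(i).get("title"));
          }
      }
  }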
Re: Too many boolean clauses
On Monday 20 September 2004 18:27, Shawn Konopinsky wrote: Hello There, Due to the fact that the [# TO #] range search works lexicographically, I am forced to build a rather large boolean query to get range data from my index. I have an ID field that contains about 500,000 unique ids. If I want to query all records with ids [1-2000], I build a boolean query containing all the numbers in the range, e.g. id:(1 2 3 ... 1999 2000). The problem with this is that I get the following error: org.apache.lucene.queryParser.ParseException: Too many boolean clauses. Any ideas on how I might circumvent this issue by either finding a way to rewrite the query, or avoid the error? You can use this as an example: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/DateFilter.java (Just click view on the latest version to see the code), and iterate over your doc ids instead of over dates. This will give you a filter for the doc ids you want to query. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Similarity scores: tf(), lengthNorm(), sumOfSquaredWeights().
After last week's discussion on idf() of the similarity score computation I looked into the score computation a bit deeper. In the DefaultSimilarity tf() is the sqrt() and lengthNorm() is the inverse of sqrt(). That means that the factor (docTf * docNorm) actually implements the square root of the density of the query term in the document field (ignoring the encoding and decoding of the norm). Summing these weighted square roots resembles a Salton OR p-Norm for p = 1/2, except that Salton defined the p-Norms for p >= 1, and the result is more like an AND p-Norm because it depends mostly on the minimum argument. The p-Norm also requires that the sum is taken to the power 1/p, but this is not necessary as it would not change the ranking. I looked around for p-Norms with 0 < p < 1, but I didn't find anything. Is there really nothing about this? A good discussion is here: http://elvis.slis.indiana.edu/irpub/SIGIR/1994/cite19.htm I would guess that since the sqrt() has an infinite derivative at zero, it might well be that this OR p-Norm for p = 1/2 behaves much like a rather high power AND p-Norm. The basic summing form of the OR p-Norm also allows a very easy implementation by just summing the weighted square roots; an AND p-Norm for p >= 1 would have needed some more calculations. Is this perhaps one of the reasons for using a power p < 1? Taking this a bit further, I also wonder about the name of sumOfSquaredWeights() in the Weight interface. Shouldn't that rather be sumOfPowerWeights() and by default implement a sum of square roots? This would allow a more straightforward comprehension of the term weights as directly weighing the term densities. Section 5 of the reference above has the full weighted p-Norm formulas. The OR p-Norm there is very close to the Lucene formula without coord(). Regards, Paul Elschot On Tuesday 14 September 2004 23:49, Doug Cutting wrote: Your analysis sounds correct. At base, a weight is a normalized tf*idf. So a document weight is: docTf * idf * docNorm and a query weight is: queryTf * idf * queryNorm where queryTf is always one. So the product of these is (docTf * idf * docNorm) * (idf * queryNorm), which indeed contains idf twice. I think the best documentation fix would be to add another idf(t) clause at the end of the formula, next to queryNorm(q), so this is clear. Does that sound right to you? Doug Ken McCracken wrote: Hi, I was looking through the score computation when running search, and think there may be a discrepancy between what is _documented_ in the org.apache.lucene.search.Similarity class overview Javadocs, and what actually occurs in the code. I believe the problem is only with the documentation. I'm pretty sure that there should be an idf^2 in the sum. Look at org.apache.lucene.search.TermQuery, the inner class TermWeight. You can see that first sumOfSquaredWeights() is called, followed by normalize(), during search. Further, the resulting value stored in the field value is set as the weightValue on the TermScorer. If we look at what happens to TermWeight, sumOfSquaredWeights() sets queryWeight to idf * boost. During normalize(), queryWeight is multiplied by the query norm, and value is set to queryWeight * idf == idf * boost * query norm * idf == idf^2 * boost * query norm. This becomes the weightValue in the TermScorer that is then used to multiply with the appropriate tf, etc., values. The remaining terms in the Similarity description are properly appended. 
I also see that the queryNorm effectively cancels out (dimensionally, since it is 1 / (the square root of a sum of squares of idfs)) one of the idfs, so the formula still ends up being roughly a TF-IDF formula. But the idf^2 should still be there, along with the expansion of queryNorm. Am I mistaken, or is the documentation off? Thanks for your help, -Ken - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
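Putting the pieces of this thread together, the effective per-term contribution discussed above can be written out as follows (a reconstruction from the quotes above, not the official Similarity javadoc text):

  \mathrm{score}(q,d) = \sum_{t \in q} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^{2} \cdot \mathrm{boost}(t) \cdot \mathrm{queryNorm}(q) \cdot \mathrm{norm}(d)

  \mathrm{queryNorm}(q) = \frac{1}{\sqrt{\sum_{t \in q} \bigl( \mathrm{idf}(t) \cdot \mathrm{boost}(t) \bigr)^{2}}}

The idf(t) appears squared, and queryNorm(q), carrying one idf per term in its denominator, dimensionally cancels one of the two, which is why the result still behaves like a TF-IDF formula.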
Re: Too many boolean clauses
On Monday 20 September 2004 20:54, Shawn Konopinsky wrote: Hey Paul, Thanks for the quick reply. Excuse my ignorance, but what do I do with the generated BitSet? You can return it in the bits() method of the object implementing your org.apache.lucene.search.Filter (http://jakarta.apache.org/lucene/docs/api/index.html). Then pass the Filter to IndexSearcher.search() with the query. Regards, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
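A sketch of such a Filter for the id range from the original question (Lucene 1.4 API; the field name "id" follows the thread, the range bounds are the poster's example):

  import java.io.IOException;
  import java.util.BitSet;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;
  import org.apache.lucene.search.Filter;

  public class IdRangeFilter extends Filter {
      private final int lo, hi;

      public IdRangeFilter(int lo, int hi) {
          this.lo = lo;
          this.hi = hi;
      }

      public BitSet bits(IndexReader reader) throws IOException {
          BitSet bits = new BitSet(reader.maxDoc());
          // Walking the ids numerically is simplest; for minimal disk seeks
          // the terms would ideally be visited in their sorted (lexicographic)
          // order, which matches numeric order only for zero-padded ids.
          for (int id = lo; id <= hi; id++) {
              TermDocs td = reader.termDocs(new Term("id", String.valueOf(id)));
              while (td.next()) {
                  bits.set(td.doc());
              }
              td.close();
          }
          return bits;
      }
  }

  // Usage: Hits hits = searcher.search(query, new IdRangeFilter(1, 2000));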
Re: Too many boolean clauses
On Monday 20 September 2004 20:54, Shawn Konopinsky wrote: Hey Paul, ... Also - we are using a pooling feature which contains a pool of IndexSearchers that are used and tossed back each time we need to search. I'd hate to have to work around this and open up an IndexReader for this particular search, where all other searches use the pool. Suggestions? You could use a map from the IndexSearcher back to the IndexReader that was used to create it. (It's a bit of a waste because the IndexSearcher has a reader attribute internally.) Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
problem with locks when updating the data of a previously stored document
Hi, Using lucene-1.4.1.jar on WinXP I am having trouble with locking and updating an existing Lucene document. I delete the old document from the index and then add the new document to the index writer. I am using minMergeDocs set to 100 (much quicker!!) and close the writer once the batch is done, so the documents are flushed to the filesystem. The problem I am having is that I can't delete the old version of the document (after the first document has been added) using reader.delete(), because there is a lock on the index due to the IndexWriter being open. Am I doing this wrong, or is there a simple way round this? Regards, Paul Code snippets of the update code (I have just cut and pasted the relevant lines from my app to give an idea): reader = IndexReader.open(location); // Delete old doc/term if present if (reader.docFreq(docNumberTerm) > 0) { reader.delete(docNumberTerm); . . . IndexWriter writer = null; // get the writer from the hash table so the last few are cached and don't have to be restarted synchronized(IndexWriterCache) { String dbstring = "" + ldb; writer = (IndexWriter)IndexWriterCache.get(dbstring); if (writer == null) { // Not in cache, so create one and add it to the cache for next time writer = new IndexWriter(location, new StandardAnalyzer(), new_index); writer.setUseCompoundFile(true); // Set the maximum number of entries per field. Default is 10,000 writer.maxFieldLength = MaxFieldCount; // Set how many docs will be stored in memory before being saved to disk writer.minMergeDocs = (int) DocsInMemory; IndexWriterCache.remove(dbstring); IndexWriterCache.put(dbstring, writer); } . . . // Add the documents to the Lucene index writer.addDocument(doc); . . Some time later, after a batch of docs has been added: writer.close(); - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
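The usual workaround is to batch the work in two phases, so that the reader and the writer never hold the write lock at the same time; a sketch (Lucene 1.4 API, all names illustrative):

  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;

  public class BatchUpdater {
      public static void updateBatch(String location, Term[] keys, Document[] docs)
              throws IOException {
          // Phase 1: delete the old versions; no IndexWriter may be open here.
          IndexReader reader = IndexReader.open(location);
          try {
              for (int i = 0; i < keys.length; i++) {
                  reader.delete(keys[i]);
              }
          } finally {
              reader.close();
          }
          // Phase 2: add the new versions; the reader has released the lock.
          IndexWriter writer = new IndexWriter(location, new StandardAnalyzer(), false);
          try {
              for (int i = 0; i < docs.length; i++) {
                  writer.addDocument(docs[i]);
              }
          } finally {
              writer.close();
          }
      }
  }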
Re: Build problems
Danny, On Friday 03 September 2004 20:53, [EMAIL PROTECTED] wrote: I'm trying to build Lucene with ant (in XP) from the prompt. I got the ant-optional.jar from http://archive.apache.org/dist/ant/binaries/ because I couldn't find it anywhere else. I'm running the newest version of ant, and when I go into the Lucene base directory and type 'ant' it finds the build.xml file but then gives the following error: BUILD FAILED C:\lucene\build.xml:140: srcdir C:\lucene\src\java does not exist! The src/java directory normally contains the java source files. Since that directory doesn't exist, you may want to create it by installing the sources, e.g. by checking out from cvs, or from a jar that contains the java sources here: http://dist.apache.easynet.nl/jakarta/lucene/source/ Lucene 1.4.1 is out, but it's not available there yet. In case you want that version, please ask on lucene-dev. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Using 2nd Index to constrain Search
On Friday 27 August 2004 20:10, Mike Upshon wrote: Hi, Just starting to evaluate Lucene and hoping someone can answer this question. I am looking at using Lucene to index a very large database. There is a documents table and a few other tables that define which users can view which documents. My question is, is it possible to have an index of the The normal way of doing that is to: - make a list of all doc id's for the user. - from this list construct a Filter for use in the full text index. Sort the doc id's, use an IndexReader on the full text index, construct a Term for each doc id, walk the termDocs() for the Term, and set a bit in the filter to allow the document number for the doc id. - keep this filter to restrict the searches for the user by IndexSearcher.search(Query, Filter) - rebuild the filter when the doc id's for the user change, or when the full text index changes (a document deletion followed by an optimize or an add can change any other document's number). Hmm, this is getting to be a FAQ. full text contents of the documents and another index that contains the document id's and the user id's, and then use the 2nd index to qualify the full text search over the documents table. The reason I want to do this is to reduce the number of documents that the full text query will run over. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Question concerning speed of Lucene.....
Oliver, On Friday 27 August 2004 22:20, you wrote: Hi, I guess this is one of the most often asked questions on this mailing list, but hopefully my question is more specific, so that I can get some input from you. My project is to implement an agency system for newspapers. So I have to handle about 30 days of text and IPTC data. The latter is taken from images provided by the agencies. I basically get a constant stream of text messages from the agencies (roughly 2000 per day per agency) and images (roughly 1000 per day per agency). I have to deal with 4 text and 6 image agencies. So my daily input is 8000 text messages and 6000 images. The extracted documents from these text messages and images have a size of about 1kb. The extraction of the data and converting them to Document objects is already finished, and the search using Lucene works like a charm. Brilliant software! But now to my questions. In order to understand what I am doing, I'd like to talk a little about the kind of queries and data I have to deal with. * Every message has a priority. An integer value ranging from 1 to 6. * Every message has a receive date. * Every message has an agency assigned, basically a unique string identifier for it. * Every message has some header data that is also indexed for refined searches. * And of course the actual text included in the text message itself or the IPTC header of an image. Typically I have two kinds of queries. * Typical relational queries * Show every text message from a certain agency in the last X days. Probably good for a date filter, see the wiki on RangeQuery, and possibly my previous message on filters (Using 2nd Index to constrain Search). Lucene has no facilities for primary keys, so that is up to you. * Show every image or text message with a higher priority than Y and from a certain period of time. RangeQuery again for the priority. One can store images in Lucene, but currently only in String format, i.e. they'll need some conversion. There was some talk on binary objects not too long ago, but that is still in development. I'd probably store the images in a file system or in another db for now. OTOH, if you're willing to help storing binary images, lucene-dev is nearby. * Fulltext search Yes :) * A real fulltext search over all elements using the full power of Lucene's query language. Span queries are currently not supported by the query language, you might have a look at the org.apache.lucene.search.spans package. It is absolutely no question anymore that the latter queries will be done using Lucene. But the first type of query is the thing I am thinking about. Can this be done efficiently with Lucene? So far we use a system Lucene can be as fast as relational databases, provided your lower level java code on IndexReader plays nice with system resources like disk heads and RAM. That means using filters, sorting on index order before using an index, and possibly sorting on document number before retrieving stored fields. Lucene's IndexSearcher for searching text queries is quite well behaved in that respect. that uses a SQL database engine for storing the relevant data and is used in these queries. But if Lucene is fast enough with these queries too, I am willing to skip the SQL database altogether. But I have to remind you that I will be indexing about 400,000 messages per month. To easily keep the primary keys in sync between the SQL db and Lucene, I'd start by keeping the images and the full text only in the SQL db. 
Lucene optimisations (needed after adding/deleting docs) copy all data so it pays to keep the Lucene indexes small. Later you might need multiple indexes, MultiSearcher, and occasionally a merge of the indexes. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
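Since range search is lexicographic, numeric fields work best when zero-padded to a fixed width at index time; a sketch of the two range queries mentioned above (field names and date format are illustrative assumptions):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.RangeQuery;

  public class AgencyQueries {
      public static void main(String[] args) {
          // Priority 1..6 is a single digit, so no padding is needed.
          RangeQuery prio = new RangeQuery(
              new Term("priority", "3"), new Term("priority", "6"), true);
          // Dates indexed as "yyyyMMdd" sort lexicographically like dates.
          RangeQuery period = new RangeQuery(
              new Term("received", "20040801"), new Term("received", "20040827"), true);
          System.out.println(prio + " / " + period);
      }
  }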
Re: How not to show results with the same score?
On Wednesday 25 August 2004 12:21, B. Grimm [Eastbeam GmbH] wrote: hi there, I browsed through the list and tried some different searches, but I couldn't find what I'm looking for. I have an index which is generated by a bot collecting websites. There are sites like www.domain.de/article/1 and www.domain.de/article/1?page=1. These different urls have the same content, and when you search for a matching word, both are returned, which is correct. They have exactly the same score because of their content and so on, so I would like to know if it's possible to group by the returned score (as in MySQL, of course), so that only the first match is collected into Hits and all following matches with the same score are ignored. It would be great if anyone has an idea how to do that. You can implement your own HitCollector and pass it to IndexSearcher.search(). Have a look at the javadocs of the org.apache.lucene.search package, it's quite straightforward. The PriorityQueue from the util package is useful to collect results. For every distinct score you could store an int[] of document nrs in there while collecting the hits. Basically you'll end up implementing your own Hits class. For URL's that have the same content, it's better to store multiple URL's for the same document. However, this merging is normally done by a crawler, because the same contents means the same outgoing URL's. Crawlers also keep track of multiple host names resolving to the same IP address. In case you need to crawl and index an intranet or more, have a look at Nutch. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
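A sketch of such a collector (Lucene 1.4 API): bucket the document numbers by score while collecting, then keep e.g. the first document of each bucket afterwards.

  import java.util.ArrayList;
  import java.util.HashMap;
  import org.apache.lucene.search.HitCollector;

  public class ScoreBucketCollector extends HitCollector {
      // Maps Float(score) -> ArrayList of Integer document numbers.
      public final HashMap byScore = new HashMap();

      public void collect(int doc, float score) {
          Float key = new Float(score);
          ArrayList bucket = (ArrayList) byScore.get(key);
          if (bucket == null) {
              bucket = new ArrayList();
              byScore.put(key, bucket);
          }
          bucket.add(new Integer(doc));
      }
  }

  // Usage: searcher.search(query, collector); afterwards sort the score keys
  // descending and take one document number per bucket.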
Re: Index Size
On Wednesday 18 August 2004 22:44, Rob Jose wrote: Hello, I have indexed several thousand (52 to be exact) text files and I keep running out of disk space to store the indexes. The size of the documents I have indexed is around 2.5 GB. The size of the Lucene indexes is around 287 GB. Does this seem correct? I am not storing the contents of the As noted, one would expect the index size to be about 35% of the original text, i.e. about 2.5 GB * 35% = 800 MB. That is two orders of magnitude off from what you have. Could you provide some more information about the field structure, i.e. how many fields, which fields are stored, which fields are indexed, any use of non-standard analyzers, and any non-standard Lucene settings? You might also try changing to the non-compound format to have a look at the sizes of the individual index files; see the file formats page on the Lucene web site. You can then see the total disk size of, for example, the stored fields. Regards, Paul Elschot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: PDFBox Issue
What version of the log4j jar are you using? -Original Message- From: Don Vaillancourt [mailto:[EMAIL PROTECTED] Sent: Tuesday, June 29, 2004 8:06 AM To: Lucene Users List Subject: PDFBox Issue Hi all, I know that this is a Lucene list, but wanted to know if any of you have gotten this error before using PDFBox? I've gotten the latest version of PDFBox and it is giving me the following error: java.lang.VerifyError: (class: org/apache/log4j/LogManager, method: <clinit> signature: ()V) Incompatible argument to function at org.apache.log4j.Logger.getLogger(Logger.java:94) at org.pdfbox.pdfparser.PDFParser.<clinit>(PDFParser.java:57) at org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDocument.java:197) at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:118) at Index.indexFile(Index.java:287) at Index.indexDirectory(Index.java:265) at Index.update(Index.java:63) at Lucene.main(Lucene.java:26) Exception in thread main I am using all the jar files that came with PDFBox. Anyone run into this problem? I am using the following line of code: Document doc = LucenePDFDocument.getDocument(f); Thanks Don Vaillancourt Director of Software Development WEB IMPACT INC. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: PDFBox Issue
I actually thought it might have been trying to use the log4j 1.3 'alpha' build (there is no 'alpha' build yet, but notionally the latest HEAD isn't too far from it). There has been a subtle change to log4j in recent months that could have a similar impact. Cheers, Paul Smith -Original Message- From: Ben Litchfield [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 17, 2004 10:48 PM To: Lucene Users List Subject: Re: PDFBox Issue PDFBox comes with log4j version 1.2.5 (according to MANIFEST.MF in the jar file); I believe that 1.2.8 is the latest. I will make sure that the next version of PDFBox includes the latest log4j version, which I assume is what everybody would like to use. But, by looking at the below error message, it appears that you might have an older log4j in your classpath. Logger.getLogger(Class) is available in 1.2.5 and 1.2.8. Ben On Tue, 17 Aug 2004, Don Vaillancourt wrote: Wow, this is an old message. I managed to get my code to work by using the previous version of PDFBox. I had used the version of log4j that had come with PDFBox. Someone had mentioned recompiling log4j, but I couldn't get the project to import the source into Eclipse, so I gave up. But things work great with the version of PDFBox that I compiled with, so I am fine with that. As for the version of log4j, I could not tell you; as I said above, it came with PDFBox, so I'm guessing that it had probably not been tested with the version of log4j it was being distributed with. Paul Smith wrote: What version of the log4j jar are you using? -Original Message- From: Don Vaillancourt [mailto:[EMAIL PROTECTED] Sent: Tuesday, June 29, 2004 8:06 AM To: Lucene Users List Subject: PDFBox Issue Hi all, I know that this is a Lucene list, but wanted to know if any of you have gotten this error before using PDFBox? I've gotten the latest version of PDFBox and it is giving me the following error: java.lang.VerifyError: (class: org/apache/log4j/LogManager, method: <clinit> signature: ()V) Incompatible argument to function at org.apache.log4j.Logger.getLogger(Logger.java:94) at org.pdfbox.pdfparser.PDFParser.<clinit>(PDFParser.java:57) at org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDocument.java:197) at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:118) at Index.indexFile(Index.java:287) at Index.indexDirectory(Index.java:265) at Index.update(Index.java:63) at Lucene.main(Lucene.java:26) Exception in thread main I am using all the jar files that came with PDFBox. Anyone run into this problem? I am using the following line of code: Document doc = LucenePDFDocument.getDocument(f); Thanks Don Vaillancourt Director of Software Development WEB IMPACT INC.
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Performance when computing a filter using hundreds of different terms.
Kevin, On Thursday 05 August 2004 23:32, Kevin A. Burton wrote: I'm trying to compute a filter to match documents in our index by a set of terms. For example some documents have a given field 'category', so I need to compute a filter with multiple categories. The problem is that our category list is 200 items, so it takes about 80 seconds to compute. We cache it of course, but this seems WAY too slow. Is there anything I could do to speed it up? Maybe run the queries myself and then combine the bitsets? That would be a first step. We're using a BooleanQuery with nested TermQueries to build up the filter... I suppose that is a BooleanQuery with all terms optional? Depending on the number of docs in the index and the distribution of the categories over the documents, that might lead to a lot of disk head movements. Recently some code was posted to compute a filter for date ranges. For each date (i.e. Term) in the range it would walk all documents and set the corresponding bit in a bitset. You can use the same approach. See IndexReader.termDocs(Term) for starters, and preferably iterate over the categories (Terms) in sorted order. A BooleanQuery would do much the same thing, but it has to work in document order for all Terms at the same time, which can cause extra disk seeks between the TermDocs. You can avoid those disk seeks by iterating over the TermDocs yourself and keeping the results in the bitset. If you do this with sorted terms, ideally the disk head would move in a single direction for the whole process. For maximum performance you might want to avoid searching other Queries or similar TermDoc iterators at the same time. Also avoid retrieving documents while this is going on; just keep that disk head moving only where you want it to. For further CPU speedup you can cache the TermDocs using the read() method. Lucene's TermScorer does this, see http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/TermScorer.java and use 'view' on the latest revision. A bigger cache size than 32 would seem appropriate for your case. Could you report the speedup? I guess you should be able to bring it down to at most twenty seconds or so. After that, replication over multiple disks might help, giving each of them an interval of the sorted categories to search. Good luck, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
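A sketch of that approach (Lucene 1.4 API; the field name "category" follows the mail, and the buffer size of 512 is a guess at "bigger than 32"):

  import java.io.IOException;
  import java.util.Arrays;
  import java.util.BitSet;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermDocs;

  public class CategoryFilterBits {
      public static BitSet categoryBits(IndexReader reader, String[] categories)
              throws IOException {
          Arrays.sort(categories); // visit terms in sorted order: one disk sweep
          BitSet bits = new BitSet(reader.maxDoc());
          int[] docs = new int[512];
          int[] freqs = new int[512];
          TermDocs td = reader.termDocs();
          try {
              for (int i = 0; i < categories.length; i++) {
                  td.seek(new Term("category", categories[i]));
                  int n;
                  // Bulk-read the postings instead of looping next()/doc().
                  while ((n = td.read(docs, freqs)) > 0) {
                      for (int j = 0; j < n; j++) {
                          bits.set(docs[j]);
                      }
                  }
              }
          } finally {
              td.close();
          }
          return bits;
      }
  }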
Re: Question on number of fields in a document.
On Wednesday 04 August 2004 18:22, John Z wrote: Hi, I had a question related to the number of fields in a document. Is there any limit to the number of fields you can have in an index? We have around 25-30 fields per document at present; about 6 are keywords, around 6 are stored but not indexed, and the rest of them are text fields, analyzed and indexed. We are planning on adding around 24 more fields, mostly keywords. Does anyone see any issues with this? Impact on search or indexing? During search, one byte of RAM is needed per searched field per document for the normalisation factors, even if a document field is empty. This RAM is occupied the first time a field is searched after opening an index reader. Supposing your queries would actually search 50 fields before closing the index reader, the norms would occupy 50 bytes/doc, or 1 GB per 20M docs. Regards, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: pdfbox performance.
The first thing that I would do is wrap the FileInputStream with a BufferedInputStream. Change: FileInputStream reader = new FileInputStream(file); To: InputStream reader = new BufferedInputStream(new FileInputStream(file)); You get a significant boost reading in from a buffer, particularly as the size of the file grows. Try that first, and then rebenchmark. Cheers Paul Smith -Original Message- From: Miroslaw Milewski [mailto:[EMAIL PROTECTED] Sent: Thursday, July 29, 2004 7:24 AM To: [EMAIL PROTECTED] Subject: pdfbox performance. Hi, I have a serious performance problem while extracting text from pdf. Here is the code (w/o try/catch blocks): File file = new File("test.pdf"); FileInputStream reader = new FileInputStream(file); PDFParser parser = new PDFParser(reader); parser.parse(); PDDocument pdDoc = parser.getPDDocument(); PDFTextStripper stripper = new PDFTextStripper(); String pdftext = stripper.getText(pdDoc); pdDoc.close(); Now, the whole process takes: - 37.4 sec with a 74 kB file (parsing took 5.3 sec.) - 156.7 sec with a 150 kB file (parsing: 11.0 sec.) - 157.8 sec with a 270 kB file (parsing: 34.3 sec.) - 313.3 sec with a 151 kB file (parsing: 5.9 sec.) Now, I can't really get the point here. Is this performance standard for pdfbox? Or is it my system (win2k, PIII 700, 512 RAM), or the code, or maybe the pdf docs (text only, the last one with some UML diagrams)? I am writing a knowledge base system at the moment, and planned to do real-time text extraction and indexing (using Lucene). But this is not realistic, considering the extraction time. Then maybe it is a better idea to run the extraction and indexing once every 24 h, processing all the documents added during that period. TIA for any comments/suggestions. -- Miroslaw Milewski - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Rebuild and corruption
Steve Rajavuori wrote: I have two questions. 1) Can anyone recommend the best way to avoid any possibility of corruption in the case where an IndexWriter doesn't get closed properly? (It seems that termination during a merge operation is the most vulnerable point.) 2) Is there any way to recover a corrupted index, other than rebuilding from scratch? I am also extremely interested in any answers to these questions. Cheers, Paul. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Caching of TermDocs
On Monday 26 July 2004 21:41, John Patterson wrote: Is there any way to cache TermDocs? Is this a good idea? Lucene does this internally by buffering up to 32 document numbers in advance for a query Term. You can view the details here in case you're interested: http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/TermScorer.java It uses the TermDocs.read() method to fill a buffer of document numbers. Is this what you had in mind? Regards, Paul - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]