Re: Question about proximity searching and wildcards
Mariella Di Giacomo writes:
> Hello,
>
> We are using Lucene to index scientific articles.
> We are also using Luke to verify the fields and values we index.
>
> One of the fields we index is the author field that consists of the authors
> that have written the scientific article (an example of such data is shown
> at the bottom of the email).
>
> The most common search on the author field is the following:
>
> "find all the authors whose last name starts with Cole and the first name
> starts with S"
>
> We thought of a proximity search (we want to make sure we take the first
> name and not the middle name/initial) similar like that

The query parser cannot do that.

> "Author:cole* S*"~1

In that case you cannot expand the wildcard terms.

> "Author:cole* AND Author:S*"~1

You cannot mix boolean queries and proximity queries.

The closest thing to your query is a phrase prefix query, but that's designed
for something like 'Cole S*', not 'Cole* S*'. Searching for 'Cole* S*' means
searching for all combinations of possible expansions of Cole and S. You can
do that by expanding the terms yourself, but I'd expect that a) to be slow and
b) to create trouble with the maximum number of boolean terms (or memory
usage). Given that there are 10 expansions of Cole and 500 of S (that's not
just first names, that's all names), you have to do 5000 proximity searches.

> If Luke cannot deal with that, when writing the query through the Java
> application, which would be the query to be provided to get what is
> expected? Do we need to use a query filter?

I would use different fields for first and last name in this case. And if
it's relevant to search for the first character of the first name, I'd index
that additionally.

HTH

Morus
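A minimal sketch of the separate-fields approach Morus describes, assuming
Lucene 1.4-era APIs; the index path, field names, and author values are only
illustrative, and it indexes one document per (article, author) pair so the
last name and first initial always belong to the same person:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;

public class AuthorPrefixDemo {
    public static void main(String[] args) throws Exception {
        // Indexing: one document per (article, author) pair.
        IndexWriter writer = new IndexWriter("/tmp/authors", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Keyword("lastname", "coleman"));      // lowercased, not analyzed
        doc.add(Field.Keyword("firstname", "s.s."));
        doc.add(Field.Keyword("firstinitial", "s"));        // indexed additionally, as suggested
        writer.addDocument(doc);
        writer.optimize();
        writer.close();

        // Searching: "last name starts with Cole, first name starts with S"
        // becomes two required prefix queries; no proximity is needed.
        IndexSearcher searcher = new IndexSearcher("/tmp/authors");
        BooleanQuery query = new BooleanQuery();
        query.add(new PrefixQuery(new Term("lastname", "cole")), true, false);
        query.add(new PrefixQuery(new Term("firstinitial", "s")), true, false);
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " matching author entries");
        searcher.close();
    }
}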
TermPositionVector
Hi,

I am adding a field to a document in the index as follows:

    doc.add(new Field("contents", reader, Field.TermVector.WITH_POSITIONS));

Later, I query the index and get the document id of this document. The
following code, however, prints "false":

    TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
    System.out.println("Is a TermPositionVector " + (tfv instanceof TermPositionVector));

Using Field.TermVector.WITH_POSITIONS_OFFSETS while creating the field also
produces the same result. Can someone tell me why this is happening?

Thanks,
Siddharth
Re: sharing lock files on multiple computers
On Jan 18, 2005, at 8:09 PM, Chris Hostetter wrote:

: > ...which prompts me to wonder, how do people do this (ie: configure
: > lockDir such that processes on separate physical computers respect
: > each other's locks) without using NFS?
:
: My question is: Given the assertion that it's not safe to keep lock files
: on an NFS partition, what mechanism do/would/should people use to enable
: two applications running on separate physical machines to use the same
: lock file directory?

I don't have experience with NFS, but the issue has cropped up numerous times
on this e-mail list, and the general advice is "don't use Lucene on NFS
drives, period", and that is why we provided that same advice in LIA.
However, I'm admittedly unknowledgeable about the reason the problem exists.

: (this question is based on the understanding that unless the applications
: are sharing the same lock directory, an index may be corrupted by
: concurrent modifications, correct?)

Right - concurrent writes to the index could cause trouble.

Erik
Re: sharing lock files on multiple computers
: > ...which prompts me to wonder, how do people do this (ie: configure
: > lockDir such that processes on separate physical computers respect
: > each other's locks) without using NFS?
:
: There is a system property that controls where the lock files are written:
:
: http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-59be30838bbb5692e605384b5f4c2f224f3dfa6f

Um, yeah. I'm actually the one that added that FAQ answer last week :)

My question is: Given the assertion that it's not safe to keep lock files on
an NFS partition, what mechanism do/would/should people use to enable two
applications running on separate physical machines to use the same lock file
directory?

(this question is based on the understanding that unless the applications are
sharing the same lock directory, an index may be corrupted by concurrent
modifications, correct?)

-Hoss
Re: sharing lock files on multiple computers
On Jan 18, 2005, at 6:51 PM, Chris Hostetter wrote:

> that said, the same paragraph of LIA does say...
>
>     If you have multiple computers that need to access the same index
>     stored on a shared disk, you should set the lock directory explicitly
>     so that applications on different computers see each other's locks.
>
> http://www.lucenebook.com/search?query=multiple+computers+%22see+each+other%27s+locks%22
>
> ...which prompts me to wonder, how do people do this (ie: configure
> lockDir such that processes on separate physical computers respect
> each other's locks) without using NFS?

There is a system property that controls where the lock files are written:

http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-59be30838bbb5692e605384b5f4c2f224f3dfa6f
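For what it's worth, a minimal sketch of pointing two JVMs at a shared lock
directory. The property name below is the one read by the 1.4-era FSDirectory
(check the FAQ entry above for your version), the paths are illustrative, and
per this thread it still does not make lock files safe on NFS itself:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class SharedLockDirDemo {
    public static void main(String[] args) throws Exception {
        // Must be set identically in every process, before any Directory is
        // opened, because the lock dir is resolved from this property.
        // Command-line equivalent: -Dorg.apache.lucene.lockDir=/var/lucene/locks
        System.setProperty("org.apache.lucene.lockDir", "/var/lucene/locks");

        IndexWriter writer = new IndexWriter("/shared/index", new StandardAnalyzer(), false);
        // ... add documents ...
        writer.close();
    }
}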
sharing lock files on multiple computers
LIA mentions that it's not a good idea to put lock files on an NFS volume. I
can't think offhand of any specific examples of why this is bad, but based on
my experience with NFS I'm not surprised by the advice either.

that said, the same paragraph of LIA does say...

    If you have multiple computers that need to access the same index stored
    on a shared disk, you should set the lock directory explicitly so that
    applications on different computers see each other's locks.

http://www.lucenebook.com/search?query=multiple+computers+%22see+each+other%27s+locks%22

...which prompts me to wonder, how do people do this (ie: configure lockDir
such that processes on separate physical computers respect each other's
locks) without using NFS?

-Hoss
Re: 'db' sandbox contribution update
On Jan 19, 2005, at 00:02, Andi Vajda wrote:

> Well, normally, if you're in a 100% Java situation, you could use the
> Berkeley DB Java edition instead.

Alternatively, has anyone played with JDBM [1] to achieve the same result?

> I'm not. I'm using the same code with Chandler, a python program, and
> PyLucene (http://pylucene.osafoundation.org). Chandler and PyLucene share
> the same database environment and this can only be done if the C edition
> of Berkeley DB is the underlying db implementation.

I see. By the way, is Chandler ever going to be released in our lifetime? :o)
While waiting for Godot, there is always Haystack [2].

Cheers

--
PA
http://alt.textdrive.com/

[1] http://jdbm.sourceforge.net/
[2] http://haystack.lcs.mit.edu/
RE: ParallellMultiSearcher Vs. One big Index
The test system is not multithreaded currently, i.e. the queries are executed
serially. Which explains why the multi-term, single-index case was slower:
it was only using one thread vs. the ParallelMultiSearcher using many. I had
plenty of CPU to spare on the multi-term single index.

So if I were to make my querier multithreaded, the fastest index
configuration would ideally be one big index?

Thank you for your help!

Ryan

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, January 18, 2005 11:32 AM
To: Lucene Users List
Subject: Re: ParallellMultiSearcher Vs. One big Index

Ryan Aslett wrote:
> What I found was that for queries with one term (First Name), the large
> index beat the multiple indexes hands down (280 Queries/per second vs
> 170 Q/s).
> But for queries with multiple terms (Address), the multiple indexes beat
> out the Large index. (26 Q/s vs 16 Q/s)
> Btw, I'm running these on a 2 proc box with 16GB of ram.
>
> So what I'm trying to determine is if there are some equations out there
> that can help me find the sweet spot for splitting my indexes.

What appears to be the bottleneck, CPU or i/o?

Is your test system multi-threaded? I.e., is it attempting to execute many
queries in parallel? If you're CPU-bound then a single index should be
fastest.

Are you using compound format? If you're i/o-bound, the non-compound format
may be somewhat faster, as it permits more parallel i/o.

Is the index data on multiple drives? If you're i/o-bound then it should be
faster to use multiple drives. To permit even more parallel i/o over multiple
drives you might consider using a pool of IndexReaders. That way, with, e.g.,
striped data, each could be simultaneously reading different portions of the
same file.

Doug
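A minimal sketch of what a multithreaded query pounder could look like,
assuming a single IndexSearcher shared by all search threads (searching
through one searcher is thread-safe); the thread count, iteration count,
index path, and field name below are made up for the example:

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class QueryPounder {
    public static void main(String[] args) throws Exception {
        final IndexSearcher searcher = new IndexSearcher("/data/bigindex");
        final TermQuery query = new TermQuery(new Term("firstname", "ryan"));

        Thread[] threads = new Thread[8];           // queries now run in parallel
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread() {
                public void run() {
                    try {
                        for (int j = 0; j < 1000; j++) {
                            Hits hits = searcher.search(query);   // shared searcher
                            hits.length();
                        }
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            };
            threads[i].start();
        }
        for (int i = 0; i < threads.length; i++)
            threads[i].join();
        searcher.close();
    }
}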
Re: 'db' sandbox contribution update
> Hmmm... out of curiosity... any reason not to use the Berkeley DB Java
> Edition instead of the Java API to C Berkeley DB?
>
> http://www.sleepycat.com/products/je.shtml

Well, normally, if you're in a 100% Java situation, you could use the
Berkeley DB Java edition instead.

I'm not. I'm using the same code with Chandler, a python program, and
PyLucene (http://pylucene.osafoundation.org). Chandler and PyLucene share the
same database environment and this can only be done if the C edition of
Berkeley DB is the underlying db implementation.

There are three Java APIs for Berkeley DB available now:
  - Java API for C Berkeley DB 4.2.x
  - Java API for C Berkeley DB 4.3.x
  - Berkeley DB 100% Java Edition
These APIs are different from each other, although 4.3.x and 100% Java are
close.

Many months ago, somebody contacted me about rewriting DbDirectory for the
Java Edition of Berkeley DB, but I haven't heard from him in a long long
while.

Andi..
Re: 'db' sandbox contribution update
On Jan 18, 2005, at 22:26, Andi Vajda wrote:

> With the release of Berkeley DB 4.3.x, Sleepycat radically changed the Java
> API to C Berkeley DB.

Hmmm... out of curiosity... any reason not to use the Berkeley DB Java
Edition instead of the Java API to C Berkeley DB?

http://www.sleepycat.com/products/je.shtml

Cheers

--
PA
http://alt.textdrive.com/
Question about proximity searching and wildcards
Hello,

We are using Lucene to index scientific articles. We are also using Luke to
verify the fields and values we index.

One of the fields we index is the author field, which consists of the authors
that have written the scientific article (an example of such data is shown at
the bottom of the email).

The most common search on the author field is the following:

"find all the authors whose last name starts with Cole and the first name
starts with S"

We thought of a proximity search (we want to make sure we take the first name
and not the middle name/initial) similar to:

"Author:cole* S*"~1
"Author:cole* AND Author:S*"~1

What we were expecting was: all the documents that contain authors whose last
name starts with Cole and whose first name starts with S, where those words
are near (next to each other).

Unfortunately, when we type that search through the Luke "search interface",
we do not get the expansion of the words when using the similarity at the
same time.

So my questions:

1) Is Luke unable to deal with that?
2) Is the query not properly structured to get what we expected? Which would
   be the correct one?

If Luke cannot deal with that, when writing the query through the Java
application, which query would have to be provided to get the expected
result? Do we need to use a query filter?

Thanks a lot in advance for your help,

Mariella

_

E.g. below there are three examples of data we index, and to be precise, the
information related to the Authors field. The following is the information
related to scientific articles that we index.

1) The Authors field consists of two authors

Title: Using Document Dimensions for Enhanced Information Retrieval
Authors: Jayasooriya, Thimala ([EMAIL PROTECTED]); Manandhar, Suresha ([EMAIL PROTECTED])
Affiliations: a. Department of Computer Science, University of York
Abstract (English): Conventional document search techniques are constrained
by attempting to match individual keywords or phrases to source documents.
Thus, these techniques miss out documents that contain semantically similar
terms, thereby achieving a relatively low degree of recall. At the same time,
processing capabilities and tools for syntactic and semantic analysis of
language have advanced to the point where an index-time linguistic analysis
of source documents is both feasible and realistic. In this paper, we
introduce document dimensions, a means of classifying or grouping terms
discovered in documents. Using an enhanced version of Jakarta Lucene [1], we
demonstrate that supplementing keyword analysis with some syntactic and
semantic information can indeed enhance the quality of information retrieval
results.
Publisher: Springer-Verlag
Publication Type: Original Paper
ISSN: 0302-9743
ISBN: 3-540-23659-7
Book DOI: 10.1007/b101591

2) The Authors field consists of six authors

Title: Multilingual Retrieval Experiments with MIMOR at the University of Hildesheim
Authors: Hackl, Renéa; Kölle, Ralpha; Mandl, Thomasa ([EMAIL PROTECTED]); Ploedt, Alexandraa; Scheufen, Jan-Hendrika; Womser-Hacker, Christaa
Affiliations: a. University of Hildesheim, Information Science, Marienburger Platz 22, D-31141 Hildesheim
Abstract (English): Fusion and optimization based relevance judgements have
proven to be successful strategies in information retrieval. In this year's
CLEF campaign we applied these strategies to multilingual retrieval with four
languages. Our fusion experiments were carried out using freely available
software. We used the snowball stemmers, internet translation services and
the text retrieval tools in Lucene and the new MySQL.
Publisher: Springer-Verlag
Publication Type: Original Paper
ISSN: 0302-9743
ISBN: 3-540-24017-9
Book DOI: 10.1007/b102261

3) The Authors field consists of one author and only middle and first initial are provided

Title: Letter to the editor
Author: Coleman, S.S.a
Affiliations: a. Department of Orthopaedics, The University of Utah School of Medicine, 50 North Medical Drive, Salt Lake City, UT 84132, USA
Abstract: No Abstract
Publisher: Springer-Verlag
Item Identifier: 10.1007/s00264113
Publication Type: Article
ISSN: 0341-2695
Re: 'db' sandbox contribution update
Jian,

> I'd like to know when I use Lucene, normally under what condition I should
> use the db (berkeley db) directory instead of using the standard file
> system based directory? Could you please let me know some brief comparisons
> of using berkeley db vs. using file system and what is better?

Berkeley DB is a real database offering ACID transactions; FSDirectory is
not. Berkeley DB can be very lightweight and is easily embedded in your
application. For more information on Berkeley DB, see: http://www.sleepycat.com.

When to use DbDirectory over FSDirectory really depends on your needs and
constraints. If your index does not exceed the limits of your file system and
you have no real concurrency needs, then FSDirectory is fine. If you
want/need undoable transactions to wrap your index access calls, DbDirectory
is probably a better choice.

Andi..

> Thanks,
> Jian
>
> On Tue, 18 Jan 2005 13:26:16 -0800 (PST), Andi Vajda <[EMAIL PROTECTED]> wrote:
> > With the release of Berkeley DB 4.3.x, Sleepycat radically changed the
> > Java API to C Berkeley DB. This is to announce that the updates to the
> > DbDirectory implementation I submitted were committed to the lucene
> > sandbox at:
> > http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/db
> >
> > I also updated the 'Lucene in Action' samples that illustrate how to use
> > this Berkeley DB-based implementation of org.apache.lucene.store.Directory.
> > They are included below.
> >
> > Andi..
> >
> > [...]
Re: 'db' sandbox contribution update
Hi, Andi,

I'd like to know, when I use Lucene, normally under what condition I should
use the db (berkeley db) directory instead of using the standard file system
based directory? Could you please let me know some brief comparisons of using
berkeley db vs. using file system and what is better?

Thanks,
Jian

On Tue, 18 Jan 2005 13:26:16 -0800 (PST), Andi Vajda <[EMAIL PROTECTED]> wrote:
>
> With the release of Berkeley DB 4.3.x, Sleepycat radically changed the Java
> API to C Berkeley DB. This is to announce that the updates to the
> DbDirectory implementation I submitted were committed to the lucene sandbox
> at:
> http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/db
>
> I also updated the 'Lucene in Action' samples that illustrate how to use
> this Berkeley DB-based implementation of org.apache.lucene.store.Directory.
> They are included below.
>
> Andi..
>
> [...]
'db' sandbox contribution update
With the release of Berkeley DB 4.3.x, Sleepycat radically changed the Java
API to C Berkeley DB. This is to announce that the updates to the DbDirectory
implementation I submitted were committed to the lucene sandbox at:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/db

I also updated the 'Lucene in Action' samples that illustrate how to use this
Berkeley DB-based implementation of org.apache.lucene.store.Directory.
They are included below.

Andi..

/* --- BerkeleyDbIndexer.java --- */

package lia.tools;

import com.sleepycat.db.EnvironmentConfig;
import com.sleepycat.db.Environment;
import com.sleepycat.db.Transaction;
import com.sleepycat.db.Database;
import com.sleepycat.db.DatabaseConfig;
import com.sleepycat.db.DatabaseType;
import com.sleepycat.db.DatabaseException;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.store.db.DbDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BerkeleyDbIndexer {

    public static void main(String[] args)
        throws IOException, DatabaseException
    {
        if (args.length < 1)
        {
            System.err.println("Usage: BerkeleyDbIndexer -create");
            System.exit(-1);
        }

        String indexDir = args[0];
        boolean create = args.length == 2 ? args[1].equals("-create") : false;
        File dbHome = new File(indexDir);

        if (!dbHome.exists())
            dbHome.mkdir();
        else if (create)
        {
            File[] files = dbHome.listFiles();

            for (int i = 0; i < files.length; i++)
                if (files[i].getName().startsWith("__"))
                    files[i].delete();
        }

        EnvironmentConfig envConfig = new EnvironmentConfig();
        DatabaseConfig dbConfig = new DatabaseConfig();

        envConfig.setTransactional(true);
        envConfig.setInitializeCache(true);
        envConfig.setInitializeLocking(true);
        envConfig.setInitializeLogging(true);
        envConfig.setLogInMemory(true);
        envConfig.setAllowCreate(true);
        envConfig.setThreaded(true);
        dbConfig.setAllowCreate(true);
        dbConfig.setType(DatabaseType.BTREE);

        Environment env = new Environment(dbHome, envConfig);
        Transaction txn = null;
        Database index, blocks;

        try {
            txn = env.beginTransaction(null, null);
            index = env.openDatabase(txn, "__index__", null, dbConfig);
            blocks = env.openDatabase(txn, "__blocks__", null, dbConfig);
        } catch (DatabaseException e) {
            if (txn != null)
            {
                txn.abort();
                txn = null;
            }
            throw e;
        } finally {
            if (txn != null)
                txn.commit();
            txn = null;
        }

        DbDirectory directory;
        IndexWriter writer;

        try {
            txn = env.beginTransaction(null, null);
            directory = new DbDirectory(txn, index, blocks);
            writer = new IndexWriter(directory, new StandardAnalyzer(), create);
            writer.setUseCompoundFile(false);

            Document doc = new Document();
            doc.add(Field.Text("contents", "The quick brown fox..."));
            writer.addDocument(doc);

            writer.optimize();
            writer.close();
        } catch (IOException e) {
            txn.abort();
            txn = null;
            throw e;
        } catch (DatabaseException e) {
            if (txn != null)
            {
                txn.abort();
                txn = null;
            }
            throw e;
        } finally {
            if (txn != null)
                txn.commit();

            index.close();
            blocks.close();
            env.close();
        }

        System.out.println("Indexing Complete");
    }
}

/* --- BerkeleyDbSearcher.java --- */

package lia.tools;

import com.sleepycat.db.EnvironmentConfig;
import com.sleepycat.db.Environment;
import com.sleepycat.db.Transaction;
import com.sleepycat.db.Database;
import com.sleepycat.db.DatabaseException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.db.DbDirectory;

import java.io.File;
import java.io.IOException;

public class BerkeleyDbSearcher {

    public static void main(String[] args)
        throws IOException, DatabaseException
    {
        if (args.length != 1)
        {
            System.err.println("Usage: BerkeleyDbSearcher ");
            System.exit(-1);
        }

        File dbHome = new File(args[0]);

        EnvironmentConfig envConfig = new EnvironmentConfig();

        envConfig.setTransactional(true);
        envConfig.setInitializeCache(true);
        envConfig.setInitializeL
Re: lucene integration with relational database
: Thanks for your tips. I am trying to get a more thorough understanding
: why this would be better.

1) Give serious consideration to just putting all of your data in Lucene for
the purposes of searching. The initial example mentioned employees and
salaries and wanted to search for employees with certain names and salaries
< $X ... Lucene can do the "salary < $X" part using a RangeFilter.

2) Assuming you *must* combine your Lucene query with your SQL query...

When your goal is performance, I don't think you'll ever be able to find a
truly generic solution for all situations -- the specifics matter. For
example:

  a) is your goal specifically to discount Lucene results that don't meet a
     criterion specified in your DB?
  b) do you care about having an accurate number of total matches, or do you
     only care about "filtering" out results?

Depending on the answers, a fairly fast way to "eliminate" results is to only
worry about the page of results you are looking at. Consider an employee
search application which displays 10 results per page. First you do a Lucene
search by name, then you want to throw out any employees whose salary is
below $X. Use the Hits object from the Lucene search to get the unique IDs
for the first 10 employees (which uses a very small, fixed amount of memory
and time, regardless of how big your index/result is), then do a lookup in
your DB using a query built from those 10 IDs, ala:

    select ... from ... where ID in (1234, 5678 ... 7890)

...(which should also be very fast assuming your DB has a primary key on ID).

If the 10 IDs all match your SQL query then you're done. If N don't match
your query, then you need to find the next N results from Hits that do; so
just repeat the steps above until you've gotten 10 viable results. (Given
good statistics on your data, you can virtually eliminate the need to execute
more than a few iterations ... if nothing else, you can use the ratio of
misses/hits from the first SQL query -- N of 10 didn't match -- to decide how
big to make your second query to ensure you'll get N good ones.)

-Hoss
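A minimal sketch of that page-at-a-time approach, assuming a stored Lucene
field named "id" that matches the database primary key and a hypothetical
employee(id, salary) table; in real code the IN clause would normally be
built with a PreparedStatement rather than string concatenation:

import java.io.IOException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class PagedDbFilter {
    /** Returns up to pageSize Documents from the Lucene result that also pass the SQL test. */
    public static List fetchPage(IndexSearcher searcher, Query query,
                                 Connection con, int pageSize)
            throws IOException, SQLException {
        List page = new ArrayList();
        Hits hits = searcher.search(query);       // e.g. the name query
        int cursor = 0;

        while (page.size() < pageSize && cursor < hits.length()) {
            int batch = Math.min(pageSize - page.size(), hits.length() - cursor);

            // Ask the database which of these ids also satisfy the salary criterion.
            StringBuffer sql = new StringBuffer(
                "select id from employee where salary < 50000 and id in (");
            for (int i = 0; i < batch; i++) {
                if (i > 0) sql.append(',');
                sql.append(hits.doc(cursor + i).get("id"));
            }
            sql.append(')');

            Set approved = new HashSet();
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(sql.toString());
            while (rs.next())
                approved.add(rs.getString(1));
            rs.close();
            stmt.close();

            // Keep the hits the database approved; loop again if more are needed.
            for (int i = 0; i < batch; i++) {
                Document d = hits.doc(cursor + i);
                if (approved.contains(d.get("id")))
                    page.add(d);
            }
            cursor += batch;
        }
        return page;
    }
}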
Re: lucene integration with relational database
Hi, Andy,

Thanks for your tips. I am trying to get a more thorough understanding of why
this would be better. It seems to me that the performance gain and memory
reduction come from the fact that you don't need to store all the Lucene
matched ids in memory. Is that right?

Thanks,
Jian

On Tue, 18 Jan 2005 11:22:39 -0800, Andy Goodell <[EMAIL PROTECTED]> wrote:
> I do these kinds of queries all the time. I found that the fastest
> performance for my collections (millions of documents) came from
> subclassing Filter using the set of primary keys from the database to
> make the Filter, and then doing the query with the
> Searcher.search(query,filter) interface. I was previously using the
> in memory merge, but the memory requirements were crashing the JVM
> when we had a lot of simultaneous users.
>
> - andy g
Re: ParallellMultiSearcher Vs. One big Index
Ryan Aslett wrote:
> What I found was that for queries with one term (First Name), the large
> index beat the multiple indexes hands down (280 Queries/per second vs
> 170 Q/s).
> But for queries with multiple terms (Address), the multiple indexes beat
> out the Large index. (26 Q/s vs 16 Q/s)
> Btw, I'm running these on a 2 proc box with 16GB of ram.
>
> So what I'm trying to determine is if there are some equations out there
> that can help me find the sweet spot for splitting my indexes.

What appears to be the bottleneck, CPU or i/o?

Is your test system multi-threaded? I.e., is it attempting to execute many
queries in parallel? If you're CPU-bound then a single index should be
fastest.

Are you using compound format? If you're i/o-bound, the non-compound format
may be somewhat faster, as it permits more parallel i/o.

Is the index data on multiple drives? If you're i/o-bound then it should be
faster to use multiple drives. To permit even more parallel i/o over multiple
drives you might consider using a pool of IndexReaders. That way, with, e.g.,
striped data, each could be simultaneously reading different portions of the
same file.

Doug
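One way to realize the IndexReader-pool idea Doug mentions, as a rough
sketch; the class name, pool size handling, and round-robin hand-out policy
are made up, and each searcher gets its own IndexReader (and therefore its
own file handles) over the same index:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherPool {
    private final IndexSearcher[] searchers;
    private int cursor = 0;

    public SearcherPool(String indexPath, int size) throws IOException {
        searchers = new IndexSearcher[size];
        for (int i = 0; i < size; i++)
            // A separate IndexReader per searcher allows more parallel i/o.
            searchers[i] = new IndexSearcher(IndexReader.open(indexPath));
    }

    /** Hand out searchers round-robin; callers can share them across query threads. */
    public synchronized IndexSearcher next() {
        IndexSearcher s = searchers[cursor];
        cursor = (cursor + 1) % searchers.length;
        return s;
    }

    public void close() throws IOException {
        for (int i = 0; i < searchers.length; i++)
            searchers[i].close();
    }
}

A caller would then do something like Hits hits = pool.next().search(query);
per request.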
Re: lucene integration with relational database
I do these kinds of queries all the time. I found that the fastest
performance for my collections (millions of documents) came from subclassing
Filter using the set of primary keys from the database to make the Filter,
and then doing the query with the Searcher.search(query, filter) interface. I
was previously using the in-memory merge, but the memory requirements were
crashing the JVM when we had a lot of simultaneous users.

- andy g

On Sat, 15 Jan 2005 23:03:00 +0530, sunil goyal <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> Thanks for the answers. I was looking for a best practice guide to do the
> same. If anyone already has had some practical experience with such kind
> of queries, it would be great to know their thoughts.
>
> Thanks
>
> Regards
> Sunil
>
> On Sat, 15 Jan 2005 09:00:35 -0800, jian chen <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Still minor additions to the steps:
> >
> > 1) do lucene query and get the hits (keyed by the database primary key,
> > for example, employee id)
> >
> > 2) do database query and get the primary keys (i.e., employee id) for
> > the result rows, ordered by primary key
> >
> > 3) for each lucene query result, look into the db query result and see
> > if the primary key is there (since the db query result is already sorted
> > by primary key, a binary search could be applied)
> >
> >    if the primary key is there, store this result, else discard it
> >
> > 4) when top k results are obtained, send back to the user.
> >
> > How does this sound?
> >
> > Cheers,
> >
> > Jian
> >
> > On Sat, 15 Jan 2005 08:36:16 -0800, jian chen <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > To further the discussion, would the following detailed steps work:
> > >
> > > 1) do lucene query and get the hits (keyed by the database primary
> > > key, for example, employee id)
> > >
> > > 2) do database query and get the primary keys (i.e., employee id) for
> > > the result rows, ordered by primary key
> > >
> > > 3) merge the two sets of primary keys (for example, in memory two-way
> > > merge) and take the top k records
> > >
> > > 4) display the top k result rows
> > >
> > > Cheers,
> > >
> > > Jian
> > >
> > > On Sat, 15 Jan 2005 12:40:04 +, Peter Pimley <[EMAIL PROTECTED]> wrote:
> > > > sunil goyal wrote:
> > > >
> > > > > But can i do for instance a unified query where i want to take
> > > > > certain parameters (non-textual e.g. age < 30) from relational
> > > > > databases and keywords from the lucene index?
> > > >
> > > > When I have had to do this, I've done the lucene search first, and
> > > > then manually filtered out the hits that fail on other criteria.
> > > >
> > > > I'd suggest doing that first (as it's easiest) and then seeing
> > > > whether the performance is acceptable.
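A rough sketch of the kind of Filter subclass Andy describes, assuming the
database primary keys are indexed untokenized in a Lucene field called "id";
the class and field names are illustrative:

import java.io.IOException;
import java.util.BitSet;
import java.util.Iterator;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

public class PrimaryKeyFilter extends Filter {
    private final Set keys;   // String primary keys returned by the SQL query

    public PrimaryKeyFilter(Set keys) {
        this.keys = keys;
    }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (Iterator it = keys.iterator(); it.hasNext();) {
            // Turn each approved primary key into the doc(s) that carry it.
            TermDocs td = reader.termDocs(new Term("id", (String) it.next()));
            try {
                while (td.next())
                    bits.set(td.doc());
            } finally {
                td.close();
            }
        }
        return bits;
    }
}

It would then be used as searcher.search(query, new PrimaryKeyFilter(keysFromDb)),
so only documents whose id survived the database query can score.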
ParallellMultiSearcher Vs. One big Index
Okay, so I'm trying to find the sweet spot on how many index segments I
should have. I have 47 million records of contact data (Name + Address).

I used 7 machines to build indexes that resulted in the following spread of
individual indexes:

1503000
150
1497000
5604750
5379750
1437000
1458000
1446000
1422000
1425000
1425000
1404000
1413000
1404000
4893750
4689750
4519500
4497750

46919250 Total Records

(The faster machines built the bigger indexes.)

I also joined all these indexes together into one large 47 million record
index, and ran my query pounder against both data sets, one using the
ParallelMultiSearcher for the multiple indexes, and one using a normal
IndexSearcher against the large index.

What I found was that for queries with one term (First Name), the large index
beat the multiple indexes hands down (280 queries per second vs 170 Q/s). But
for queries with multiple terms (Address), the multiple indexes beat out the
large index (26 Q/s vs 16 Q/s).

Btw, I'm running these on a 2 proc box with 16GB of ram.

So what I'm trying to determine is if there are some equations out there that
can help me find the sweet spot for splitting my indexes. Most queries are
going to be multi-term, and clearly the big O of the single term search
appears to be log n. (I verified with 470 million records... the single term
search returns at 140 qps, consistent with what I believe about search
algorithms.)

The equation that I'm missing is the big O for the union of the result sets
that match particular terms. I'm assuming (haven't looked at the source yet)
that Lucene finds all the documents that match the first term, and all the
documents that match each subsequent term, and then finds the union between
all the sets. Is this correct?

Anybody have any ideas on how to iron out an equation for this?

Ryan
Re: StandardAnalyzer unit tests?
On Jan 17, 2005, at 4:51 AM, Chris Lamprecht wrote:

> I submitted a testcase -- http://issues.apache.org/bugzilla/show_bug.cgi?id=33134

I reviewed and applied your contributed unit test. Thanks!

Erik
AW: How to get all field values from a Hits object?
Thank you very much.

--Tim

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Chris Hostetter
> Sent: Tuesday, January 18, 2005 04:56
> To: Lucene Users List
> Subject: Re: How to get all field values from a Hits object?
>
>
> : is it possible to get all different values for a Field
> : from a Hits object and how to do this?
>
> The wording of your question suggests that the Field you are interested in
> isn't a field which will have a fairly unique value for every doc (ie: not
> a "title", more likely an "author" or "category" field). Starting with
> that assumption, there is a fairly efficient way to get the information
> you want...
>
> Assuming the total set of values for the Field you are interested in is
> small (relative to your index size), you can pre-compute a BitSet for
> each value indicating which docs match that value in the Field (using a
> TermFilter). Then store those BitSets in a Map (keyed by field value).
>
> Every time a search is performed, use a HitCollector that generates a
> BitSet containing the documents in your result; AND that BitSet against (a
> copy of) each BitSet in your Map. All of the resulting BitSets with a
> non-zero cardinality represent values in your results. (As an added bonus,
> the cardinality() of each BitSet is the total number of docs in your
> result that contain that value.)
>
> Two caveats:
>    1) Every time you modify your index, you have to regen the BitSets in
>       your Map.
>    2) You have to know the set of all values for the field you are
>       interested in. In many cases, this is easy to determine from the
>       source data while building the index, but it's also possible to
>       get it using IndexReader.termDocs(Term).
>
> (I'm doing something like this to provide ancillary information about
> which categories of documents are most common in the user's search
> result, and what the exact number of documents in those categories is)
>
> -Hoss
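A rough sketch of the approach Hoss describes, assuming Lucene 1.4-era APIs
and JDK 1.4's BitSet.cardinality(); the index path, the field name "category"
and its value list are illustrative, and the per-value BitSets would have to
be rebuilt whenever the index changes:

import java.util.BitSet;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ValueCounts {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/data/index");
        IndexSearcher searcher = new IndexSearcher(reader);

        // Pre-compute one BitSet per known field value (caveat 1: redo on index changes).
        String[] values = { "news", "sports", "science" };
        Map valueBits = new HashMap();               // value -> BitSet of matching docs
        for (int i = 0; i < values.length; i++) {
            BitSet bits = new BitSet(reader.maxDoc());
            TermDocs td = reader.termDocs(new Term("category", values[i]));
            while (td.next())
                bits.set(td.doc());
            td.close();
            valueBits.put(values[i], bits);
        }

        // At search time, collect the result docs into a BitSet...
        Query query = new TermQuery(new Term("contents", "lucene"));
        final BitSet resultBits = new BitSet(reader.maxDoc());
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                resultBits.set(doc);
            }
        });

        // ...then AND it against a copy of each per-value BitSet; non-zero
        // cardinality means the value occurs in the result, and cardinality()
        // is the per-value document count.
        for (Iterator it = valueBits.entrySet().iterator(); it.hasNext();) {
            Map.Entry e = (Map.Entry) it.next();
            BitSet copy = (BitSet) ((BitSet) e.getValue()).clone();
            copy.and(resultBits);
            if (copy.cardinality() > 0)
                System.out.println(e.getKey() + ": " + copy.cardinality() + " docs");
        }
        searcher.close();
        reader.close();
    }
}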