Re: Question about proximity searching and wildcards

2005-01-18 Thread Morus Walter
Mariella Di Giacomo writes:
> Hello,
> 
> We are using Lucene to index scientific articles.
> We are also using Luke to verify the fields and values we index.
> 
> One of the fields we index is the author field that consists of the authors 
> that have written the scientific article (an example of such data is shown 
> at the bottom of the email).
> 
> The most common search on the author field is the following:
> 
> "find all the authors whose last name starts with Cole and the first name 
> starts with S"
> 
> We thought of a proximity search (we want to make sure we take the first 
> name and not the middle name/initial) similar to the following:
> 
The query parser cannot do that.

> "Author:cole* S*"~1

In that case you cannot expand the wildcard terms.

> "Author:cole* AND Author:S*"~1

You cannot mix boolean queries and proximity queries.

The closest thing to your query is a phrase prefix query, but that's designed
for something like 'Cole S*', not 'Cole* S*'.

Searching for 'Cole* S*' means searching for all combinations of possible 
expansions of Cole and S. You can do that by expanding the terms yourself,
but I'd expect that a) to be slow and b) to create trouble with the maximum
number of boolean clauses (or memory usage).
If there are 10 expansions of Cole and 500 of S (that's not just
first names, that's all names), you have to do 5000 proximity searches.
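A rough sketch of that phrase-prefix route for 'Cole S*' (the S* expansions
still have to be enumerated by hand; "Author" is the field from your query,
the index path and the rest are illustrative, 1.4-era API):

    IndexReader reader = IndexReader.open("/path/to/index");
    IndexSearcher searcher = new IndexSearcher(reader);

    PhrasePrefixQuery query = new PhrasePrefixQuery();
    query.add(new Term("Author", "cole"));                  // exact term at position 1

    ArrayList expansions = new ArrayList();                 // all author terms starting with "s"
    TermEnum terms = reader.terms(new Term("Author", "s"));
    try {
        while (terms.term() != null
               && terms.term().field().equals("Author")
               && terms.term().text().startsWith("s")) {
            expansions.add(terms.term());
            if (!terms.next()) break;
        }
    } finally {
        terms.close();
    }
    query.add((Term[]) expansions.toArray(new Term[expansions.size()]));  // any of them at position 2

    Hits hits = searcher.search(query);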
 
> If Luke cannot deal with that, when writing the query through the Java 
> application, which query should we provide to get the expected result?
> Do we need to use a query filter?
> 
I would use different fields for first and last name in this case.
And if it's relevant to search for the first character of the first name,
I'd index that additionally.
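A minimal sketch of that approach (field names are made up; assuming an open
IndexWriter 'writer' and IndexSearcher 'searcher'):

    // index time: first and last name in separate fields, plus the first initial
    Document doc = new Document();
    doc.add(Field.Text("lastname", "coleman"));
    doc.add(Field.Text("firstname", "samuel"));
    doc.add(Field.Keyword("firstinit", "s"));
    writer.addDocument(doc);

    // search time: last name starting with "cole" AND first name starting with "s"
    BooleanQuery query = new BooleanQuery();
    query.add(new PrefixQuery(new Term("lastname", "cole")), true, false);   // required
    query.add(new PrefixQuery(new Term("firstname", "s")), true, false);     // required
    // (or new TermQuery(new Term("firstinit", "s")) if only the initial matters)
    Hits hits = searcher.search(query);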

HTH
Morus




TermPositionVector

2005-01-18 Thread Siddharth Vijayakrishnan
Hi,

I am adding a field to a document in the index as follows

doc.add(new Field("contents",reader,Field.TermVector.WITH_POSITIONS))

Later, I query the index and get the document id of this document. The
following code, however, prints "false".

 TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
 System.out.println("Is a TermPositionVector: " + (tfv instanceof TermPositionVector));

Using Field.TermVector.WITH_POSITIONS_OFFSETS, while creating the
field, also produces the same result.
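A sketch of how the positions would be read if the cast succeeded (this is
just the TermPositionVector API as I understand it, not an explanation of the
behaviour above):

    TermFreqVector tfv = reader.getTermFreqVector(docId, "contents");
    if (tfv instanceof TermPositionVector) {
        TermPositionVector tpv = (TermPositionVector) tfv;
        String[] terms = tpv.getTerms();
        for (int i = 0; i < terms.length; i++) {
            int[] positions = tpv.getTermPositions(i);  // positions of terms[i] in the field
            System.out.println(terms[i] + ": " + positions.length + " occurrence(s)");
        }
    }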

Can someone tell me why this is happening ? 


Thanks,
Siddharth




Re: sharing lock files on multiple computers

2005-01-18 Thread Erik Hatcher
On Jan 18, 2005, at 8:09 PM, Chris Hostetter wrote:
: > ...which prompts me to wonder, how do people do this (ie: configure
: > lockDir such that processes on separate physical computers respect
: > each other's locks) without using NFS?

> My question is: Given the assertion that it's not safe to keep lock
> files on an NFS partition, what mechanism do/would/should people use to
> enable two applications running on separate physical machines to use the
> same lock file directory?

I don't have experience with NFS, but the issue has cropped up numerous
times on this e-mail list, and the general advice is "don't use Lucene
on NFS drives, period", which is why we provided that same advice in
LIA.  However, I'm admittedly not well versed in the reasons the problem
exists.

> (this question is based on the understanding that unless the applications
> are sharing the same lock directory, an index may be corrupted by
> concurrent modifications, correct?)

Right - concurrent writes to the index could cause trouble.
Erik


Re: sharing lock files on multiple computers

2005-01-18 Thread Chris Hostetter
: > ...which prompts me to wonder, how do people do this (ie: configure
: > lockDir such that processes on separate physical computers respect
: > each other's locks) without using NFS?
:
: There is a system property that controls where the lock files are
: written:
:
:   http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-59be30838bbb5692e605384b5f4c2f224f3dfa6f

Um, yeah.  I'm actually the one that added that FAQ answer last week :)

My question is: Given the assertion that it's not safe to keep lock
files on an NFS partition, what mechanism do/would/should people use to
enable two applications running on separate physical machines to use the
same lock file directory?

(this question is based on the understanding that unless the applications
are sharing the same lock directory, an index may be corrupted by
concurrent modifications, correct?)



-Hoss





Re: sharing lock files on multiple computers

2005-01-18 Thread Erik Hatcher
On Jan 18, 2005, at 6:51 PM, Chris Hostetter wrote:
> that said, the same paragraph of LIA does say...
>
>    If you have multiple computers that need to access the same index
>    stored on a shared disk, you should set the lock directory explicitly
>    so that applications on different computers see each other's locks.
>
> http://www.lucenebook.com/search?query=multiple+computers+%22see+each+other%27s+locks%22
>
> ...which prompts me to wonder, how do people do this (ie: configure
> lockDir such that processes on separate physical computers respect
> each other's locks) without using NFS?

There is a system property that controls where the lock files are
written:

	http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-59be30838bbb5692e605384b5f4c2f224f3dfa6f
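For example (a sketch; I believe the property name in 1.4 is
org.apache.lucene.lockDir -- check the FAQ entry above -- and the path here
is made up; every JVM that touches the index needs to point at the same
shared directory):

    // set it programmatically before any index is opened...
    System.setProperty("org.apache.lucene.lockDir", "/mnt/shared/lucene-locks");

    // ...or pass it on the command line:
    //   java -Dorg.apache.lucene.lockDir=/mnt/shared/lucene-locks MyIndexer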




sharing lock files on multiple computers

2005-01-18 Thread Chris Hostetter

LIA mentions that it's not a good idea to put lock files on an NFS volume.
I can't think offhand of any specific examples of why this is bad, but
based on my experience with NFS I'm not surprised by the advice either.

that said, the same paragraph of LIA does say...

   If you have multiple computers that need to access the same index
   stored on a shared disk, you should set the lock directory explicitly
   so that applications on different computers see each other's locks.

http://www.lucenebook.com/search?query=multiple+computers+%22see+each+other%27s+locks%22

...which prompts me to wonder, how do people do this (ie: configure
lockDir such that processes on separate physical computers respect
each other's locks) without using NFS?



-Hoss





Re: 'db' sandbox contribution update

2005-01-18 Thread PA
On Jan 19, 2005, at 00:02, Andi Vajda wrote:
> Well, normally, if you're in a 100% Java situation, you could use the
> Berkeley DB Java edition instead.

Alternatively, has anyone played with JDBM [1] to achieve the same
result?

> I'm not. I'm using the same code with Chandler, a python program, and
> PyLucene (http://pylucene.osafoundation.org). Chandler and PyLucene
> share the same database environment and this can only be done if the C
> edition of Berkeley DB is the underlying db implementation.

I see.
By the way, is Chandler ever going to be released in our lifetime? :o)
While waiting for Godot, there is always Haystack [2].
Cheers
--
PA
http://alt.textdrive.com/
[1] http://jdbm.sourceforge.net/
[2] http://haystack.lcs.mit.edu/


RE: ParallellMultiSearcher Vs. One big Index

2005-01-18 Thread Ryan Aslett
The test system is not currently multithreaded, i.e. the queries are
executed serially.
That explains why the multi-term query against the single index was slower,
i.e. only one thread vs. the parallel multisearcher using many.
I had plenty of CPU left in the multi-term, single-index case.  So if I were
to make my querier multithreaded, the fastest index configuration would
ideally be one big index?

Thank you for your help!
Ryan


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, January 18, 2005 11:32 AM
To: Lucene Users List
Subject: Re: ParallellMultiSearcher Vs. One big Index

Ryan Aslett wrote:
> What I found was that for queries with one term (First Name), the large
> index beat the multiple indexes hands down (280 queries per second vs
> 170 Q/s).
> But for queries with multiple terms (Address), the multiple indexes beat
> out the large index (26 Q/s vs 16 Q/s).
> Btw, I'm running these on a 2 proc box with 16GB of ram.
> 
> So what I'm trying to determine is whether there is some equation out there
> that can help me find the sweet spot for splitting my indexes.

What appears to be the bottleneck, CPU or i/o?  Is your test system 
multi-threaded?  I.e., is it attempting to execute many queries in 
parallel?  If you're CPU-bound then a single index should be fastest. 
Are you using compound format?  If you're i/o-bound, the non-compound 
format may be somewhat faster, as it permits more parallel i/o.  Is the 
index data on multiple drives?  If you're i/o bound then it should be 
faster to use multiple drives.  To permit even more parallel i/o over 
multiple drives you might consider using a pool of IndexReaders.  That 
way, with, e.g., striped data, each could be simultaneously reading 
different portions of the same file.

Doug




Re: 'db' sandbox contribution update

2005-01-18 Thread Andi Vajda

> Hmmm... out of curiosity... any reason not to use the Berkeley DB Java
> Edition instead of the Java API to C Berkeley DB?
> http://www.sleepycat.com/products/je.shtml
Well, normally, if you're in a 100% Java situation, you could use the 
Berkeley
DB Java edition instead. I'm not. I'm using the same code with Chandler, a
python program, and PyLucene (http://pylucene.osafoundation.org).
Chandler and PyLucene share the same database environment and this can only be
done if the C edition of Berkeley DB is the underlying db implementation.
There are three Java APIs for Berkeley DB available now:
   - Java API for C Berkeley DB 4.2.x
   - Java API for C Berkeley DB 4.3.x
   - Berkeley DB 100% Java Edition
These APIs are different from each other although 4.3.x and 100% Java are
close. Many months ago, somebody contacted me about rewriting DbDirectory for
the Java Edition of Berkeley DB, but I haven't heard from him in a long long
while.
Andi..


Re: 'db' sandbox contribution update

2005-01-18 Thread PA
On Jan 18, 2005, at 22:26, Andi Vajda wrote:
> With the release of Berkeley DB 4.3.x, Sleepycat radically changed the
> Java API to C Berkeley DB.
Hmmm... out of curiosity... any reason not to use the Berkeley DB Java 
Edition instead of the Java API to C Berkeley DB?

http://www.sleepycat.com/products/je.shtml
Cheers
--
PA
http://alt.textdrive.com/


Question about proximity searching and wildcards

2005-01-18 Thread Mariella Di Giacomo
Hello,
We are using Lucene to index scientific articles.
We are also using Luke to verify the fields and values we index.
One of the fields we index is the author field that consists of the authors 
that have written the scientific article (an example of such data is shown 
at the bottom of the email).

The most common search on the author field is the following:
"find all the authors whose last name starts with Cole and the first name 
starts with S"

We thought of a proximity search (we want to make sure we take the first 
name and not the middle name/initial) similar to the following:

"Author:cole* S*"~1
"Author:cole* AND Author:S*"~1
What we were expecting was: all the documents that contain authors whose 
last name starts with Cole and the first name starts with S and those words 
are near (next to each other)

Unfortunately, when we type that search through the Luke "search interface" 
we do not get the expansion of the words when using the proximity operator 
at the same time.

So my questions:
1) Is it that Luke cannot deal with that?
2) Is the query not properly structured to get what we expect? Which would 
be the correct one?

If Luke cannot deal with that, when writing the query through the Java 
application, which query should we provide to get the expected result?
Do we need to use a query filter?

Thanks a lot in advance for your help,
Mariella


_
E.g.
Below are three examples of the data we index, and specifically the 
information related to the Authors field.

The following is the information related to scientific articles that we index.
1)
The Authors field consists of two authors
Title: Using Document Dimensions for Enhanced Information Retrieval
Authors: Jayasooriya, Thimala([EMAIL PROTECTED]); Manandhar, 
Suresha([EMAIL PROTECTED])
Affiliations: a. Department of Computer Science, University of York
Abstract (English): Conventional document search techniques are constrained 
by attempting to match individual keywords or phrases to source documents. 
Thus, these techniques miss out documents that contain semantically similar 
terms, thereby achieving a relatively low degree of recall. At the same 
time, processing capabilities and tools for syntactic and semantic analysis 
of language have advanced to the point where an index-time linguistic 
analysis of source documents is both feasible and realistic. In this paper, 
we introduce document dimensions, a means of classifying or grouping terms 
discovered in documents. Using an enhanced version of Jakarta Lucene[1], we 
demonstrate that supplementing keyword analysis with some syntactic and 
semantic information can indeed enhance the quality of information 
retrieval results.
Publisher: Springer-Verlag
Publication Type: Original Paper
ISSN: 0302-9743
ISBN: 3-540-23659-7
Book DOI: 10.1007/b101591

2)
The Authors field consists of six authors
Title: Multilingual Retrieval Experiments with MIMOR at the University of 
Hildesheim
Authors: Hackl, Renéa; Kölle, Ralpha; Mandl, 
Thomasa([EMAIL PROTECTED]); Ploedt, Alexandraa; Scheufen, 
Jan-Hendrika; Womser-Hacker, Christaa
Affiliations: a. University of Hildesheim, Information Science, 
Marienburger Platz 22, D-31141 Hildesheim
Abstract (English): Fusion and optimization based relevance judgements have 
proven to be successful strategies in information retrieval. In this years 
CLEF campaign we applied these strategies to multilingual retrieval with 
four languages. Our fusion experiments were carried out using freely 
available software. We used the snowball stemmers, internet translation 
services and the text retrieval tools in Lucene and the new MySQL.
Publisher: Springer-Verlag
Publication Type: Original Paper
ISSN: 0302-9743
ISBN: 3-540-24017-9
Book DOI: 10.1007/b102261

3)
The Authors field consists of one author and only middle and first initial 
are provided

Title: Letter to the editor
Author: Coleman, S.S.a
Affiliations: a. Department of Orthopaedics, The University of Utah School 
of Medicine, 50 North Medical Drive, Salt Lake City, UT 84132, USA US
Abstract: No Abstract
Publisher: Springer-Verlag
Item Identifier: 10.1007/s00264113
Publication Type: Article
ISSN: 0341-2695




Re: 'db' sandbox contribution update

2005-01-18 Thread Andi Vajda
Jian,

> I'd like to know when I use Lucene, normally under what condition I
> should use the db (berkeley db) directory instead of using the
> standard file system based directory?
> Could you please let me know some brief comparisons of using berkeley
> db vs. using file system and what is better?

Berkeley DB is a real database offering ACID transactions, FSDirectory is not. 
Berkeley DB can be very lightweight and is easily embedded in your 
application. For more information on Berkeley DB, see:
http://www.sleepycat.com.

When to use DbDirectory over FSDirectory really depends on your needs and 
constraints. If your index does not exceed the limits of your file system and 
you have no real concurrency needs then FSDirectory is fine. If you want/need
undoable transactions to wrap your index access calls, DbDirectory is probably 
a better choice.

Andi..
Thanks,
Jian
On Tue, 18 Jan 2005 13:26:16 -0800 (PST), Andi Vajda
<[EMAIL PROTECTED]> wrote:
With the release of Berkeley DB 4.3.x, Sleepycat radically changed the Java
API to C Berkeley DB. This is to announce that the updates to the DbDirectory
implementation I submitted were committed to the lucene sandbox at:
 http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/db
I also updated the 'Lucene in Action' samples that illustrate how to use this
Berkeley DB-based implementation of org.apache.lucene.store.Directory.
They are included below.
Andi..
/* --- BerkeleyDbIndexer.java --- */
package lia.tools;
import com.sleepycat.db.EnvironmentConfig;
import com.sleepycat.db.Environment;
import com.sleepycat.db.Transaction;
import com.sleepycat.db.Database;
import com.sleepycat.db.DatabaseConfig;
import com.sleepycat.db.DatabaseType;
import com.sleepycat.db.DatabaseException;
import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.db.DbDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
public class BerkeleyDbIndexer {
 public static void main(String[] args)
 throws IOException, DatabaseException
 {
 if (args.length < 1)
 {
 System.err.println("Usage: BerkeleyDbIndexer  -create");
 System.exit(-1);
 }
 String indexDir = args[0];
 boolean create = args.length == 2 ? args[1].equals("-create") : false;
 File dbHome = new File(indexDir);
 if (!dbHome.exists())
 dbHome.mkdir();
 else if (create)
 {
 File[] files = dbHome.listFiles();
 for (int i = 0; i < files.length; i++)
 if (files[i].getName().startsWith("__"))
 files[i].delete();
 }
 EnvironmentConfig envConfig = new EnvironmentConfig();
 DatabaseConfig dbConfig = new DatabaseConfig();
 envConfig.setTransactional(true);
 envConfig.setInitializeCache(true);
 envConfig.setInitializeLocking(true);
 envConfig.setInitializeLogging(true);
 envConfig.setLogInMemory(true);
 envConfig.setAllowCreate(true);
 envConfig.setThreaded(true);
 dbConfig.setAllowCreate(true);
 dbConfig.setType(DatabaseType.BTREE);
 Environment env = new Environment(dbHome, envConfig);
 Transaction txn = null;
 Database index, blocks;
 try {
 txn = env.beginTransaction(null, null);
 index = env.openDatabase(txn, "__index__", null, dbConfig);
 blocks = env.openDatabase(txn, "__blocks__", null, dbConfig);
 } catch (DatabaseException e) {
 if (txn != null)
 {
 txn.abort();
 txn = null;
 }
 throw e;
 } finally {
 if (txn != null)
 txn.commit();
 txn = null;
 }
 DbDirectory directory;
 IndexWriter writer;
 try {
 txn = env.beginTransaction(null, null);
 directory = new DbDirectory(txn, index, blocks);
 writer = new IndexWriter(directory, new StandardAnalyzer(), 
create);
 writer.setUseCompoundFile(false);
 Document doc = new Document();
 doc.add(Field.Text("contents", "The quick brown fox..."));
 writer.addDocument(doc);
 writer.optimize();
 writer.close();
 } catch (IOException e) {
 txn.abort();
 txn = null;
 throw e;
 } catch (DatabaseException e) {
 if (txn != null)
 {
 txn.abort();
 txn = null;
 }
 throw e;
 } finally {
 if (txn != null)
 txn.commit();
 index.close();
 blocks.close();
 env.close();
 }
 System.out.println("Indexi

Re: 'db' sandbox contribution update

2005-01-18 Thread jian chen
Hi, Andi,

I'd like to know when I use Lucene, normally under what condition I
should use the db (berkeley db) directory instead of using the
standard file system based directory?

Could you please let me know some brief comparisons of using berkeley
db vs. using file system and what is better?

Thanks,

Jian


On Tue, 18 Jan 2005 13:26:16 -0800 (PST), Andi Vajda
<[EMAIL PROTECTED]> wrote:
> 
> With the release of Berkeley DB 4.3.x, Sleepycat radically changed the Java
> API to C Berkeley DB. This is to announce that the updates to the DbDirectory
> implementation I submitted were committed to the lucene sandbox at:
>  http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/db
> 
> I also updated the 'Lucene in Action' samples that illustrate how to use this
> Berkeley DB-based implementation of org.apache.lucene.store.Directory.
> They are included below.
> 
> Andi..
> 
> /* --- BerkeleyDbIndexer.java --- */
> 
> package lia.tools;
> 
> import com.sleepycat.db.EnvironmentConfig;
> import com.sleepycat.db.Environment;
> import com.sleepycat.db.Transaction;
> import com.sleepycat.db.Database;
> import com.sleepycat.db.DatabaseConfig;
> import com.sleepycat.db.DatabaseType;
> import com.sleepycat.db.DatabaseException;
> 
> import java.io.File;
> import java.io.IOException;
> 
> import org.apache.lucene.store.db.DbDirectory;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> 
> public class BerkeleyDbIndexer {
> 
>  public static void main(String[] args)
>  throws IOException, DatabaseException
>  {
>  if (args.length < 1)
>  {
>  System.err.println("Usage: BerkeleyDbIndexer  
> -create");
>  System.exit(-1);
>  }
> 
>  String indexDir = args[0];
>  boolean create = args.length == 2 ? args[1].equals("-create") : 
> false;
>  File dbHome = new File(indexDir);
> 
>  if (!dbHome.exists())
>  dbHome.mkdir();
>  else if (create)
>  {
>  File[] files = dbHome.listFiles();
> 
>  for (int i = 0; i < files.length; i++)
>  if (files[i].getName().startsWith("__"))
>  files[i].delete();
>  }
> 
>  EnvironmentConfig envConfig = new EnvironmentConfig();
>  DatabaseConfig dbConfig = new DatabaseConfig();
> 
>  envConfig.setTransactional(true);
>  envConfig.setInitializeCache(true);
>  envConfig.setInitializeLocking(true);
>  envConfig.setInitializeLogging(true);
>  envConfig.setLogInMemory(true);
>  envConfig.setAllowCreate(true);
>  envConfig.setThreaded(true);
>  dbConfig.setAllowCreate(true);
>  dbConfig.setType(DatabaseType.BTREE);
> 
>  Environment env = new Environment(dbHome, envConfig);
>  Transaction txn = null;
>  Database index, blocks;
> 
>  try {
>  txn = env.beginTransaction(null, null);
>  index = env.openDatabase(txn, "__index__", null, dbConfig);
>  blocks = env.openDatabase(txn, "__blocks__", null, dbConfig);
>  } catch (DatabaseException e) {
>  if (txn != null)
>  {
>  txn.abort();
>  txn = null;
>  }
>  throw e;
>  } finally {
>  if (txn != null)
>  txn.commit();
>  txn = null;
>  }
> 
>  DbDirectory directory;
>  IndexWriter writer;
> 
>  try {
>  txn = env.beginTransaction(null, null);
>  directory = new DbDirectory(txn, index, blocks);
>  writer = new IndexWriter(directory, new StandardAnalyzer(), 
> create);
>  writer.setUseCompoundFile(false);
> 
>  Document doc = new Document();
>  doc.add(Field.Text("contents", "The quick brown fox..."));
>  writer.addDocument(doc);
> 
>  writer.optimize();
>  writer.close();
>  } catch (IOException e) {
>  txn.abort();
>  txn = null;
>  throw e;
>  } catch (DatabaseException e) {
>  if (txn != null)
>  {
>  txn.abort();
>  txn = null;
>  }
>  throw e;
>  } finally {
>  if (txn != null)
>  txn.commit();
> 
>  index.close();
>  blocks.close();
>  env.close();
>  }
> 
>  System.out.println("Indexing Complete");
>  }
> }
> 
> /* --- BerkeleyDbSearcher.java --- */
> 
> package lia.tools;
> 
> import com.sleepycat.db.EnvironmentConfig;
> import com.sleepycat.db.Environment;
> import com.sleepycat.db.Transaction;
> import com.sleepycat.db.Database;
> import

'db' sandbox contribution update

2005-01-18 Thread Andi Vajda
With the release of Berkeley DB 4.3.x, Sleepycat radically changed the Java 
API to C Berkeley DB. This is to announce that the updates to the DbDirectory 
implementation I submitted were committed to the lucene sandbox at:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/db

I also updated the 'Lucene in Action' samples that illustrate how to use this
Berkeley DB-based implementation of org.apache.lucene.store.Directory.
They are included below.
Andi..
/* --- BerkeleyDbIndexer.java --- */
package lia.tools;
import com.sleepycat.db.EnvironmentConfig;
import com.sleepycat.db.Environment;
import com.sleepycat.db.Transaction;
import com.sleepycat.db.Database;
import com.sleepycat.db.DatabaseConfig;
import com.sleepycat.db.DatabaseType;
import com.sleepycat.db.DatabaseException;
import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.db.DbDirectory;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
public class BerkeleyDbIndexer {
public static void main(String[] args)
throws IOException, DatabaseException
{
if (args.length < 1)
{
System.err.println("Usage: BerkeleyDbIndexer  -create");
System.exit(-1);
}
String indexDir = args[0];
boolean create = args.length == 2 ? args[1].equals("-create") : false;
File dbHome = new File(indexDir);
if (!dbHome.exists())
dbHome.mkdir();
else if (create)
{
File[] files = dbHome.listFiles();
for (int i = 0; i < files.length; i++)
if (files[i].getName().startsWith("__"))
files[i].delete();
}
EnvironmentConfig envConfig = new EnvironmentConfig();
DatabaseConfig dbConfig = new DatabaseConfig();
envConfig.setTransactional(true);
envConfig.setInitializeCache(true);
envConfig.setInitializeLocking(true);
envConfig.setInitializeLogging(true);
envConfig.setLogInMemory(true);
envConfig.setAllowCreate(true);
envConfig.setThreaded(true);
dbConfig.setAllowCreate(true);
dbConfig.setType(DatabaseType.BTREE);
Environment env = new Environment(dbHome, envConfig);
Transaction txn = null;
Database index, blocks;
try {
txn = env.beginTransaction(null, null);
index = env.openDatabase(txn, "__index__", null, dbConfig);
blocks = env.openDatabase(txn, "__blocks__", null, dbConfig);
} catch (DatabaseException e) {
if (txn != null)
{
txn.abort();
txn = null;
}
throw e;
} finally {
if (txn != null)
txn.commit();
txn = null;
}
DbDirectory directory;
IndexWriter writer;
try {
txn = env.beginTransaction(null, null);
directory = new DbDirectory(txn, index, blocks);
writer = new IndexWriter(directory, new StandardAnalyzer(), create);
writer.setUseCompoundFile(false);
Document doc = new Document();
doc.add(Field.Text("contents", "The quick brown fox..."));
writer.addDocument(doc);
writer.optimize();
writer.close();
} catch (IOException e) {
txn.abort();
txn = null;
throw e;
} catch (DatabaseException e) {
if (txn != null)
{
txn.abort();
txn = null;
}
throw e;
} finally {
if (txn != null)
txn.commit();
index.close();
blocks.close();
env.close();
}
System.out.println("Indexing Complete");
}
}
/* --- BerkeleyDbSearcher.java --- */
package lia.tools;
import com.sleepycat.db.EnvironmentConfig;
import com.sleepycat.db.Environment;
import com.sleepycat.db.Transaction;
import com.sleepycat.db.Database;
import com.sleepycat.db.DatabaseException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.db.DbDirectory;
import java.io.File;
import java.io.IOException;
public class BerkeleyDbSearcher {
public static void main(String[] args)
throws IOException, DatabaseException
{
if (args.length != 1)
{
System.err.println("Usage: BerkeleyDbSearcher ");
System.exit(-1);
}
File dbHome = new File(args[0]);
EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setTransactional(true);
envConfig.setInitializeCache(true);
envConfig.setInitializeL

Re: lucene integration with relational database

2005-01-18 Thread Chris Hostetter

: Thanks for your tips. I am trying to get a more thorough understanding
: why this would be better.

1) give serious consideration to just putting all of your data in lucene
for the purposes of searching.  the initial example mentioned employees
and salaries and wanted to search for employees with certain names, and
salaries < $X ...lucene can do the "salary < $X" using a RangeFilter.

2) assuming you *must* combine your lucene query with your SQL query...

When your goal is performance, I don't think you'll ever be able to
find a truly generic solution for all situations -- the specifics matter.


For example:

  a) is your goal specifically to discount lucene results that don't meet
 a criteria specified in your DB?
  b) do you care about having an accurate number of total matches, or do
 you only care about "filtering" out results?

depending on the answers, a fairly fast way to "eliminate" results is to
only worry about the page of results you are looking at.  Consider an
employee search application which displays 10 results per page.  first you
do a lucene search by name, then you want to throw out any employees whose
salary is below $X.  use the Hits object from the lucene search to get the
unique IDs for the first 10 employees (which uses a very small, fixed
amount of memory and time, regardless of how big your index/result is)
then do a lookup in your DB using a query built from those 10 IDs, ala:

   select ... from ... where ID in (1234, 5678 ... 7890)

...(which should also be very fast assuming your DB has a primary key on
ID)

if the 10 IDs all match your SQL query then you're done.  If N don't match
your query, then you need to find the next N results from Hits that do; so
just repeat the steps above until you've gotten 10 viable results.

(given good statistics on your data, you can virtually eliminate the need
to execute more than a few iterations ... if nothing else, you can use the
ratio of misses/hits from the first SQL query -- N of 10 didn't match --
to decide how big to make your second query to ensure you'll get N good
ones.)
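A rough sketch of that loop (untested; the table and column names, the salary
cutoff, and the lucene field "id" are all made up):

    import java.sql.*;
    import java.util.*;
    import org.apache.lucene.search.Hits;

    public class DbFilteredPage {
      /** walk down the Hits until pageSize ids have survived the SQL criteria */
      public static List firstPage(Hits hits, Connection conn, int pageSize)
          throws java.io.IOException, SQLException {
        List accepted = new ArrayList();
        int next = 0;
        while (accepted.size() < pageSize && next < hits.length()) {
          // next batch of candidate ids, in lucene's ranking order
          List batch = new ArrayList();
          for (; batch.size() < pageSize && next < hits.length(); next++) {
            batch.add(hits.doc(next).get("id"));
          }
          // one round trip: which of these also satisfy the DB criteria?
          StringBuffer sql = new StringBuffer(
              "select id from employees where salary >= 50000 and id in (");
          for (int i = 0; i < batch.size(); i++) {
            sql.append(i == 0 ? "?" : ", ?");
          }
          sql.append(")");
          PreparedStatement ps = conn.prepareStatement(sql.toString());
          for (int i = 0; i < batch.size(); i++) {
            ps.setString(i + 1, (String) batch.get(i));
          }
          ResultSet rs = ps.executeQuery();
          Set ok = new HashSet();
          while (rs.next()) {
            ok.add(rs.getString(1));
          }
          rs.close();
          ps.close();
          // keep the survivors, preserving lucene's order
          for (int i = 0; i < batch.size() && accepted.size() < pageSize; i++) {
            if (ok.contains(batch.get(i))) {
              accepted.add(batch.get(i));
            }
          }
        }
        return accepted;
      }
    }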


-Hoss





Re: lucene integration with relational database

2005-01-18 Thread jian chen
Hi, Andy,

Thanks for your tips. I am trying to get a more thorough understanding
why this would be better.

It seems to me that the performance gain and memory reduction come
because you don't need to store all the lucene-matched ids in memory.
Is that right?

Thanks,

Jian

On Tue, 18 Jan 2005 11:22:39 -0800, Andy Goodell <[EMAIL PROTECTED]> wrote:
> I do these kinds of queries all the time.  I found that the fastest
> performance for my collections (millions of documents) came from
> subclassing Filter using the set of primary keys from the database to
> make the Filter, and then doing the query with the
> Searcher.search(query,filter) interface.  I was previously using the
> in memory merge, but the memory requirements were crashing the JVM
> when we had a lot of simultaneous users.
> 
> - andy g




Re: ParallellMultiSearcher Vs. One big Index

2005-01-18 Thread Doug Cutting
Ryan Aslett wrote:
> What I found was that for queries with one term (First Name), the large
> index beat the multiple indexes hands down (280 queries per second vs
> 170 Q/s).
> But for queries with multiple terms (Address), the multiple indexes beat
> out the large index (26 Q/s vs 16 Q/s).
> Btw, I'm running these on a 2 proc box with 16GB of ram.
> So what I'm trying to determine is whether there is some equation out there
> that can help me find the sweet spot for splitting my indexes.
What appears to be the bottleneck, CPU or i/o?  Is your test system 
multi-threaded?  I.e., is it attempting to execute many queries in 
parallel?  If you're CPU-bound then a single index should be fastest. 
Are you using compound format?  If you're i/o-bound, the non-compound 
format may be somewhat faster, as it permits more parallel i/o.  Is the 
index data on multiple drives?  If you're i/o bound then it should be 
faster to use multiple drives.  To permit even more parallel i/o over 
multiple drives you might consider using a pool of IndexReaders.  That 
way, with, e.g., striped data, each could be simultaneously reading 
different portions of the same file.

Doug


Re: lucene integration with relational database

2005-01-18 Thread Andy Goodell
I do these kinds of queries all the time.  I found that the fastest
performance for my collections (millions of documents) came from
subclassing Filter using the set of primary keys from the database to
make the Filter, and then doing the query with the
Searcher.search(query,filter) interface.  I was previously using the
in memory merge, but the memory requirements were crashing the JVM
when we had a lot of simultaneous users.
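For what it's worth, a bare-bones sketch of that kind of Filter (the field
name "id", String keys, and the lack of caching are simplifications):

    import java.io.IOException;
    import java.util.BitSet;
    import java.util.Iterator;
    import java.util.Set;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;

    public class PrimaryKeyFilter extends Filter {
      private final Set keys;  // primary keys (as Strings) returned by the DB query

      public PrimaryKeyFilter(Set keys) {
        this.keys = keys;
      }

      // turn the allowed primary keys into a BitSet of lucene document numbers
      public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs();
        try {
          for (Iterator it = keys.iterator(); it.hasNext();) {
            termDocs.seek(new Term("id", (String) it.next()));
            while (termDocs.next()) {
              bits.set(termDocs.doc());
            }
          }
        } finally {
          termDocs.close();
        }
        return bits;
      }
    }

    // usage: Hits hits = searcher.search(query, new PrimaryKeyFilter(keysFromDb));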

- andy g


On Sat, 15 Jan 2005 23:03:00 +0530, sunil goyal <[EMAIL PROTECTED]> wrote:
> Hi all,
> 
> Thanks for the answers. I was looking for a best practice guide to do
> the same. If anyone already had had some practical experience with
> such kind of queries, it will be great to know his thoughts.
> 
> Thanks
> 
> Regards
> Sunil
> 
> 
> On Sat, 15 Jan 2005 09:00:35 -0800, jian chen <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Still minor additions to the steps:
> >
> > 1) do lucene query and get the hits (keyed by the database primary
> > key, for example, employee id)
> >
> > 2) do database query and get the primary keys (i.e., employee id) for
> > the result rows, ordered by primary key
> >
> > 3) for each lucene query result, look into db query result and see if
> > the primary key is there (since db query result is sorted already by
> > primary key, so, a binary search could be applied)
> >
> > if the primary key is there, store this result, else, discard it
> >
> > 4) when top k results are obtained, send back to the user.
> >
> > How does this sound?
> >
> > Cheers,
> >
> > Jian
> >
> > On Sat, 15 Jan 2005 08:36:16 -0800, jian chen <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > To further the discussion. Would the following detailed steps work:
> > >
> > > 1) do lucene query and get the hits (keyed by the database primary
> > > key, for example, employee id)
> > >
> > > 2) do database query and get the primary keys (i.e., employee id) for
> > > the result rows, ordered by primary key
> > >
> > > 3) merge the two sets of primary keys (for example, in memory two-way
> > > merge) and take the top k records
> > >
> > > 4) display the top k result rows
> > >
> > > Cheers,
> > >
> > > Jian
> > >
> > > On Sat, 15 Jan 2005 12:40:04 +, Peter Pimley <[EMAIL PROTECTED]> 
> > > wrote:
> > > > sunil goyal wrote:
> > > >
> > > > >But can i do for instance a unified query where i want to take certain
> > > > >parameters (non-textual e.g. age < 30 ) from relational databases and
> > > > >keywords from the lucene index ?
> > > > >
> > > > >
> > > > >
> > > > When I have had to do this, I've done the lucene search first, and then
> > > > manually filtered out the hits that fail on other criteria.
> > > >
> > > > I'd suggest doing that first (as it's easiest) and then seeing whether
> > > > the performance is acceptable.
> > > >
> > > > -
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > >
> > > >
> > >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>




ParallellMultiSearcher Vs. One big Index

2005-01-18 Thread Ryan Aslett
 
Okay, so I'm trying to find the sweet spot for how many index segments I
should have.

I have 47 million records of contact data (Name + Address). I used 7
machines to build indexes that resulted in the following spread of
individual indexes:

1503000
150
1497000
5604750
5379750
1437000
1458000
1446000
1422000
1425000
1425000
1404000
1413000
1404000
4893750
4689750
4519500
4497750
46919250 Total Records
(The faster machines built the bigger indexes.)
I also joined all these indexes together into one large 47 million
record index, and ran my query pounder against both data sets, one using
the ParallelMultiSearcher for the multiple indexes, and one using a normal
IndexSearcher against the large index.
What I found was that for queries with one term (First Name), the large
index beat the multiple indexes hands down (280 queries per second vs
170 Q/s).
But for queries with multiple terms (Address), the multiple indexes beat
out the large index (26 Q/s vs 16 Q/s).
Btw, I'm running these on a 2 proc box with 16GB of ram.

So what I'm trying to determine is whether there is some equation out there
that can help me find the sweet spot for splitting my indexes. Most
queries are going to be multi-term, and clearly the big O of the single-term
search appears to be log n. (I verified with 470 million records;
the single-term search returns at 140 qps, consistent with what I
believe about search algorithms.)  The equation that I'm missing is the
big O for the union of the result sets that match particular terms.  I'm
assuming (I haven't looked at the source yet) that lucene finds all the
documents that match the first term, and all the documents that match
each subsequent term, and then finds the union between all the sets. Is
this correct?  Does anybody have any ideas on how to iron out an equation for
this?
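For reference, the two setups being compared look roughly like this (paths,
field name and query are placeholders):

    // multiple indexes searched in parallel...
    String[] parts = { "/indexes/part1", "/indexes/part2", "/indexes/part3" };
    Searchable[] searchables = new Searchable[parts.length];
    for (int i = 0; i < parts.length; i++) {
        searchables[i] = new IndexSearcher(parts[i]);
    }
    Searcher multi = new ParallelMultiSearcher(searchables);

    // ...vs one big merged index
    Searcher single = new IndexSearcher("/indexes/merged");

    Query query = new TermQuery(new Term("FirstName", "sam"));
    Hits fromMulti = multi.search(query);
    Hits fromSingle = single.search(query);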

Ryan




Re: StandardAnalyzer unit tests?

2005-01-18 Thread Erik Hatcher
On Jan 17, 2005, at 4:51 AM, Chris Lamprecht wrote:
> I submitted a testcase --
> http://issues.apache.org/bugzilla/show_bug.cgi?id=33134

I reviewed and applied your contributed unit test.  Thanks!
Erik


AW: How to get all field values from a Hits object?

2005-01-18 Thread Tim Lebedkov \(UPK\)
Thank You very much

--Tim

> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Chris Hostetter
> Sent: Tuesday, January 18, 2005 04:56
> To: Lucene Users List
> Subject: Re: How to get all field values from a Hits object?
> 
> 
> 
> : is it possible to get all different values for a field
> : from a Hits object and how to do this?
> 
> The wording of your question suggests that the Field you are interested
> in isn't a field which will have a fairly unique value for every doc
> (ie: not a "title", more likely an "author" or "category" field).
> Starting with that assumption, there is a fairly efficient way to get
> the information you want...
> 
> Assuming the total set of values for the Field you are interested in is
> small (relative to your index size), you can pre-compute a BitSet for
> each value indicating which docs match that value in the Field (using a
> TermFilter).  Then store those BitSets in a Map (keyed by field value).
> 
> Every time a search is performed, use a HitCollector that generates a
> BitSet containing the documents in your result; AND that BitSet against
> (a copy of) each BitSet in your Map.  All of the resulting BitSets with
> a non-zero cardinality represent values in your results.  (As an added
> bonus, the cardinality() of each BitSet is the total number of docs in
> your result that contain that value.)
> 
> Two caveats:
>    1) Every time you modify your index, you have to regen the
>       BitSets in your Map.
>    2) You have to know the set of all values for the field you are
>       interested in.  In many cases, this is easy to determine from the
>       source data while building the index, but it's also possible to
>       get it using IndexReader.termDocs(Term).
> 
> (I'm doing something like this to provide ancillary information about
> which categories of documents are most common in the user's search
> result, and what the exact number of documents in those categories is.)
> 
> -Hoss
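For the record, a rough sketch of the approach Chris describes above (using
QueryFilter where he says TermFilter; everything else here is illustrative):

    import java.io.IOException;
    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    public class FieldValueCounter {
      private final Map bitsByValue = new HashMap();  // value -> BitSet of docs with that value

      // pre-compute one BitSet per known value of the field (redo after index changes)
      public FieldValueCounter(IndexReader reader, String field, String[] values)
          throws IOException {
        for (int i = 0; i < values.length; i++) {
          QueryFilter f = new QueryFilter(new TermQuery(new Term(field, values[i])));
          bitsByValue.put(values[i], f.bits(reader));
        }
      }

      // value -> number of docs in this query's result that carry the value
      public Map countsFor(IndexSearcher searcher, Query query) throws IOException {
        final BitSet result = new BitSet(searcher.maxDoc());
        searcher.search(query, new HitCollector() {
          public void collect(int doc, float score) { result.set(doc); }
        });
        Map counts = new HashMap();
        for (Iterator it = bitsByValue.keySet().iterator(); it.hasNext();) {
          Object value = it.next();
          BitSet copy = (BitSet) ((BitSet) bitsByValue.get(value)).clone();
          copy.and(result);
          if (copy.cardinality() > 0) {
            counts.put(value, new Integer(copy.cardinality()));
          }
        }
        return counts;
      }
    }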
