Re: finalize delete without optimize

2004-12-14 Thread Otis Gospodnetic
Hello John,

Once you make your change locally, use 'cvs diff -u IndexWriter.java > 
indexwriter.patch' to make a patch.
Then open a new Bugzilla entry.
Finally, attach your patch to that entry.

Note that Document deletion is actually done from IndexReader, so your
patch may have to be on IndexReader, not IndexWriter.
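
For reference, a minimal sketch of how a deletion goes through IndexReader
in the 1.4-era API (the index path and the 'id' key field are made up for
illustration):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class DeleteByKey {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        // Marks matching documents as deleted; the data is only
        // physically reclaimed later, when segments get merged.
        int n = reader.delete(new Term("id", "42"));
        System.out.println("marked " + n + " document(s) as deleted");
        reader.close(); // commits the deletions
    }
}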

Thanks,
Otis


--- John Wang [EMAIL PROTECTED] wrote:

 Hi Otis:
 
  Thanks for your reply.
 
  I am looking for more of an API call than a tool. e.g.
 IndexWriter.finalizeDelete()
 
  If I implement this, how would I go about submitting a patch?
 
 thanks
 
 -John
 
 
 On Mon, 13 Dec 2004 22:24:12 -0800 (PST), Otis Gospodnetic
 [EMAIL PROTECTED] wrote:
  Hello John,
  
  I believe you didn't get any replies to this.  What you are describing
  cannot be done using the public API, but maaay (no source code on this
  machine, so I can't double-check that) be doable if you use some of the
  'internal' methods.
  
  I don't have the need for this, but others might, so it may be worth
  developing a tool that purges Documents marked as deleted without the
  expensive segment merging, iff that is possible.  If you put this tool
  under the appropriate org.apache.lucene... package, you'll get access to
  'internal' methods, of course.  If you end up creating this, we could
  stick it in the Sandbox, where we should really create a new section
  for handy command-line tools that manipulate the index.
  
  Otis
  
  
  
  
  --- John Wang [EMAIL PROTECTED] wrote:
  
   Hi:
  
   Is there a way to finalize delete, e.g. actually remove them from
   the segments and make sure the docIDs are contiguous again.
  
   The only explicit way to do this is by calling
   IndexWriter.optimize(). But this call does a lot more (also merges all
   the segments), hence is very expensive. Is there a way to simply just
   finalize the deletes without having to merge all the segments?
  
   If not, I'd be glad to submit an implementation of this
 feature
   if
   the Lucene devs agree this is useful.
  
   Thanks
  
   -John
  
  
  
  
  
 
 



Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Akmal Sarhan
that sounds very interesting but how do you handle queries like
select * from MY_TABLE where MY_NUMERIC_FIELD > 80

as far as I know you have only the range query so you will have to say

my_numeric_filed:[80 TO ??]
but this would not work in the above-mentioned example, or am I missing something?

regards

Akmal
Am Di, den 14.12.2004 schrieb Praveen Peddi um 16:07:
 Even we use lucene for a similar purpose, except that we index and store quite 
 a few fields. In fact I also update partial documents as people suggested. I 
 store all the indexed fields so I don't have to build the whole document 
 again while updating a partial document. The reason we do this is the 
 speed. I found the lucene search on a million objects is 4 to 5 times 
 faster than our oracle queries (of course this might be due to our pitiful 
 database design :) ). It works great so far. The only caveat that we had 
 till now was incremental updates. But now I am implementing real-time 
 updates so that the data in the lucene index is almost always in sync with data 
 in the database. So now, our search does not go to the database at all.
 
 Praveen
 - Original Message - 
 From: Kevin L. Cobb [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Tuesday, December 14, 2004 9:40 AM
 Subject: Opinions: Using Lucene as a thin database
 
 
 I use Lucene as a legitimate search engine, which is cool. But I am also
 using it as a simple database too. I build an index with a couple of
 keyword fields that allows me to retrieve values based on exact matches
 in those fields. This is all I need to do, so it works just fine for my
 needs. I also love the speed. The index is small enough that it is
 wicked fast. Was wondering if anyone out there was doing the same or if
 there are any dissenting opinions on using Lucene for this purpose.
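
 To make the pattern concrete, here is a hedged sketch of such a
 keyword-field lookup against the Lucene 1.4 API (the index path, field
 names, and values are invented for illustration):

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.Hits;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.TermQuery;

 public class ThinDbExample {
     public static void main(String[] args) throws Exception {
         // Keyword fields are stored and indexed untokenized,
         // so they behave like exact-match database columns.
         IndexWriter writer = new IndexWriter("/tmp/thindb", new StandardAnalyzer(), true);
         Document doc = new Document();
         doc.add(Field.Keyword("userId", "12345"));
         doc.add(Field.Keyword("email", "someone@example.com"));
         writer.addDocument(doc);
         writer.close();

         // Lookup: an exact match on the keyword field.
         IndexSearcher searcher = new IndexSearcher("/tmp/thindb");
         Hits hits = searcher.search(new TermQuery(new Term("userId", "12345")));
         if (hits.length() > 0)
             System.out.println("email = " + hits.doc(0).get("email"));
         searcher.close();
     }
 }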
 
 
 
 
 
 
 
 
 
 





Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Praveen Peddi
Hmm. So far all our fields are just strings. But I would guess you should be 
able to use Integer.MAX_VALUE or something on the upper bound. Or there 
might be a better way of doing it.
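
One wrinkle worth noting: range queries compare terms as strings, so "9"
sorts after "80". A common workaround (a sketch, assuming values fit a
fixed width; the width is arbitrary) is to zero-pad numbers at index time:

import java.text.DecimalFormat;

public class NumericPadding {
    // Pad numbers to a fixed width so lexicographic term order
    // matches numeric order, e.g. 80 -> "0000000080".
    private static final DecimalFormat PAD = new DecimalFormat("0000000000");

    public static String pad(long value) {
        return PAD.format(value);
    }

    public static void main(String[] args) {
        // Index pad(80) instead of "80"; then a range query like
        // my_numeric_filed:[0000000081 TO 9999999999] behaves as "> 80".
        System.out.println(pad(80));  // 0000000080
        System.out.println(pad(9));   // 0000000009, sorts before 0000000080
    }
}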

Praveen
- Original Message - 
From: Akmal Sarhan [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 10:23 AM
Subject: Re: Opinions: Using Lucene as a thin database


that sounds very interesting but how do you handle queries like
select * from MY_TABLE where MY_NUMERIC_FIELD > 80
as far as I know you have only the range query so you will have to say
my_numeric_filed:[80 TO ??]
but this would not work in the above-mentioned example, or am I missing something?
regards
Akmal
Am Di, den 14.12.2004 schrieb Praveen Peddi um 16:07:
Even we use lucene for a similar purpose, except that we index and store
quite a few fields. In fact I also update partial documents as people
suggested. I store all the indexed fields so I don't have to build the
whole document again while updating a partial document. The reason we do
this is the speed. I found the lucene search on a million objects is 4 to
5 times faster than our oracle queries (of course this might be due to our
pitiful database design :) ). It works great so far. The only caveat that
we had till now was incremental updates. But now I am implementing
real-time updates so that the data in the lucene index is almost always in
sync with data in the database. So now, our search does not go to the
database at all.

Praveen
- Original Message - 
From: Kevin L. Cobb [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 9:40 AM
Subject: Opinions: Using Lucene as a thin database

I use Lucene as a legitimate search engine, which is cool. But I am also
using it as a simple database too. I build an index with a couple of
keyword fields that allows me to retrieve values based on exact matches
in those fields. This is all I need to do, so it works just fine for my
needs. I also love the speed. The index is small enough that it is
wicked fast. Was wondering if anyone out there was doing the same or if
there are any dissenting opinions on using Lucene for this purpose.






Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread petite_abeille
On Dec 14, 2004, at 15:40, Kevin L. Cobb wrote:
Was wondering if anyone out there was doing the same or if
there are any dissenting opinions on using Lucene for this purpose.
ZOE [1] [2] takes the same approach and uses Lucene as a relational 
engine of sort.

However, for both practical and ideological reasons, it does not store 
any raw data in the Lucene indices themselves but instead uses JDBM [3] 
for that purpose.

All things considered, update issues aside, Lucene turns out to be a 
very flexible thin database.

Cheers,
PA.
[1] http://zoe.nu/
[2] http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/
[3] http://jdbm.sourceforge.net/



Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote:
Christoph,
I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox 
Uh oh, sorry, I'll try to get this checked in soonish. For me it's 
always one thing to do a prelim version of a piece of code, but another 
matter to get it correctly packaged.

so I've attached it here. Just repackage and test.

An alternate approach to find similar docs is to use all (possibly 
unique) tokens in the source doc to form a large query. This is code I use:

'srch' is the entire untokenized text of the source doc
'a' is the analyzer you want to use
'field' is the field you want to search on, e.g. 'contents' or 'body'
'stop' is an optional set of stop words to ignore
It returns a query, which you then use to search for similar docs, and 
then in the returned result you need to make sure you ignore the source 
doc, which will probably come back 1st. You can use stemming, synonyms, or 
fuzzy expansion for each term too.

public static Query formSimilarQuery(
        String srch, Analyzer a, String field, Set stop)
        throws org.apache.lucene.queryParser.ParseException, IOException
{
        // Tokenize the source text; the field name passed to the
        // analyzer ("foo") is irrelevant here.
        TokenStream ts = a.tokenStream("foo", new StringReader(srch));
        org.apache.lucene.analysis.Token t;
        BooleanQuery tmp = new BooleanQuery();
        Set already = new HashSet(); // dedupe terms
        while ((t = ts.next()) != null)
        {
                String word = t.termText();
                if (stop != null && stop.contains(word)) continue; // skip stop words
                if (!already.add(word)) continue;                  // skip duplicates
                TermQuery tq = new TermQuery(new Term(field, word));
                tmp.add(tq, false, false); // optional clause: not required, not prohibited
        }
        return tmp;
}
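
A hedged usage sketch (the index path, field names, and URL key are
assumptions; formSimilarQuery is the method above, assumed in the same
class):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Assumes formSimilarQuery() above is in scope.
public static void printSimilar(String indexPath, String sourceUrl, String sourceText)
        throws Exception {
    IndexSearcher searcher = new IndexSearcher(indexPath);
    Query q = formSimilarQuery(sourceText, new StandardAnalyzer(), "contents", null);
    Hits hits = searcher.search(q);
    for (int i = 0; i < hits.length(); i++) {
        // Skip the source doc itself, which tends to come back first.
        if (sourceUrl.equals(hits.doc(i).get("url"))) continue;
        System.out.println(hits.doc(i).get("url") + "  " + hits.score(i));
    }
    searcher.close();
}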

Regards,
Bruce Ritchie
http://www.jivesoftware.com/   


-Original Message-
From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
Sent: December 14, 2004 11:45 AM
To: Lucene Users List
Subject: TFIDF Implementation

Hi,
My current task/problem is the following: I need to implement 
TFIDF document term ranking using Jakarta Lucene to compute a 
similarity rank between arbitrary documents in the constructed index.
I saw from the API that there are similar functions already 
implemented in the class Similarity and DefaultSimilarity but 
I don't know exactly how to use them. At the time my index 
has about 25000 (small) documents and there are about 75000 
terms stored in total.
 Now, my question is simple. Has anybody done this before 
 or could you point me to another location for help?

Thanks for any help in advance.
Christoph 

--
Christoph Kiefer
Department of Informatics, University of Zurich
Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html


Re: [RFE] IndexWriter.updateDocument()

2004-12-14 Thread David Spencer
petite_abeille wrote:
Well, the subject says it all...
If there is one thing which is overly cumbersome in Lucene, it's 
updating documents, therefore this Request For Enhancement:

Please consider enhancing the IndexWriter API to include an 
updateDocument(...) method to take care of all the gory details involved 
in such an operation.
I agree, this is always a hassle to do right due to having to use 
IndexWriter and IndexReader and properly opening/closing them.

I have a prelim version of a batched index writer that I use. The code 
is kinda messy, but for discussion here's what it does:

Briefly the methods are:
// [1]
// the ctr has parameters:
//'batch size # docs' e.g. it will flush pending updates every 100 docs
//'batch freq' e.g. auto flush every 60 sec
// [2]
// queue a document to be added to the index
// 'key' is the primary key name e.g. 'url'
// 'val' is the primary key val e.g. 'http://www.tropo.com/'
// 'doc' is the doc to be added
update( String key, String val, Document doc)
// [3]
// queue a document for removal
// 'key' and 'val' are the params, as from [2]
remove( String key, String val)
// [4]
// periodic flush, called automatically or on demand, 5 steps:
// 1. call IndexReader.delete() on all pending (key,val) pairs
// 2. close IndexReader
// 3. call IndexWriter.add() on all pending documents
// 4. optionally call optimize()
// 5. close IndexWriter
flush()

//
So in normal usage you just keep calling update() and it periodically 
flushes the pending updates to the index. By its nature this uses memory; 
however, it's tunable as to how many documents it'll queue in memory.
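
For discussion, a minimal sketch of what such a flush() could look like
against the Lucene 1.4 API -- not the actual code described above; the
pending collections, index path, and analyzer are assumptions:

import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class BatchedUpdater {
    private String indexPath = "/path/to/index";
    private Map pendingDeletes;  // key name -> key value
    private List pendingAdds;    // queued Documents

    // Deletes must go through IndexReader and adds through IndexWriter,
    // and the two cannot both be modifying the index at the same time.
    public synchronized void flush() throws Exception {
        IndexReader reader = IndexReader.open(indexPath);
        for (Iterator it = pendingDeletes.entrySet().iterator(); it.hasNext();) {
            Map.Entry e = (Map.Entry) it.next();
            reader.delete(new Term((String) e.getKey(), (String) e.getValue()));
        }
        reader.close(); // commit the deletes

        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        for (Iterator it = pendingAdds.iterator(); it.hasNext();) {
            writer.addDocument((Document) it.next());
        }
        writer.close();
        pendingDeletes.clear();
        pendingAdds.clear();
    }
}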

Does the algorithm above, especially flush(), sound correct? It seems to work 
right for me and I can post this if people want to see it...

- Dave

Thanks in advance.
Cheers,
PA.


RE: Opinions: Using Lucene as a thin database

2004-12-14 Thread Monsur Hossain
 My concern is that this just shifts the scaling issue to 
 Lucene, and I haven't found much info on how to scale Lucene 
 vertically.  

By vertically, of course, I meant horizontally.  Basically scaling
it across servers as one might do with a relational database.






Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Chris Hostetter
: select * from MY_TABLE where MY_NUMERIC_FIELD > 80
:
: as far as I know you have only the range query so you will have to say
:
: my_numeric_filed:[80 TO ??]
: but this would not work in the above-mentioned example, or am I missing something?

RangeQuery allows you an open ended range -- you can tell the
QueryParser to leave your range open ended using the keyword null,
ie...

my_numeric_filed:[80 TO null]
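
The same open-ended range can be built programmatically (a sketch; the
zero-padded values assume the padding convention discussed earlier in the
thread):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.RangeQuery;

public class OpenEndedRange {
    public static void main(String[] args) {
        // A null upper term means no upper bound; 'true' makes the
        // lower bound inclusive. Values should be zero-padded at index
        // time so that string order matches numeric order.
        RangeQuery q = new RangeQuery(
                new Term("my_numeric_filed", "0000000080"), null, true);
        System.out.println(q.toString());
    }
}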



-Hoss





Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Otis Gospodnetic wrote:
You can also see the 'Books like this' example from here
https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
Well done -- it uses a term vector, instead of reparsing the orig doc, to 
form the similarity query. Also I like the way you exclude the source 
doc in the query; I didn't think of doing that in my code.

I don't trust calling vector.size() and vector.getTerms() within the 
loop, but I haven't looked at the code to see if it calculates the 
results each time or caches them...
Otis
--- Bruce Ritchie [EMAIL PROTECTED] wrote:

Christoph,
I'm not entirely certain if this is what you want, but a while back
David Spencer did code up a 'More Like This' class which can be used
for generating similarities between documents. I can't seem to find
this class in the sandbox so I've attached it here. Just repackage
and test.
Regards,
Bruce Ritchie
http://www.jivesoftware.com/   


-Original Message-
From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
Sent: December 14, 2004 11:45 AM
To: Lucene Users List
Subject: TFIDF Implementation

Hi,
My current task/problem is the following: I need to implement 
TFIDF document term ranking using Jakarta Lucene to compute a 
similarity rank between arbitrary documents in the constructed
index.
I saw from the API that there are similar functions already 
implemented in the class Similarity and DefaultSimilarity but 
I don't know exactly how to use them. At the time my index 
has about 25000 (small) documents and there are about 75000 
terms stored in total.
Now, my question is simple. Has anybody done this before 
or could you point me to another location for help?

Thanks for any help in advance.
Christoph 

--
Christoph Kiefer
Department of Informatics, University of Zurich
Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html



[RFE] IndexWriter.updateDocument()

2004-12-14 Thread petite_abeille
Well, the subject says it all...
If there is one thing which is overly cumbersome in Lucene, it's 
updating documents, therefore this Request For Enhancement:

Please consider enhancing the IndexWriter API to include an 
updateDocument(...) method to take care of all the gory details 
involved in such an operation.

Thanks in advance.
Cheers,
PA.


RE: TFIDF Implementation

2004-12-14 Thread Otis Gospodnetic
You can also see the 'Books like this' example from here:
https://secure.manning.com/catalog/view.php?book=hatcher2&item=source

Otis

--- Bruce Ritchie [EMAIL PROTECTED] wrote:

 Christoph,
 
 I'm not entirely certain if this is what you want, but a while back
 David Spencer did code up a 'More Like This' class which can be used
 for generating similarities between documents. I can't seem to find
 this class in the sandbox so I've attached it here. Just repackage
 and test.
 
 
 Regards,
 
 Bruce Ritchie
 http://www.jivesoftware.com/   
 
  -Original Message-
  From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
  Sent: December 14, 2004 11:45 AM
  To: Lucene Users List
  Subject: TFIDF Implementation
  
  Hi,
  My current task/problem is the following: I need to implement 
  TFIDF document term ranking using Jakarta Lucene to compute a 
  similarity rank between arbitrary documents in the constructed
 index.
  I saw from the API that there are similar functions already 
  implemented in the class Similarity and DefaultSimilarity but 
  I don't know exactly how to use them. At the time my index 
  has about 25000 (small) documents and there are about 75000 
  terms stored in total.
  Now, my question is simple. Has anybody done this before 
  or could you point me to another location for help?
  
  Thanks for any help in advance.
  Christoph 
  
  --
  Christoph Kiefer
  
  Department of Informatics, University of Zurich
  
  Office: Uni Irchel 27-K-32
  Phone:  +41 (0) 44 / 635 67 26
  Email:  [EMAIL PROTECTED]
  Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
  
  
 



RE: TFIDF Implementation

2004-12-14 Thread Bruce Ritchie
 
  You can also see 'Books like this' example from here 
  
 https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
 
 Well done, uses a term vector, instead of reparsing the orig 
 doc, to form the similarity query. Also I like the way you 
 exclude the source doc in the query, I didn't think of doing 
 that in my code.

I agree, it's a good way to exclude the source doc.
 
 I don't trust calling vector.size() and vector.getTerms() 
 within the loop but I haven't looked at the code to see if it 
 calculates  the results each time or caches them...

From the code I looked at, those calls don't recalculate on every call. 


Regards,

Bruce Ritchie




RE: Opinions: Using Lucene as a thin database

2004-12-14 Thread Otis Gospodnetic
Well, one could always partition an index, distribute pieces of it
horizontally across multiple 'search servers' and use the built-in
RMI-based and Parallel search feature.  Nutch uses something similar
for search scaling.
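
Roughly, such a setup could look like this with Lucene 1.4's RMI support
(host names and registry paths are invented; each server is assumed to
have exported its IndexSearcher via RemoteSearchable and an RMI registry):

import java.rmi.Naming;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;

public class DistributedSearch {
    public static void main(String[] args) throws Exception {
        // Each search server binds a RemoteSearchable wrapping its
        // local IndexSearcher into an RMI registry.
        Searchable shard1 = (Searchable) Naming.lookup("//search1/index");
        Searchable shard2 = (Searchable) Naming.lookup("//search2/index");

        // Queries all shards in parallel and merges the results.
        ParallelMultiSearcher searcher =
                new ParallelMultiSearcher(new Searchable[] { shard1, shard2 });
        Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
        System.out.println(hits.length() + " total hits");
        searcher.close();
    }
}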

Otis


--- Monsur Hossain [EMAIL PROTECTED] wrote:

  My concern is that this just shifts the scaling issue to 
  Lucene, and I haven't found much info on how to scale Lucene 
  vertically.  
 
 By vertically, of course, I meant horizontally.  Basically
 scaling
 it across servers as one might do with a relational database.
 
 
 



RE: TFIDF Implementation

2004-12-14 Thread Bruce Ritchie
  From the code I looked at, those calls don't recalculate on 
 every call. 
 
 I was referring to this fragment below from BooksLikeThis.docsLike(), 
 and was mentioning it as the javadoc 
 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html 
 does not say that the values returned by size() and getTerms() are 
 cached, and while the impl may cache them (haven't checked) it's not 
 guaranteed, thus it's safer to put the size() and getTerms() calls 
 outside the loop.
 
   for (int j = 0; j < vector.size(); j++) {
     TermQuery tq = new TermQuery(
         new Term("subject", vector.getTerms()[j]));

I agree with your overall point that it's probably best to put those calls 
outside of the loop; I was just saying that I did look at the implementation 
and the calls do not recalculate anything. I'm sorry I didn't explain myself 
clearly enough.


Regards,

Bruce Ritchie




Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote:
From the code I looked at, those calls don't recalculate on 
every call. 

I was referring to this fragment below from BooksLikeThis.docsLike(), 
and was mentioning it as the javadoc 
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html 
does not say that the values returned by size() and getTerms() are 
cached, and while the impl may cache them (haven't checked) it's not 
guaranteed, thus it's safer to put the size() and getTerms() calls 
outside the loop.

 for (int j = 0; j < vector.size(); j++) {
   TermQuery tq = new TermQuery(
       new Term("subject", vector.getTerms()[j]));

I agree with your overall point that it's probably best to put those calls outside of the loop; I was just saying that I did look at the implementation and the calls do not recalculate anything. I'm sorry I didn't explain myself clearly enough.
Oh oh oh, sorry, 10-4, no prob.

Regards,
Bruce Ritchie


Re: TFIDF Implementation

2004-12-14 Thread David Spencer
Bruce Ritchie wrote:
 

You can also see 'Books like this' example from here 

https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
Well done, uses a term vector, instead of reparsing the orig 
doc, to form the similarity query. Also I like the way you 
exclude the source doc in the query, I didn't think of doing 
that in my code.

I agree, it's a good way to exclude the source doc.
 

I don't trust calling vector.size() and vector.getTerms() 
within the loop but I haven't looked at the code to see if it 
calculates  the results each time or caches them...

From the code I looked at, those calls don't recalculate on every call. 
I was referring to this fragment below from BooksLikeThis.docsLike(), 
and was mentioning it as the javadoc 
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html 
does not say that the values returned by size() and getTerms() are 
cached, and while the impl may cache them (haven't checked) it's not 
guaranteed, thus it's safer to put the size() and getTerms() calls 
outside the loop.

 for (int j = 0; j < vector.size(); j++) {
   TermQuery tq = new TermQuery(
       new Term("subject", vector.getTerms()[j]));

Regards,
Bruce Ritchie


Indexing a large number of DB records

2004-12-14 Thread Homam S.A.
I'm trying to index a large number of records from the
DB (a few millions). Each record will be stored as a
document with about 30 fields, most of them are
UnStored and represent small strings or numbers. No
huge DB Text fields.

But I'm running out of memory very fast, and the
indexing is slowing down to a crawl once I hit around
1500 records. The problem is each document is holding
references to the string objects returned from
ToString() on the DB field, and the IndexWriter is
holding references to all these document objects in
memory, so the garbage collector never gets a chance
to clean these up.

How do you guys go about indexing a large DB table?
Here's a snippet of my code (this method is called for
each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
    Document doc = new Document();
    for (int i = 0; i < BrowseFieldNames.Length; i++) {
        // Add each column as an unstored (indexed-only) field.
        doc.Add(Field.UnStored(BrowseFieldNames[i],
            rdr.GetValue(i).ToString()));
    }
    iw.AddDocument(doc);
}








A question about scoring function in Lucene

2004-12-14 Thread Nhan Nguyen Dang
Hi all,
Lucene scores a document based on the correlation between
the query q and document d:
(this is the raw function; I don't pay attention to the 
boost_t, coord_q_d factors)

score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )   (*)

Could anybody explain it in detail ? Or are there any
papers, documents about this function ? Because:

I have also read the book: Modern Information
Retrieval, author: Ricardo Baeza-Yates and Berthier 
Ribeiro-Neto, Addison Wesley (Hope you have read it
too). In page 27, they also suggest a scoring funtion
for vector model based on the correlation between
query q and document d as follow (I use different
symbol):

score_d(d, q) = sum_t( weight_t_d * weight_t_q ) / ( norm_d * norm_q )   (**)

where weight_t_d = tf_d * idf_t
      weight_t_q = tf_q * idf_t
      norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
      norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )

Expanding (**):

score_d(d, q) = sum_t( tf_q * idf_t * tf_d * idf_t ) / ( norm_d * norm_q )   (***)

The two functions, (*) and (***), have 2 differences:
1. In (***), the sum_t is just over the numerator, but
in (*), the sum_t is over everything. So, with
norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is
calculated twice. Is this right? Please explain.

2. There is no factor that defines the norm of the document, norm_d,
in function (*). Can you explain this? What is the
role of the factor norm_d_t?

One more question: could anybody give me documents or
papers that explain this function in detail, so that when I
apply Lucene to my system, I can adapt the documents
and the fields and still receive correct
scoring information from Lucene?

Best regard,
Thanks every body,

=
Đặng Nhân 









LUCENE1.4.1 - LUCENE1.4.2 - LUCENE1.4.3 Exception

2004-12-14 Thread Karthik N S
Hi Guys


Could somebody tell me why I am getting this exception, please?

Sys Specifications

O/S: Linux Gentoo
Appserver: Apache Tomcat/4.1.24
JDK: build 1.4.2_03-b02
Lucene: 1.4.1, 1.4.2, 1.4.3

Note: this exception is displayed on every 2nd query after Tomcat is
started


java.io.IOException: Stale NFS file handle
        at java.io.RandomAccessFile.readBytes(Native Method)
        at java.io.RandomAccessFile.read(RandomAccessFile.java:307)
        at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:420)
        at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
        at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:220)
        at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
        at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
        at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
        at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:142)
        at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
        at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:143)
        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:137)
        at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:253)
        at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:69)
        at org.apache.lucene.search.Similarity.idf(Similarity.java:255)
        at org.apache.lucene.search.TermQuery$TermWeight.sumOfSquaredWeights(TermQuery.java:47)
        at org.apache.lucene.search.Query.weight(Query.java:86)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
        at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:251)





  WITH WARM REGARDS
  HAVE A NICE DAY
  [ N.S.KARTHIK]







RE: A question about scoring function in Lucene

2004-12-14 Thread Chuck Williams
Nhan,

Re.  your two differences:

1 is not a difference.  Norm_d and Norm_q are both independent of t, so summing 
over t has no effect on them.  I.e., Norm_d * Norm_q is constant wrt the 
summation, so it doesn't matter whether the sum is over just the numerator or over 
the entire fraction; the result is the same.
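
In symbols, the point is just that a constant factors out of (or into) the sum:

\frac{\sum_t w_{t,d}\, w_{t,q}}{\mathrm{norm}_d \, \mathrm{norm}_q}
  \;=\; \sum_t \frac{w_{t,d}\, w_{t,q}}{\mathrm{norm}_d \, \mathrm{norm}_q},
\qquad \text{since } \mathrm{norm}_d \text{ and } \mathrm{norm}_q \text{ do not depend on } t.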

2 is a difference.  Lucene uses Norm_q instead of Norm_d because Norm_d is too 
expensive to compute, especially in the presence of incremental indexing.  
E.g., adding or deleting any document changes the idf's, so if Norm_d was used 
it would have to be recomputed for ALL documents.  This is not feasible.

Another point you did not mention is that the idf term is squared (in both of 
your formulas).  Salton, the originator of the vector space model, dropped one 
idf factor from his formula as it improved results empirically.  More recent 
theoretical justifications of tf*idf provide intuitive explanations of why idf 
should only be included linearly.  tf is best thought of as the real vector 
entry, while idf is a weighting term on the components of the inner product.  
E.g., see the excellent paper by Robertson, Understanding inverse document 
frequency: on theoretical arguments for IDF, available here:  
http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl if you sign up for an eval.

It's easy to correct for idf^2 by using a custom Similarity that takes a 
final square root.
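
A minimal sketch of that correction (the class name is made up; since idf is 
applied at search time, installing it with Searcher.setSimilarity should 
suffice):

import org.apache.lucene.search.DefaultSimilarity;

public class SqrtIdfSimilarity extends DefaultSimilarity {
    // idf enters the score once via the query weight and once via the
    // term weight; taking sqrt of each occurrence leaves idf linear overall.
    public float idf(int docFreq, int numDocs) {
        return (float) Math.sqrt(super.idf(docFreq, numDocs));
    }
}

Then, e.g., searcher.setSimilarity(new SqrtIdfSimilarity()) before searching.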

Chuck

   -Original Message-
   From: Vikas Gupta [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, December 14, 2004 9:32 PM
   To: Lucene Users List
   Subject: Re: A question about scoring function in Lucene
   
   Lucene uses the vector space model. To understand that:
   
   -Read section 2.1 of Space optimizations for Total Ranking paper
   (Linked
   here http://lucene.sourceforge.net/publications.html)
   -Read section 6 to 6.4 of
   http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
   -Read section 1 of
   http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps
   
   Vikas
   
   On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:
   
 Hi all,
 Lucene scores a document based on the correlation between
 the query q and document d:
 (this is the raw function; I don't pay attention to the
 boost_t, coord_q_d factors)
   
 score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )   (*)
   
Could anybody explain it in detail ? Or are there any
papers, documents about this function ? Because:
   
I have also read the book: Modern Information
Retrieval, author: Ricardo Baeza-Yates and Berthier
Ribeiro-Neto, Addison Wesley (Hope you have read it
too). In page 27, they also suggest a scoring funtion
for vector model based on the correlation between
query q and document d as follow (I use different
symbol):
   
 score_d(d, q) = sum_t( weight_t_d * weight_t_q ) / ( norm_d * norm_q )   (**)
   
 where weight_t_d = tf_d * idf_t
       weight_t_q = tf_q * idf_t
       norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
       norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )
   
 Expanding (**):
   
 score_d(d, q) = sum_t( tf_q * idf_t * tf_d * idf_t ) / ( norm_d * norm_q )   (***)
   
 The two functions, (*) and (***), have 2 differences:
 1. In (***), the sum_t is just over the numerator, but
 in (*), the sum_t is over everything. So, with
 norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is
 calculated twice. Is this right? Please explain.
   
 2. There is no factor that defines the norm of the document, norm_d,
 in function (*). Can you explain this? What is the
 role of the factor norm_d_t?
   
 One more question: could anybody give me documents or
 papers that explain this function in detail, so that when I
 apply Lucene to my system, I can adapt the documents
 and the fields and still receive correct
 scoring information from Lucene?
   
Best regard,
Thanks every body,
   
=
 Đặng Nhân
   



Re: finalize delete without optimize

2004-12-14 Thread Otis Gospodnetic
Hello John,

I believe you didn't get any replies to this.  What you are describing
cannot be done using the public API, but maaay (no source code on this
machine, so I can't double-check that) be doable if you use some of the
'internal' methods.  

I don't have the need for this, but others might, so it may be worth
developing a tool that purges Documents marked as deleted without the
expensive segment merging, iff that is possible.  If you put this tool
under the appropriate org.apache.lucene... package, you'll get access to
'internal' methods, of course.  If you end up creating this, we could
stick it in the Sandbox, where we should really create a new section
for handy command-line tools that manipulate the index.

Otis


--- John Wang [EMAIL PROTECTED] wrote:

 Hi:
 
Is there a way to finalize delete, e.g. actually remove them from
 the segments and make sure the docIDs are contiguous again.
 
    The only explicit way to do this is by calling
 IndexWriter.optimize(). But this call does a lot more (also merges all
 the segments), hence is very expensive. Is there a way to simply just
 finalize the deletes without having to merge all the segments?
 
 If not, I'd be glad to submit an implementation of this feature
 if
 the Lucene devs agree this is useful.
 
 Thanks
 
 -John
 


Re: finalize delete without optimize

2004-12-14 Thread John Wang
Hi Otis:

 Thanks for your reply.

 I am looking for more of an API call than a tool. e.g.
IndexWriter.finalizeDelete()

 If I implement this, how would I go about submitting a patch?

thanks

-John


On Mon, 13 Dec 2004 22:24:12 -0800 (PST), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 Hello John,
 
 I believe you didn't get any replies to this.  What you are describing
 cannot be done using the public API, but maaay (no source code on this
 machine, so I can't double-check that) be doable if you use some of the
 'internal' methods.
 
 I don't have the need for this, but others might, so it may be worth
 developing a tool that purges Documents marked as deleted without the
 expensive segment merging, iff that is possible.  If you put this tool
 under the appropriate org.apache.lucene... package, you'll get access to
 'internal' methods, of course.  If you end up creating this, we could
 stick it in the Sandbox, where we should really create a new section
 for handy command-line tools that manipulate the index.
 
 Otis
 
 
 
 
 --- John Wang [EMAIL PROTECTED] wrote:
 
  Hi:
 
 Is there a way to finalize delete, e.g. actually remove them from
  the segments and make sure the docIDs are contiguous again.
 
 The only explicit way to do this is by calling
  IndexWriter.optimize(). But this call does a lot more (also merges all
  the segments), hence is very expensive. Is there a way to simply just
  finalize the deletes without having to merge all the segments?
 
  If not, I'd be glad to submit an implementation of this feature
  if
  the Lucene devs agree this is useful.
 
  Thanks
 
  -John
 


Re: Hit 2 has score less than Hit 3

2004-12-14 Thread Erik Hatcher
On Dec 14, 2004, at 4:53 AM, Vikas Gupta wrote:
I have come across a scenario where the hits returned are not sorted. Or
maybe they are sorted but the explanation is not correct.

Take a look at
http://cofferdam.cs.utexas.edu:8080/search.jsp?query=space+odyssey&hitsPerPage=10&hitsPerSite=0
This site was down when I tried to access it.
Look at the top 3 results.
Score of Hit 1 is 1.0188559
Score of Hit 2 is 0.9934416
Score of Hit 3 is 1.0188559
I can't explain how the score of hit 2 can be < the score of hit 3. I thought
the hits that were returned were sorted.
Hits should be in descending score order by default.  Are you using the  
new Sort facility at all?  Are you walking through Hits properly (i.e.  
hits.doc(i), i is the i-th hit, not document id i)?   What version of  
Lucene are you using?

Sorry - figured I'd rattle off some standard troubleshooting questions  
:)

FYI, the docs corresponding to hits 1,2 and 3 have exactly the same
scoring fields(By scoring fields, I mean the fields used in the query).
Use the IndexSearcher.explain() feature to get the real scoop on why a  
score is computed the way it is.
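
For example, a small sketch of dumping explanations for the top hits (index 
path, field, and query text are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ExplainHits {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Query query = QueryParser.parse("space odyssey", "contents", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        for (int i = 0; i < Math.min(3, hits.length()); i++) {
            // hits.id(i) is the internal doc id of the i-th hit.
            Explanation e = searcher.explain(query, hits.id(i));
            System.out.println("Hit " + i + " (score " + hits.score(i) + "):");
            System.out.println(e.toString());
        }
        searcher.close();
    }
}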

Erik


Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Praveen Peddi
Even we use lucene for a similar purpose, except that we index and store quite 
a few fields. In fact I also update partial documents as people suggested. I 
store all the indexed fields so I don't have to build the whole document 
again while updating a partial document. The reason we do this is the 
speed. I found the lucene search on a million objects is 4 to 5 times 
faster than our oracle queries (of course this might be due to our pitiful 
database design :) ). It works great so far. The only caveat that we had 
till now was incremental updates. But now I am implementing real-time 
updates so that the data in the lucene index is almost always in sync with data 
in the database. So now, our search does not go to the database at all.

Praveen
- Original Message - 
From: Kevin L. Cobb [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 9:40 AM
Subject: Opinions: Using Lucene as a thin database

I use Lucene as a legitimate search engine, which is cool. But I am also
using it as a simple database too. I build an index with a couple of
keyword fields that allows me to retrieve values based on exact matches
in those fields. This is all I need to do, so it works just fine for my
needs. I also love the speed. The index is small enough that it is
wicked fast. Was wondering if anyone out there was doing the same or if
there are any dissenting opinions on using Lucene for this purpose.






Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Nader Henein
How big do you expect it to get, and how often do you expect to update 
it? We've been using Lucene for about 1M records (19 fields each) with 
incremental updates every 10 minutes. The performance during updates 
wasn't wonderful, so it took some seriously intense code to sort that 
out. As you mentioned, it comes down to what you need the thin DB for: 
Lucene is a wonderful search engine, but if I were looking for a fast and 
dirty relational DB, MySQL wins hands down. Put them both together and 
you've really got something.

My 2 cents
Nader Henein
Kevin L. Cobb wrote:
I use Lucene as a legitimate search engine, which is cool. But I am also
using it as a simple database too. I build an index with a couple of
keyword fields that allows me to retrieve values based on exact matches
in those fields. This is all I need to do, so it works just fine for my
needs. I also love the speed. The index is small enough that it is
wicked fast. Was wondering if anyone out there was doing the same or if
there are any dissenting opinions on using Lucene for this purpose.




 



RE: TFIDF Implementation

2004-12-14 Thread Bruce Ritchie
Christoph,

I'm not entirely certain if this is what you want, but a while back David 
Spencer did code up a 'More Like This' class which can be used for generating 
similarities between documents. I can't seem to find this class in the sandbox 
so I've attached it here. Just repackage and test.


Regards,

Bruce Ritchie
http://www.jivesoftware.com/   

 -Original Message-
 From: Christoph Kiefer [mailto:[EMAIL PROTECTED] 
 Sent: December 14, 2004 11:45 AM
 To: Lucene Users List
 Subject: TFIDF Implementation
 
 Hi,
 My current task/problem is the following: I need to implement 
 TFIDF document term ranking using Jakarta Lucene to compute a 
 similarity rank between arbitrary documents in the constructed index.
 I saw from the API that there are similar functions already 
 implemented in the class Similarity and DefaultSimilarity but 
 I don't know exactly how to use them. At the time my index 
 has about 25000 (small) documents and there are about 75000 
 terms stored in total.
 Now, my question is simple. Has anybody done this before 
 or could you point me to another location for help?
 
 Thanks for any help in advance.
 Christoph 
 
 --
 Christoph Kiefer
 
 Department of Informatics, University of Zurich
 
 Office: Uni Irchel 27-K-32
 Phone:  +41 (0) 44 / 635 67 26
 Email:  [EMAIL PROTECTED]
 Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
 
 

TFIDF Implementation

2004-12-14 Thread Christoph Kiefer
Hi,
My current task/problem is the following: I need to implement TFIDF
document term ranking using Jakarta Lucene to compute a similarity rank
between arbitrary documents in the constructed index.
I saw from the API that there are similar functions already implemented
in the class Similarity and DefaultSimilarity but I don't know exactly
how to use them. At the time my index has about 25000 (small) documents
and there are about 75000 terms stored in total.
Now, my question is simple. Has anybody done this before or could
you point me to another location for help?

Thanks for any help in advance.
Christoph 

-- 
Christoph Kiefer

Department of Informatics, University of Zurich

Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html




RE: Opinions: Using Lucene as a thin database

2004-12-14 Thread Otis Gospodnetic
You can see a Flickr-like tag (lookup) system at my Simpy site (
http://www.simpy.com ).  It uses Lucene as the backend for lookups, but
still uses an RDBMS as the primary storage.

I find that keeping the RDBMS and Lucene indices in sync is a bit of a pain
and error prone, so a _thin_ storage layer with simple requirements will
be okay with just using Lucene, while applications with more complex
domain models will quickly run into limitations (a 'using the wrong tool
for the job' type of problem).

Otis

--- Monsur Hossain [EMAIL PROTECTED] wrote:

 I think this is a great idea, and one that I've been mulling over to
 implement keyword lookups (similar to Flickr.com's tag system).  I
 believe the advantage over a relational database comes from Lucene's
 inverted index, which is highly optimized for this kind of lookup.  
 
 My concern is that this just shifts the scaling issue to Lucene, and
 I
 haven't found much info on how to scale Lucene vertically.  
 
 
 
 
  -Original Message-
  From: Kevin L. Cobb [mailto:[EMAIL PROTECTED] 
  Sent: Tuesday, December 14, 2004 9:40 AM
  To: [EMAIL PROTECTED]
  Subject: Opinions: Using Lucene as a thin database
  
  
  I use Lucene as a legitimate search engine which is cool. 
  But, I am also using it as a simple database too. I build an 
  index with a couple of keyword fields that allows me to 
  retrieve values based on exact matches in those fields. This 
  is all I need to do so it works just fine for my needs. I 
  also love the speed. The index is small enough that it is 
  wicked fast. Was wondering if anyone out there was doing the 
  same of it there are any dissenting opinions on using Lucene 
  for this purpose. 
  
   
  
   
  
   
  
  
 
 


Re: Opinions: Using Lucene as a thin database

2004-12-14 Thread Daniel Naber
On Tuesday 14 December 2004 20:13, Monsur Hossain wrote:

 My concern is that this just shifts the scaling issue to Lucene, and I
 haven't found much info on how to scale Lucene vertically. 

You can easily use MultiSearcher to search over several indices. If you 
want the distribution to be more transparent, have a look at Nutch.
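
A small sketch of that (index paths are illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // Each index can be built and maintained separately;
        // MultiSearcher merges the results into one Hits list.
        Searchable[] shards = {
            new IndexSearcher("/indexes/part1"),
            new IndexSearcher("/indexes/part2")
        };
        MultiSearcher searcher = new MultiSearcher(shards);
        Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
        System.out.println(hits.length() + " hits across all indices");
        searcher.close();
    }
}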

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: Indexing a large number of DB records

2004-12-14 Thread Otis Gospodnetic
Hello,

There are a few things you can do:

1) Don't just pull all rows from the DB at once.  Do that in batches.

2) If you can get a Reader from your SqlDataReader, consider this:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)

3) Give the JVM more memory to play with by using -Xms and -Xmx JVM
parameters

4) See IndexWriter's minMergeDocs parameter.

5) Are you calling optimize() at some point by any chance?  Leave that
call for the end.

1500 documents with 30 columns of short String/number values is not a
lot.  You may be doing something else, not Lucene-related, that's slowing
things down.
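
As a rough sketch of points 1 and 4 together (batch size, paths, and field
names are illustrative, not prescriptive):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/path/to/index",
                new StandardAnalyzer(), true);
        // Buffer more documents in RAM before a segment is flushed;
        // in Lucene 1.4 this is the public minMergeDocs field.
        writer.minMergeDocs = 100;
        writer.mergeFactor = 10;

        for (int i = 0; i < 100000; i++) {   // stand-in for a DB cursor
            Document doc = new Document();
            doc.add(Field.UnStored("col0", "value " + i));
            writer.addDocument(doc);
        }
        // optimize() once, at the very end.
        writer.optimize();
        writer.close();
    }
}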

Otis


--- Homam S.A. [EMAIL PROTECTED] wrote:

 I'm trying to index a large number of records from the
 DB (a few millions). Each record will be stored as a
 document with about 30 fields, most of them are
 UnStored and represent small strings or numbers. No
 huge DB Text fields.
 
  But I'm running out of memory very fast, and the
  indexing is slowing down to a crawl once I hit around
  1500 records. The problem is each document is holding
  references to the string objects returned from
  ToString() on the DB field, and the IndexWriter is
  holding references to all these document objects in
  memory, so the garbage collector never gets a chance
  to clean these up.
 
 How do you guys go about indexing a large DB table?
 Here's a snippet of my code (this method is called for
 each record in the DB):
 
  private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
      Document doc = new Document();
      for (int i = 0; i < BrowseFieldNames.Length; i++) {
          doc.Add(Field.UnStored(BrowseFieldNames[i],
              rdr.GetValue(i).ToString()));
      }
      iw.AddDocument(doc);
  }
 
 
 
 
   
 
 




Re: A question about scoring function in Lucene

2004-12-14 Thread Vikas Gupta
Lucene uses the vector space model. To understand that:

-Read section 2.1 of Space optimizations for Total Ranking paper (Linked
here http://lucene.sourceforge.net/publications.html)
-Read section 6 to 6.4 of
http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
-Read section 1 of
http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps

Vikas

On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:

 Hi all,
 Lucene scores a document based on the correlation between
 the query q and document d:
 (this is the raw function; I don't pay attention to the
 boost_t, coord_q_d factors)

 score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )   (*)

 Could anybody explain it in detail ? Or are there any
 papers, documents about this function ? Because:

 I have also read the book: Modern Information
 Retrieval, author: Ricardo Baeza-Yates and Berthier
 Ribeiro-Neto, Addison Wesley (Hope you have read it
 too). In page 27, they also suggest a scoring funtion
 for vector model based on the correlation between
 query q and document d as follow (I use different
 symbol):

 score_d(d, q) = sum_t( weight_t_d * weight_t_q ) / ( norm_d * norm_q )   (**)

 where weight_t_d = tf_d * idf_t
       weight_t_q = tf_q * idf_t
       norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
       norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )

 Expanding (**):

 score_d(d, q) = sum_t( tf_q * idf_t * tf_d * idf_t ) / ( norm_d * norm_q )   (***)

 The two functions, (*) and (***), have 2 differences:
 1. In (***), the sum_t is just over the numerator, but
 in (*), the sum_t is over everything. So, with
 norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is
 calculated twice. Is this right? Please explain.

 2. There is no factor that defines the norm of the document, norm_d,
 in function (*). Can you explain this? What is the
 role of the factor norm_d_t?

 One more question: could anybody give me documents or
 papers that explain this function in detail, so that when I
 apply Lucene to my system, I can adapt the documents
 and the fields and still receive correct
 scoring information from Lucene?

 Best regard,
 Thanks every body,

 =
 Đặng Nhân



Re: Indexing a large number of DB records

2004-12-14 Thread Homam S.A.
Thanks Otis!

What do you mean by building it in batches? Does it
mean I should close the IndexWriter every 1000 rows
and reopen it? Does that release references to the
document objects so that they can be
garbage-collected?

I'm calling optimize() only at the end.

I agree that 1500 documents is very small. I'm
building the index on a PC with 512 megs, and the
indexing process is quickly gobbling up around 400
megs when I index around 1800 documents, and the whole
machine is grinding to a virtual halt. I'm using the
latest DotLucene .NET port, so maybe there's a memory
leak in it.

I have experience with AltaVista search (acquired by
FastSearch), and I used to call MakeStable() every
20,000 documents to flush memory structures to disk.
There doesn't seem to be an equivalent in Lucene.

-- Homam






--- Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 Hello,
 
 There are a few things you can do:
 
 1) Don't just pull all rows from the DB at once.  Do that in batches.
 
 2) If you can get a Reader from your SqlDataReader, consider this:
 http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
 
 3) Give the JVM more memory to play with by using -Xms and -Xmx JVM
 parameters
 
 4) See IndexWriter's minMergeDocs parameter.
 
 5) Are you calling optimize() at some point by any chance?  Leave that
 call for the end.
 
 1500 documents with 30 columns of short String/number values is not a
 lot.  You may be doing something else, not Lucene-related, that's slowing
 things down.
 
 Otis
 
 
 --- Homam S.A. [EMAIL PROTECTED] wrote:
 
  I'm trying to index a large number of records from the
  DB (a few millions). Each record will be stored as a
  document with about 30 fields, most of them are
  UnStored and represent small strings or numbers. No
  huge DB Text fields.
  
  But I'm running out of memory very fast, and the
  indexing is slowing down to a crawl once I hit around
  1500 records. The problem is each document is holding
  references to the string objects returned from
  ToString() on the DB field, and the IndexWriter is
  holding references to all these document objects in
  memory, so the garbage collector never gets a chance
  to clean these up.
  
  How do you guys go about indexing a large DB table?
  Here's a snippet of my code (this method is called for
  each record in the DB):
  
  private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
      Document doc = new Document();
      for (int i = 0; i < BrowseFieldNames.Length; i++) {
          doc.Add(Field.UnStored(BrowseFieldNames[i],
              rdr.GetValue(i).ToString()));
      }
      iw.AddDocument(doc);
  }
  
  
  
  
  