File locking using java.nio.channels.FileLock

2004-12-15 Thread John Wang
Hi:

  When is Lucene planning on moving toward java 1.4+?

   I see there are some problems caused by the current lock file
implementation, e.g. Bug #32171. These would be easily fixed by
using the java.nio.channels.FileLock object.
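
For reference, the kind of JDK 1.4 primitive being suggested looks roughly
like this (a sketch only, not Lucene code; the class and method names are
made up):

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class NioLockSketch {
    // tryLock() returns null right away if another process already holds
    // the lock, instead of relying on the existence of a lock file on disk.
    public static boolean tryIndexLock(File lockFile) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
        FileChannel channel = raf.getChannel();
        FileLock lock = channel.tryLock();
        if (lock == null) {
            raf.close();
            return false;        // somebody else owns the lock
        }
        lock.release();          // real code would hold it while using the index
        raf.close();
        return true;
    }
}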

Thanks

-John



RE: A question about scoring function in Lucene

2004-12-15 Thread Chuck Williams
I'll try to address all the comments here.

The normalization I proposed a while back on lucene-dev is specified.
Its properties can be analyzed, so there is no reason to guess about
them.

Re. Hoss's example and analysis, yes, I believe it can be demonstrated
that the proposed normalization would make certain absolute statements
like x and y meaningful.  However, it is not a panacea -- there would be
some limitations in these statements.

To see what could be said meaningfully, it is necessary to recall a
couple detailed aspects of the proposal:
  1.  The normalization would not change the ranking order or the ratios
among scores in a single result set from what they are now.  Only two
things change:  the query normalization constant, and the ad hoc final
normalization in Hits is eliminated because the scores are intrinsically
between 0 and 1.  Another way to look at this is that the sole purpose
of the normalization is to set the score of the highest-scoring result.
Once this score is set, all the other scores are determined since the
ratios of their scores to that of the top-scoring result do not change
from today.  Put simply, Hoss's explanation is correct.
  2.  There are multiple ways to normalize and achieve property 1.  One
simple approach is to set the top score based on the boost-weighted
percentage of query terms it matches (assuming, for simplicity, the
query is an OR-type BooleanQuery).  So if all boosts are the same, the
top score is the percentage of query terms matched.  If there are
boosts, then these cause the terms to have a corresponding relative
importance in the determination of this percentage.

More complex normalization schemes would go further and allow the tf's
and/or idf's to play a role in the determination of the top score -- I
didn't specify details here and am not sure how good a thing that would
be to do.  So, for now, let's just consider the properties of the simple
boost-weighted-query-term percentage normalization.
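
To make the simple scheme concrete, here is a minimal sketch of that
boost-weighted percentage (illustrative only, not Lucene code; the boosts
and matched arrays are hypothetical inputs describing the query terms):

// For an OR-type BooleanQuery: the top document's score becomes the
// boost-weighted fraction of the query terms it matches.
public static float topScoreNormalization(float[] boosts, boolean[] matched) {
    float total = 0.0f;   // sum of boosts over all query terms
    float hit = 0.0f;     // sum of boosts over the terms that matched
    for (int i = 0; i < boosts.length; i++) {
        total += boosts[i];
        if (matched[i]) hit += boosts[i];
    }
    // With equal boosts this reduces to the plain percentage of terms matched.
    return total == 0.0f ? 0.0f : hit / total;
}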

Hoss's example could be interpreted as single-term phrases "Doug
Cutting" and "Chris Hostetter", or as two-term BooleanQuery's.
Considering both of these cases illustrates the absolute-statement
properties and limitations of the proposed normalization.

If single-term PhraseQuery's, then the top score will always be 1.0
assuming the phrase matches (while the other results have arbitrary
fractional scores based on the tfidf ratios as today).  If the queries
are BooleanQuery's with no boosts, then the top score would be 1.0 or
0.5 depending on whether two terms or only one term matched.  This is
meaningful.

In Lucene today, the top score is not meaningful.  It will always be 1.0
if the highest intrinsic score is >= 1.0.  I believe this could happen,
for example, in a two-term BooleanQuery that matches only one term (if
the tf on the matched document for that term is high enough).

So, to be concrete, a score of 1.0 with the proposed normalization
scheme would mean that all query terms are matched, while today a score
of 1.0 doesn't really tell you anything.  Certain absolute statements
can therefore be made with the new scheme.  This makes the
absolute-threshold monitored search application possible, along with the
segregating and filtering applications I've previously mentioned (call
out good results and filter out bad results by using absolute
thresholds).

These analyses are simplified by using only BooleanQuery's, but I
believe the properties carry over generally.

Doug also asked about research results.  I don't know of published
research on this topic, but I can again repeat an experience from
InQuira.  We found that end users benefited from a search experience
where good results were called out and bad results were downplayed or
filtered out.  And we managed to achieve this with absolute thresholding
through careful normalization (of a much more complex scoring
mechanism).  To get a better intuitive feel for this, think about how you
react to a search where all the results suck, but there is no visual
indication of this that is any different from a search that returns
great results.

Otis raised the patch I submitted for MultiSearcher.  This addresses a
related problem, in that the current MultiSearcher does not rank results
equivalently to a single unified index -- specifically it fails Daniel
Naber's test case.  However, this is just a simple bug whose fix doesn't
require the new normalization.  I submitted a patch to fix that bug,
along with a caveat that I'm not sure the patch is complete, or even
consistent with the intentions of the author of this mechanism.

I'm glad to see this topic is generating some interest, and apologize if
anything I've said comes across as overly abrasive.  I use and really
like Lucene.  I put a lot of focus on creating a great experience for
the end user, and so am perhaps more concerned about quality of results
and certain UI aspects than most other users.

Chuck

  > -Original Message-
  > From: Doug Cutting [mailto:[EMAI

Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Daniel Naber
On Wednesday 15 December 2004 21:14, Mike Snare wrote:

> Also, the phrase query
> would place the same value on a doc that simply had the two words as a
> doc that had the hyphenated version, wouldn't it?  This seems odd.

Not if these words are spelling variations of the same concept, which 
doesn't seem unlikely.

> In addition, why do we assume that a-1 is a "typical product name" but
> a-b isn't?

Maybe for "a-b", but what about English words like "half-baked"?

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Chris Hostetter wrote:
For example, using the current scoring equation, if i do a search for
"Doug Cutting" and the results/scores i get back are...
  1:   0.9
  2:   0.3
  3:   0.21
  4:   0.21
  5:   0.1
...then there are at least two meaningful pieces of data I can glean:
   a) document #1 is significantly better than the other results
   b) document #3 and #4 are both equally relevant to "Doug Cutting"
If I then do a search for "Chris Hostetter" and get back the following
results/scores...
  9:   0.9
  8:   0.3
  7:   0.21
  6:   0.21
  5:   0.1
...then I can assume the same corresponding information is true about my
new search term (#9 is significantly better, and #7/#8 are equally as good)
However, I *cannot* say either of the following:
  x) document #9 is as relevant for "Chris Hostetter" as document #1 is
 relevant to "Doug Cutting"
  y) document #5 is equally relevant to both "Chris Hostetter" and
 "Doug Cutting"
That's right.  Thanks for the nice description of the issue.
I think the OP is arguing that if the scoring algorithm was modified in
the way they suggested, then you would be able to make statements x & y.
And I am not convinced that, with the changes Chuck describes, one can 
be any more confident of x and y.

Doug


Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Chris Hostetter
: a-1 is considered a typical product name that needs to be unchanged
: (there's a comment in the source that mentions this). Indexing
: "hyphen-word" as two tokens has the advantage that it can then be found
: with the following queries:
: hyphen-word (will be turned into a phrase query internally)
: "hyphen word" (phrase query)
: (it cannot be found searching for hyphenword, however).

This isn't an area of Lucene that I've had a chance to investigate much
yet, but if I recall from my reading, Lucene allows you to place multiple
token sequences at the same position, generating something more easily
described as a "token graph" than a "token stream" .. correct?

so given an input "the quick-brown fox jumped over the a-1 sauce"
the tokenizer could generate a token stream that looks like...

"the"
"quick" "brown"  OR  "quick-brown"  OR  "quickbrown"
"fox"
"jumped"
"over"
"the"
"a" "1"  OR  "a-1"  OR  "a1"
"sauce"

...at which point, a minimum 2 character word length filter, and stop
words filter could (if you wanted to use them) reduce that to...

"quick" "brown"  OR  "quick-brown"  OR  "quickbrown"
"fox"
"jumped"
"over"
"a-1"  OR  "a1"
"sauce"

allowing all of these future (phrase) searches to match...

the quick brown fox jumped over the a1 sauce
the quickbrown fox jumped over the a1 sauce
the quick-brown fox jumped over the a1 sauce
the quick brown fox jumped over the a 1 sauce
the quickbrown fox jumped over the a 1 sauce
the quick-brown fox jumped over the a 1 sauce
the quick brown fox jumped over the a-1 sauce
the quickbrown fox jumped over the a-1 sauce
the quick-brown fox jumped over the a-1 sauce


...correct? or am I misunderstanding?
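
In other words, something like this hypothetical filter, which re-emits an
un-hyphenated variant at the same position as the original token (a sketch
against the 1.4-era token API, assuming Token.setPositionIncrement is
available; it is not part of Lucene):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class HyphenVariantFilter extends TokenFilter {
    private Token pending;   // variant token queued for the next call

    public HyphenVariantFilter(TokenStream in) {
        super(in);
    }

    public Token next() throws IOException {
        if (pending != null) {           // emit the queued variant first
            Token t = pending;
            pending = null;
            return t;
        }
        Token t = input.next();
        if (t == null) return null;
        String text = t.termText();
        if (text.indexOf('-') >= 0) {
            // queue e.g. "quickbrown" at the same position as "quick-brown"
            Token variant = new Token(text.replaceAll("-", ""),
                                      t.startOffset(), t.endOffset());
            variant.setPositionIncrement(0);
            pending = variant;
        }
        return t;
    }
}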

-Hoss





Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Erik Hatcher
On Dec 15, 2004, at 3:14 PM, Mike Snare wrote:
[...]
In addition, why do we assume that a-1 is a "typical product name" but
a-b isn't?
I am in no way second-guessing or suggesting a change; it just doesn't
make sense to me, and I'm trying to understand.  It is very likely, as
is oft the case, that this is just one of those things one has to
accept.
It is one of those things we have to accept... or in this case write 
our own analyzer.  An Analyzer is a very special and custom choice.  
StandardAnalyzer is a general purpose one, but quite insufficient in 
many cases.  Like QueryParser.  We're lucky to have these kitchen-sink 
pieces in Lucene to get us going quickly, but digging deeper we often 
need custom solutions.

I'm working on indexing the e-book of Lucene in Action.  I'll blog up 
the details of this in the near future as case-study material, but 
here's the short version...

I got the PDF file, ran pdftotext on it.  Many words are split across 
lines with a hyphen.  Often these pieces should be combined with the 
hyphen removed.  Sometimes, though, these words are to be split.  The 
scenario is different than yours, because I want the hyphens gone - 
though sometimes they are a separator and sometimes they should be 
removed.  It depends.  I wrote a custom analyzer with several custom 
filters in the pipeline... dashes are originally kept in the stream, 
and a later filter combines two tokens and looks it up in an exception 
list and either combines it or leaves it separate.  StandardAnalyzer 
would have wreaked havoc.
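
A skeleton of that kind of pipeline, just to show its shape (the
HyphenJoinFilter named in the comment is a made-up placeholder and the
tokenizer choice is arbitrary):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class HyphenAwareAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // keep dashes in the raw token stream...
        TokenStream stream = new WhitespaceTokenizer(reader);
        // ...then a custom filter would join "fil-" + "ter" across line
        // breaks, consulting an exception list to decide whether the hyphen
        // stays, goes, or separates.  (Placeholder, not a real class.)
        // stream = new HyphenJoinFilter(stream, exceptionList);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);
        return stream;
    }
}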

The results of my work will soon be available to all to poke at, but 
for now a screenshot is all I have public:

http://www.lucenebook.com
Erik


Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Otis Gospodnetic wrote:
There is one case that I can think of where this 'constant' scoring
would be useful, and I think Chuck already mentioned this 1-2 months
ago.  For instance, having such scores would allow one to create alert
applications where queries run by some scheduler would trigger an alert
whenever the score is > X.  So that is where the absolute value of the
score would be useful.
Right, but the question is, would a single score threshold be effective 
for all queries, or would one need a separate score threshold for each 
query?  My hunch is that the latter is better, regardless of the scoring 
algorithm.

Also, just because Lucene's default scoring does not guarantee scores 
between zero and one does not necessarily mean that these scores are 
less "meaningful".

Doug


Re: A question about scoring function in Lucene

2004-12-15 Thread Otis Gospodnetic
There is one case that I can think of where this 'constant' scoring
would be useful, and I think Chuck already mentioned this 1-2 months
ago.  For instance, having such scores would allow one to create alert
applications where queries run by some scheduler would trigger an alert
whenever the score is > X.  So that is where the absolute value of the
score would be useful.

I believe Chuck submitted some code that fixes this, which also helps
with MultiSearcher, where you have to have this constant score in order
to properly order hits from different Searchers, but I didn't dare to
touch that code without further studying, for which I didn't have time.

Otis


--- Doug Cutting <[EMAIL PROTECTED]> wrote:

> Chuck Williams wrote:
> > I believe the biggest problem with Lucene's approach relative to
> the pure vector space model is that Lucene does not properly
> normalize.  The pure vector space model implements a cosine in the
> strictly positive sector of the coordinate space.  This is guaranteed
> intrinsically to be between 0 and 1, and produces scores that can be
> compared across distinct queries (i.e., "0.8" means something about
> the result quality independent of the query).
> 
> I question whether such scores are more meaningful.  Yes, such scores
> 
> would be guaranteed to be between zero and one, but would 0.8 really
> be 
> meaningful?  I don't think so.  Do you have pointers to research
> which 
> demonstrates this?  E.g., when such a scoring method is used, that 
> thresholding by score is useful across queries?
> 
> Doug
> 





Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Mike Snare
> a-1 is considered a typical product name that needs to be unchanged
> (there's a comment in the source that mentions this). Indexing
> "hyphen-word" as two tokens has the advantage that it can then be found
> with the following queries:
> hyphen-word (will be turned into a phrase query internally)
> "hyphen word" (phrase query)
> (it cannot be found searching for hyphenword, however).

Sure.  But phrase queries are slower than a single word query.  In my
case, using the standard analyzer prior to my modification caused a
single (hyphenated) word query to take upwards of 10 seconds (1M+
documents with ~400K terms).  The exact same search with the new
Analyzer takes <.5 seconds (granted the new tokenization caused a
significant reduction in the number of terms).  Also, the phrase query
would place the same value on a doc that simply had the two words as a
doc that had the hyphenated version, wouldn't it?  This seems odd.

In addition, why do we assume that a-1 is a "typical product name" but
a-b isn't?

I am in no way second-guessing or suggesting a change; it just doesn't
make sense to me, and I'm trying to understand.  It is very likely, as
is oft the case, that this is just one of those things one has to
accept.




Re: A question about scoring function in Lucene

2004-12-15 Thread Chris Hostetter
: I question whether such scores are more meaningful.  Yes, such scores
: would be guaranteed to be between zero and one, but would 0.8 really be
: meaningful?  I don't think so.  Do you have pointers to research which
: demonstrates this?  E.g., when such a scoring method is used, that
: thresholding by score is useful across queries?

I freely admit that I'm way out of my league on these scoring discussions,
but I believe what the OP was refering to was not any intrinsic benefit in
having a score between 0 and 1, but of having a uniform normalization of
scores regardless of search terms.

For example, using the current scoring equation, if i do a search for
"Doug Cutting" and the results/scores i get back are...
  1:   0.9
  2:   0.3
  3:   0.21
  4:   0.21
  5:   0.1
...then there are at least two meaningful pieces of data I can glean:
   a) document #1 is significantly better than the other results
   b) document #3 and #4 are both equally relevant to "Doug Cutting"

If I then do a search for "Chris Hostetter" and get back the following
results/scores...
  9:   0.9
  8:   0.3
  7:   0.21
  6:   0.21
  5:   0.1

...then I can assume the same corresponding information is true about my
new search term (#9 is significantly better, and #7/#8 are equally as good)

However, I *cannot* say either of the following:
  x) document #9 is as relevant for "Chris Hostetter" as document #1 is
 relevant to "Doug Cutting"
  y) document #5 is equally relevant to both "Chris Hostetter" and
 "Doug Cutting"


I think the OP is arguing that if the scoring algorithm was modified in
the way they suggested, then you would be able to make statements x & y.

If they are correct, then I for one can see a definite benefit in that.
If for no other reason then in making minimum score thresholds more
meaningful.



-Hoss





Re: Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Daniel Naber
On Wednesday 15 December 2004 19:29, Mike Snare wrote:

> In my case, the words are keywords that must remain as is, searchable
> with the hyphen in place.  It was easy enough to modify the tokenizer
> to do what I need, so I'm not really asking for help there.  I'm
> really just curious as to why it is that "a-1" is considered a single
> token, but "a-b" is split.

a-1 is considered a typical product name that needs to be unchanged 
(there's a comment in the source that mentions this). Indexing 
"hyphen-word" as two tokens has the advantage that it can then be found 
with the following queries:
hyphen-word (will be turned into a phrase query internally)
"hyphen word" (phrase query)
(it cannot be found searching for hyphenword, however).
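
This is easy to see by printing the parsed query, e.g. with the 1.4 static
parse method (a small sketch; the field name and analyzer are arbitrary):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class HyphenQueryDemo {
    public static void main(String[] args) throws Exception {
        Query q = QueryParser.parse("hyphen-word", "contents", new StandardAnalyzer());
        // StandardAnalyzer splits the term, so this should print something
        // like the phrase query contents:"hyphen word"
        System.out.println(q.toString("contents"));
    }
}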

Regards
 Daniel

-- 
http://www.danielnaber.de




Re: TFIDF Implementation

2004-12-15 Thread David Spencer
Christoph Kiefer wrote:
David, Bruce, Otis,
Thank you all for the quick replies. I looked through the BooksLikeThis
example. I also agree, it's a very good and effective way to find
similar docs in the index. Nevertheless, what I need is really a
similarity matrix holding all TF*IDF values. For illustration I wrote a
quick-and-dirty class to perform that task. It uses the Jama.Matrix
class to represent the similarity matrix at the moment. For show and
tell I attached it to this email.
Unfortunately it doesn't perform very well. My index stores about 25000
docs with a total of 75000 terms. The similarity matrix is very sparse
but nevertheless needs about 1'875'000'000 entries!!! I think this
current implementation will not be useable in this way. I also think I'll
switch to JMP (http://www.math.uib.no/~bjornoh/mtj/) for that reason.
What do you think?
I don't have any deep thoughts, just a few questions/ideas...
[1] TFIDFMatrix, FeatureVectorSimilarityMeasure, and CosineMeasure are 
your classes, right? They are not in the mail, but presumably the source 
isn't needed.

[2] Does the problem boil down to this line and the memory usage?
double [][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments];
Thus using a sparse matrix would be a win, and so would using floats 
instead of doubles?

[3] Prob minor, but in getTFIDFMatrix() you might be able to ignore stop 
words, as you do so later in getSimilarity().

[4] You can also consider using Colt possibly even JUNG:
http://www-itg.lbl.gov/~hoschek/colt/api/cern/colt/matrix/impl/SparseDoubleMatrix2D.html
http://jung.sourceforge.net/doc/api/index.html
[5]
Related to #2, can you precalc the matrix and store it on disk, or is 
your index too dynamic?

[6] Also, in similar kinds of calculations I've seen code that filters 
out low frequency terms e.g. ignore all terms that don't occur in at 
least 5 docs.
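
To make [6] concrete, a method sketch along these lines (which could sit 
next to getTFIDFMatrix(); the threshold is arbitrary) would decide which 
terms get a row at all:

// Count the terms that occur in at least minDocs documents; only those
// would get rows in the TF*IDF matrix.  Purely illustrative.
public static int countFrequentTerms(IndexReader reader, int minDocs) throws IOException {
    int kept = 0;
    TermEnum terms = reader.terms();
    while (terms.next()) {
        if (terms.docFreq() >= minDocs) {
            kept++;
        }
    }
    terms.close();
    return kept;
}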

-- Dave

Best,
Christoph


/*
 * Created on Dec 14, 2004
 */
package ch.unizh.ifi.ddis.simpack.measure.featurevectors;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import Jama.Matrix;
/**
 * @author Christoph Kiefer
 */
public class TFIDF_Lucene extends FeatureVectorSimilarityMeasure {

private File indexDir = null;
private File dataDir = null;
private String target = "";
private String query = "";
private int targetDocumentNumber = -1;
private final String ME = this.getClass().getName();
private int fileCounter = 0;

public TFIDF_Lucene( String indexDir, String dataDir, String target, String query ) {
this.indexDir = new File(indexDir);
this.dataDir = new File(dataDir);
this.target = target;
this.query = query;
}

public String getName() {
return "TFIDF_Lucene_Similarity_Measure";
}

private void makeIndex() {
try {
IndexWriter writer = new IndexWriter(indexDir, new SnowballAnalyzer( "English", StopAnalyzer.ENGLISH_STOP_WORDS ), false);
indexDirectory(writer, dataDir);
writer.optimize();
writer.close();
} catch (Exception ex) {
ex.printStackTrace();
}
}

private void indexDirectory(IndexWriter writer, File dir) {
File[] files = dir.listFiles();
for (int i=0; i < files.length; i++) {
File f = files[i];
if (f.isDirectory()) {
indexDirectory(writer, f);  // recurse
} else if (f.getName().endsWith(".txt")) {
indexFile(writer, f);
}
}
}

private void indexFile(IndexWriter writer, File f) {
try {
System.out.println( "Indexing " + f.getName() + ", " + 
(fileCounter++) );
String name = f.getCanonicalPath();
//System.out.println(name);
Document doc = new Document();
doc.add( Field.Text( "contents", new FileReader(f), 
true ) );
writer.addDocument( doc );
 

Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Chuck Williams wrote:
I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize.  The pure vector space model implements a cosine in the strictly positive sector of the coordinate space.  This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., "0.8" means something about the result quality independent of the query).
I question whether such scores are more meaningful.  Yes, such scores 
would be guaranteed to be between zero and one, but would 0.8 really be 
meaningful?  I don't think so.  Do you have pointers to research which 
demonstrates this?  E.g., when such a scoring method is used, that 
thresholding by score is useful across queries?

Doug


Re: C# Ports

2004-12-15 Thread Ben Litchfield


I have created a DLL from the lucene jars for use in the PDFBox project.
It uses IKVM(http://www.ikvm.net) to create a DLL from a jar.

The binary version can be found here
http://www.csh.rit.edu/~ben/projects/pdfbox/nightly-release/PDFBox-.NET-0.7.0-dev.zip

This includes the ant script used to create the DLL files.

This method is by far the easiest way to port it, see previous posts about
advantages and disadvantages.

Ben


On Wed, 15 Dec 2004, Garrett Heaver wrote:

> I was just wondering what tools (JLCA?) people are using to port Lucene to
> c# as I'd be well interested in converting things like snowball stemmers,
> wordnet etc.
>
>
>
> Thanks
>
> Garrett
>
>




RE: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Note that this really includes some extra steps.
You don't need a temp index.  Add everything to a single index using a
single IndexWriter instance.  No need to call addIndexes nor optimize
until the end.  Adding Documents to an index takes a constant amount of
time, regardless of the index size, because new segments are created as
documents are added, and existing segments don't need to be updated
(only when merges happen).  Again, I'd run your app under a profiler to
see where the time and memory are going.
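
In code, the shape being described is roughly this (a Java sketch; the 
ResultSet/field-name plumbing is only a stand-in for whatever DB access 
is actually used):

import java.io.IOException;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BulkIndexer {
    // One IndexWriter for the whole run, no temporary index,
    // and optimize() only once at the very end.
    public static void indexAllRows(String indexPath, String[] fieldNames, ResultSet rs)
            throws IOException, SQLException {
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
        while (rs.next()) {
            Document doc = new Document();
            for (int i = 0; i < fieldNames.length; i++) {
                // JDBC columns are 1-based
                doc.add(Field.UnStored(fieldNames[i], rs.getString(i + 1)));
            }
            writer.addDocument(doc);  // new segments are created and merged as needed
        }
        writer.optimize();
        writer.close();
    }
}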

Otis

--- Garrett Heaver <[EMAIL PROTECTED]> wrote:

> Hi Homan
> 
> I had a similar problem as you in that I was indexing A LOT of data
> 
> Essentially how I got round it was to batch the index.
> 
> What I was doing was to add 10,000 documents to a temporary index,
> use
> addIndexes() to merge to temporary index into the live index (which
> also
> optimizes the live index) then delete the temporary index. On the
> next loop
> I'd only query rows from the db above the id in the maxdoc of the
> live index
> and set the max rows of the query to 10,000
> i.e
> 
> SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from
> Index.MaxDoc()} ORDER BY [id_field] ASC
> 
> Ensuring that the documents go into the index sequentially your
> problem is
> solved and memory usage on mine (dotlucene 1.3) is low
> 
> Regards
> Garrett
> 
> -Original Message-
> From: Homam S.A. [mailto:[EMAIL PROTECTED] 
> Sent: 15 December 2004 02:43
> To: Lucene Users List
> Subject: Indexing a large number of DB records
> 
> I'm trying to index a large number of records from the
> DB (a few millions). Each record will be stored as a
> document with about 30 fields, most of them are
> UnStored and represent small strings or numbers. No
> huge DB Text fields.
> 
> But I'm running out of memory very fast, and the
> indexing is slowing down to a crawl once I hit around
> 1500 records. The problem is each document is holding
> references to the string objects returned from
> ToString() on the DB field, and the IndexWriter is
> holding references to all these document objects in
> memory, so the garbage collector isn't getting a chance
> to clean these up.
> 
> How do you guys go about indexing a large DB table?
> Here's a snippet of my code (this method is called for
> each record in the DB):
> 
> private void IndexRow(SqlDataReader rdr, IndexWriter
> iw) {
>   Document doc = new Document();
>   for (int i = 0; i < BrowseFieldNames.Length; i++) {
>   doc.Add(Field.UnStored(BrowseFieldNames[i],
> rdr.GetValue(i).ToString()));
>   }
>   iw.AddDocument(doc);
> }
> 
> 
> 
> 





Why does the StandardTokenizer split hyphenated words?

2004-12-15 Thread Mike Snare
I am writing a tool that uses lucene, and I immediately ran into a
problem searching for words that contain internal hyphens (dashes).
After looking at the StandardTokenizer, I saw that it was because
there is no rule that will match such a word.  Based on what I can tell
from the source, every other term in a word containing any of the
following (.,/-_) must contain at least one digit.

I was wondering if someone could shed some light on why it was deemed
necessary to prevent indexing a word like 'word-with-hyphen' without
first splitting it into its constituent parts.  The only reason I can
think of (and the only one I've found) is to handle hyphenated words
at line breaks, although my first thought would be that this would be
undesired behavior, since a word that was broken due to a line break
should actually be reconstructed, and not split.

In my case, the words are keywords that must remain as is, searchable
with the hyphen in place.  It was easy enough to modify the tokenizer
to do what I need, so I'm not really asking for help there.  I'm
really just curious as to why it is that "a-1" is considered a single
token, but "a-b" is split.

Anyone care to elaborate?

Thanks,
-Mike



RE: A question about scoring function in Lucene

2004-12-15 Thread Chuck Williams
Nhan,

You are correct that dropping the document norm does cause Lucene's scoring 
model to deviate from the pure vector space model.  However, including norm_d 
would cause other problems -- e.g., with short queries, as are typical in 
reality, the resulting scores with norm_d would all be extremely small.  You 
are also correct that since norm_q is invariant, it does not affect relevance 
ranking.  Norm_q is simply part of the normalization of final scores.  There 
are many different formulas for scoring and relevance ranking in IR.  All of 
these have some intuitive justification, but in the end can only be evaluated 
empirically.  There is no "correct" formula.

I believe the biggest problem with Lucene's approach relative to the pure 
vector space model is that Lucene does not properly normalize.  The pure vector 
space model implements a cosine in the strictly positive sector of the 
coordinate space.  This is guaranteed intrinsically to be between 0 and 1, and 
produces scores that can be compared across distinct queries (i.e., "0.8" means 
something about the result quality independent of the query).

Lucene does not have this property.  Its formula produces scores of arbitrary 
magnitude depending on the query.  The results cannot be compared meaningfully 
across queries; i.e., "0.8" means nothing intrinsically.  To keep final scores 
between 0 and 1, Lucene introduces an ad hoc query-dependent final 
normalization in Hits:  viz., it divides all scores by the highest score if the 
highest score happens to be greater than 1.  This makes it impossible for an 
application to properly inform its users about the quality of the results, to 
cut off bad results, etc.  Applications may do that, but in fact what they are 
doing is random, not what they think they are doing.
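
Roughly speaking, that final step amounts to the following (a paraphrase in 
code of the behavior just described, not the actual Hits source):

// If the best raw score exceeds 1, every score is divided by it, so the
// reported top score collapses to 1.0 no matter how well the query matched.
public static float[] normalizeLikeHits(float[] rawScores) {
    float norm = (rawScores.length > 0 && rawScores[0] > 1.0f)
            ? 1.0f / rawScores[0]    // rawScores assumed sorted descending
            : 1.0f;
    float[] out = new float[rawScores.length];
    for (int i = 0; i < rawScores.length; i++) {
        out[i] = rawScores[i] * norm;
    }
    return out;
}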

I've proposed a fix for this -- there was a long thread on Lucene-dev.  It is 
possible to revise Lucene's scoring to keep its efficiency, keep its current 
per-query relevance ranking, and yet intrinsically normalize its scores so that 
they are meaningful across queries.  I posted a fairly detailed spec of how to 
do this in the Lucene-dev thread.  I'm hoping to have time to build it and 
submit it as a proposed update to Lucene, but it is a large effort that would 
involve changing just about every scoring class in Lucene.  I'm not sure it 
would be incorporated even if I did it as that would take considerable work 
from a developer.  There doesn't seem to be much concern about these various 
scoring and relevancy ranking issues among the general Lucene community.

Chuck

  > -Original Message-
  > From: Nhan Nguyen Dang [mailto:[EMAIL PROTECTED]
  > Sent: Wednesday, December 15, 2004 1:18 AM
  > To: Lucene Users List
  > Subject: RE: A question about scoring function in Lucene
  > 
  > Thanks for your answer,
  > In the Lucene scoring function, they use only norm_q,
  > but for one query, norm_q is the same for all
  > documents.
  > So norm_q actually does not affect the score.
  > But norm_d is different, each document has a different
  > norm_d; it affects the score of document d for query q.
  > If you drop it, the score information is not correct
  > anymore, or it is not the vector space model anymore.  Could
  > you explain it a little bit?
  > 
  > I think that it's expensive to compute in incremental
  > indexing because when one document is added, the idf of
  > each term changes. But dropping it is not a good choice.
  > 
  > What is the role of norm_d_t ?
  > Nhan.
  > 
  > --- Chuck Williams <[EMAIL PROTECTED]> wrote:
  > 
  > > Nhan,
  > >
  > > Re.  your two differences:
  > >
  > > 1 is not a difference.  Norm_d and Norm_q are both
  > > independent of t, so summing over t has no effect on
  > > them.  I.e., Norm_d * Norm_q is constant wrt the
  > > summation, so it doesn't matter if the sum is over
  > > just the numerator or over the entire fraction, the
  > > result is the same.
  > >
  > > 2 is a difference.  Lucene uses Norm_q instead of
  > > Norm_d because Norm_d is too expensive to compute,
  > > especially in the presence of incremental indexing.
  > > E.g., adding or deleting any document changes the
  > > idf's, so if Norm_d was used it would have to be
  > > recomputed for ALL documents.  This is not feasible.
  > >
  > > Another point you did not mention is that the idf
  > > term is squared (in both of your formulas).  Salton,
  > > the originator of the vector space model, dropped
  > > one idf factor from his formula as it improved
  > > results empirically.  More recent theoretical
  > > justifications of tf*idf provide intuitive
  > > explanations of why idf should only be included
  > > linearly.  tf is best thought of as the real vector
  > > entry, while idf is a weighting term on the
  > > components of the inner product.  E.g., see the
  > > excellent paper by Robertson, "Understanding inverse
  > > document frequency: on theoretical arguments for
  > > IDF", available here:
  > > http://www.emeraldinsight.

RE: Indexing a large number of DB records

2004-12-15 Thread Garrett Heaver
Hi Homan

I had a similar problem as you in that I was indexing A LOT of data

Essentially how I got round it was to batch the index.

What I was doing was to add 10,000 documents to a temporary index, use
addIndexes() to merge to temporary index into the live index (which also
optimizes the live index) then delete the temporary index. On the next loop
I'd only query rows from the db above the id in the maxdoc of the live index
and set the max rows of the query to 10,000
i.e

SELECT TOP 10000 [fields] FROM [tables] WHERE [id_field] > {ID from
Index.MaxDoc()} ORDER BY [id_field] ASC

Ensuring that the documents go into the index sequentially your problem is
solved and memory usage on mine (dotlucene 1.3) is low

Regards
Garrett

-Original Message-
From: Homam S.A. [mailto:[EMAIL PROTECTED] 
Sent: 15 December 2004 02:43
To: Lucene Users List
Subject: Indexing a large number of DB records

I'm trying to index a large number of records from the
DB (a few millions). Each record will be stored as a
document with about 30 fields, most of them are
UnStored and represent small strings or numbers. No
huge DB Text fields.

But I'm running out of memory very fast, and the
indexing is slowing down to a crawl once I hit around
1500 records. The problem is each document is holding
references to the string objects returned from
ToString() on the DB field, and the IndexWriter is
holding references to all these document objects in
memory, so the garbage collector isn't getting a chance
to clean these up.

How do you guys go about indexing a large DB table?
Here's a snippet of my code (this method is called for
each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter
iw) {
Document doc = new Document();
for (int i = 0; i < BrowseFieldNames.Length; i++) {
doc.Add(Field.UnStored(BrowseFieldNames[i],
rdr.GetValue(i).ToString()));
}
iw.AddDocument(doc);
}








Re: Indexing a large number of DB records

2004-12-15 Thread Otis Gospodnetic
Hello Homam,

The batches I was referring to were batches of DB rows.
Instead of SELECT * FROM table... do SELECT * FROM table ... OFFSET=X
LIMIT=Y.

Don't close IndexWriter - use the single instance.

There is no MakeStable()-like method in Lucene, but you can control the
number of in-memory Documents, the frequency of segment merges, and the
maximal size of index segments with 3 IndexWriter parameters,
described fairly verbosely in the javadocs.
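
For reference, in the Java version those three knobs look roughly like this 
(1.4-era public fields on IndexWriter; the values are only examples, so 
check the javadocs):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class WriterTuning {
    public static IndexWriter openTunedWriter(String path) throws IOException {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        writer.minMergeDocs = 100;     // Documents buffered in memory before a segment is flushed
        writer.mergeFactor = 10;       // how many segments get merged at a time
        writer.maxMergeDocs = 100000;  // upper bound on the size of a merged segment
        return writer;
    }
}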

Since you are using the .Net version, you should really consult
dotLucene guy(s).  Running under the profiler should also tell you
where the time and memory go.

Otis

--- "Homam S.A." <[EMAIL PROTECTED]> wrote:

> Thanks Otis!
> 
> What do you mean by building it in batches? Does it
> mean I should close the IndexWriter every 1000 rows
> and reopen it? Does that releases references to the
> document objects so that they can be
> garbage-collected?
> 
> I'm calling optimize() only at the end.
> 
> I agree that 1500 documents is very small. I'm
> building the index on a PC with 512 megs, and the
> indexing process is quickly gobbling up around 400
> megs when I index around 1800 documents and the whole
> machine is grinding to a virtual halt. I'm using the
> latest DotLucene .NET port, so maybe there's a memory
> leak in it.
> 
> I have experience with AltaVista search (acquired by
> FastSearch), and I used to call MakeStable() every
> 20,000 documents to flush memory structures to disk.
> There doesn't seem to be an equivalent in Lucene.
> 
> -- Homam
> 
> 
> 
> 
> 
> 
> --- Otis Gospodnetic <[EMAIL PROTECTED]>
> wrote:
> 
> > Hello,
> > 
> > There are a few things you can do:
> > 
> > 1) Don't just pull all rows from the DB at once.  Do
> > that in batches.
> > 
> > 2) If you can get a Reader from your SqlDataReader,
> > consider this:
> >
>
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)
> > 
> > 3) Give the JVM more memory to play with by using
> > -Xms and -Xmx JVM
> > parameters
> > 
> > 4) See IndexWriter's minMergeDocs parameter.
> > 
> > 5) Are you calling optimize() at some point by any
> > chance?  Leave that
> > call for the end.
> > 
> > 1500 documents with 30 columns of short
> > String/number values is not a
> > lot.  You may be doing something else not Lucene
> > related that's slowing
> > things down.
> > 
> > Otis
> > 
> > 
> > --- "Homam S.A." <[EMAIL PROTECTED]> wrote:
> > 
> > > I'm trying to index a large number of records from
> > the
> > > DB (a few millions). Each record will be stored as
> > a
> > > document with about 30 fields, most of them are
> > > UnStored and represent small strings or numbers.
> > No
> > > huge DB Text fields.
> > > 
> > > But I'm running out of memory very fast, and the
> > > indexing is slowing down to a crawl once I hit
> > around
> > > 1500 records. The problem is each document is
> > holding
> > > references to the string objects returned from
> > > ToString() on the DB field, and the IndexWriter is
> > > holding references to all these document objects
> > in
> > > memory, so the garbage collector is getting a
> > chance
> > > to clean these up.
> > > 
> > > How do you guys go about indexing a large DB
> > table?
> > > Here's a snippet of my code (this method is called
> > for
> > > each record in the DB):
> > > 
> > > private void IndexRow(SqlDataReader rdr,
> > IndexWriter
> > > iw) {
> > >   Document doc = new Document();
> > >   for (int i = 0; i < BrowseFieldNames.Length; i++)
> > {
> > >   doc.Add(Field.UnStored(BrowseFieldNames[i],
> > > rdr.GetValue(i).ToString()));
> > >   }
> > >   iw.AddDocument(doc);
> > > }
> > > 
> > > 
> > > 
> > > 





RE: C# Ports

2004-12-15 Thread George Aroush
Hi Garrett,

If you are referring to dotLucene
(http://sourceforge.net/projects/dotlucene/) then I can tell you how -- not
too long ago I posted on this list how I ported 1.4 and 1.4.3 to C#, please
search the list for the answer -- you can't just use JLCA.

As for the snowball, I have already started work on it.  The port is done,
but I have to test, etc. and I am too tied up right now with my work.
However, I plan to release it before end of this month, so if you can wait,
do wait, otherwise feel free to take the steps that I did to port Lucene to
C#.

Regards,

-- George Aroush

-Original Message-
From: Garrett Heaver [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, December 15, 2004 5:58 AM
To: [EMAIL PROTECTED]
Subject: C# Ports

I was just wondering what tools (JLCA?) people are using to port Lucene to
c# as I'd be well interested in converting things like snowball stemmers,
wordnet etc.

 

Thanks

Garrett





C# Ports

2004-12-15 Thread Garrett Heaver
I was just wondering what tools (JLCA?) people are using to port Lucene to
c# as I'd be well interested in converting things like snowball stemmers,
wordnet etc.

 

Thanks

Garrett



RE: A question about scoring function in Lucene

2004-12-15 Thread Nhan Nguyen Dang
Thanks for your answer,
In the Lucene scoring function, they use only norm_q,
but for one query, norm_q is the same for all
documents.
So norm_q actually does not affect the score.
But norm_d is different, each document has a different
norm_d; it affects the score of document d for query q.
If you drop it, the score information is not correct
anymore, or it is not the vector space model anymore.  Could
you explain it a little bit?

I think that it's expensive to compute in incremental
indexing because when one document is added, the idf of
each term changes. But dropping it is not a good choice.

What is the role of norm_d_t ?
Nhan.

--- Chuck Williams <[EMAIL PROTECTED]> wrote:

> Nhan,
> 
> Re.  your two differences:
> 
> 1 is not a difference.  Norm_d and Norm_q are both
> independent of t, so summing over t has no effect on
> them.  I.e., Norm_d * Norm_q is constant wrt the
> summation, so it doesn't matter if the sum is over
> just the numerator or over the entire fraction, the
> result is the same.
> 
> 2 is a difference.  Lucene uses Norm_q instead of
> Norm_d because Norm_d is too expensive to compute,
> especially in the presence of incremental indexing. 
> E.g., adding or deleting any document changes the
> idf's, so if Norm_d was used it would have to be
> recomputed for ALL documents.  This is not feasible.
> 
> Another point you did not mention is that the idf
> term is squared (in both of your formulas).  Salton,
> the originator of the vector space model, dropped
> one idf factor from his formula as it improved
> results empirically.  More recent theoretical
> justifications of tf*idf provide intuitive
> explanations of why idf should only be included
> linearly.  tf is best thought of as the real vector
> entry, while idf is a weighting term on the
> components of the inner product.  E.g., see the
> excellent paper by Robertson, "Understanding inverse
> document frequency: on theoretical arguments for
> IDF", available here: 
> http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl
> if you sign up for an eval.
> 
> It's easy to correct for idf^2 by using a custom
> Similarity that takes a final square root.
> 
> Chuck
> 
>   > -Original Message-
>   > From: Vikas Gupta [mailto:[EMAIL PROTECTED]
>   > Sent: Tuesday, December 14, 2004 9:32 PM
>   > To: Lucene Users List
>   > Subject: Re: A question about scoring function
> in Lucene
>   > 
>   > Lucene uses the vector space model. To
> understand that:
>   > 
>   > -Read section 2.1 of "Space optimizations for
> Total Ranking" paper
>   > (Linked
>   > here
> http://lucene.sourceforge.net/publications.html)
>   > -Read section 6 to 6.4 of
>   >
>
http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
>   > -Read section 1 of
>   >
>
http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps
>   > 
>   > Vikas
>   > 
>   > On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:
>   > 
>   > > Hi all,
>   > > Lucene score document based on the correlation
> between
>   > > the query q and document t:
>   > > (this is raw function, I don't pay attention
> to the
>   > > boost_t, coord_q_d factor)
>   > >
  > > score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )  (*)
>   > >
>   > > Could anybody explain it in detail ? Or are
> there any
>   > > papers, documents about this function ?
> Because:
>   > >
>   > > I have also read the book: Modern Information
>   > > Retrieval, author: Ricardo Baeza-Yates and
> Berthier
>   > > Ribeiro-Neto, Addison Wesley (Hope you have
> read it
>   > > too). In page 27, they also suggest a scoring
> funtion
>   > > for vector model based on the correlation
> between
>   > > query q and document d as follow (I use
> different
>   > > symbol):
>   > >
  > >                      sum_t( weight_t_d * weight_t_q )
  > > score_d(d, q) =  -------------------------------------  (**)
  > >                             norm_d * norm_q
  > >
  > > where weight_t_d = tf_d * idf_t
  > >       weight_t_q = tf_q * idf_t
  > >       norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
  > >       norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )
  > >
  > > (**):                sum_t( tf_q*idf_t * tf_d*idf_t )
  > > score_d(d, q) =  -------------------------------------  (***)
  > >                             norm_d * norm_q
>   > >
>   > > The two function, (*) and (***), have 2
> differences:
>   > > 1. in (***), the sum_t is just for the
> numerator but
>   > > in the (*), the sum_t is for everything. So,
> with
>   > > norm_q = sqrt(sum_t((tf_q*idf_t)^2)); sum_t is
>   > > calculated twice. Is this right? please
> explain.
>   > >
>   > > 2. No factor that define norms of the
> document: norm_d
>   > > in the function (*). Can you explain this.
> what is the
>   > > role of factor norm_d_t ?
>   > >
>   > > One more question: could anybody give me
> documents,
>   > > papers that explain this function in detail.
> so when I
>   > > apply Lucene for my system, I can adapt the
> document,
>   > > and the field so that I still receive

Re: LUCENE1.4.1 - LUCENE1.4.2 - LUCENE1.4.3 Exception

2004-12-15 Thread Nader Henein
This is an OS file system error, not a Lucene issue (not for this board).
Google it for Gentoo specifically and you get a whole bunch of results, one
of which is this thread on the Gentoo Forums:
http://forums.gentoo.org/viewtopic.php?t=9620

Good Luck
Nader Henein
Karthik N S wrote:
Hi Guys
Could somebody tell me why I am getting this exception, please?
Sys Specifications
O/s Linux Gentoo
Appserver Apache Tomcat/4.1.24
Jdk build 1.4.2_03-b02
Lucene 1.4.1, 1.4.2, 1.4.3
Note: - This Exception is displayed on Every 2nd Query after Tomcat is
started
java.io.IOException: Stale NFS file handle
   at java.io.RandomAccessFile.readBytes(Native Method)
   at java.io.RandomAccessFile.read(RandomAccessFile.java:307)
   at
org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:420)
   at
org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
   at
org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(Compou
ndFileReader.java:220)
   at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
   at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
   at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
   at
org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:142)
   at
org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
   at
org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:143)
   at
org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:137)
   at
org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:253)
   at
org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:69)
   at org.apache.lucene.search.Similarity.idf(Similarity.java:255)
   at
org.apache.lucene.search.TermQuery$TermWeight.sumOfSquaredWeights(TermQuery.
java:47)
   at org.apache.lucene.search.Query.weight(Query.java:86)
   at
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
   at
org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:
251)


 WITH WARM REGARDS
 HAVE A NICE DAY
 [ N.S.KARTHIK]


 



Re: LUCENE1.4.1 - LUCENE1.4.2 - LUCENE1.4.3 Exception

2004-12-15 Thread John Haxby
Karthik N S wrote:
java.io.IOException: Stale NFS file handle
 

You have a file system NFS mounted on this machine, but the machine 
hosting the real file system has no knowledge of your mount.  This often 
happens after the host machine has had a reboot.

Solution: unmount (and possibly re-mount) the failing NFS file system.   
If you're not sure which one it is, try looking at a file on each NFS 
file system with, say, "cat" or "wc" and see if you get a stale NFS 
handle error.   You may need "umount -f" to unmount the failing file 
system.   Sometimes, very occasionally, you'll have to resort to a reboot.

jch


Re: TFIDF Implementation

2004-12-15 Thread Christoph Kiefer
David, Bruce, Otis,
Thank you all for the quick replies. I looked through the BooksLikeThis
example. I also agree, it's a very good and effective way to find
similar docs in the index. Nevertheless, what I need is really a
similarity matrix holding all TF*IDF values. For illustration I wrote a
quick-and-dirty class to perform that task. It uses the Jama.Matrix
class to represent the similarity matrix at the moment. For show and
tell I attached it to this email.
Unfortunately it doesn't perform very well. My index stores about 25000
docs with a total of 75000 terms. The similarity matrix is very sparse
but nevertheless needs about 1'875'000'000 entries!!! I think this
current implementation will not be useable in this way. I also think I'll
switch to JMP (http://www.math.uib.no/~bjornoh/mtj/) for that reason.

What do you think?

Best,
Christoph

-- 
Christoph Kiefer

Department of Informatics, University of Zurich

Office: Uni Irchel 27-K-32
Phone:  +41 (0) 44 / 635 67 26
Email:  [EMAIL PROTECTED]
Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
/*
 * Created on Dec 14, 2004
 */
package ch.unizh.ifi.ddis.simpack.measure.featurevectors;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import Jama.Matrix;

/**
 * @author Christoph Kiefer
 */
public class TFIDF_Lucene extends FeatureVectorSimilarityMeasure {
	
	private File indexDir = null;
	private File dataDir = null;
	private String target = "";
	private String query = "";
	private int targetDocumentNumber = -1;
	private final String ME = this.getClass().getName();
	private int fileCounter = 0;
	
	public TFIDF_Lucene( String indexDir, String dataDir, String target, String query ) {
		this.indexDir = new File(indexDir);
		this.dataDir = new File(dataDir);
		this.target = target;
		this.query = query;
	}
	
	public String getName() {
		return "TFIDF_Lucene_Similarity_Measure";
	}
	
	private void makeIndex() {
		try {
			IndexWriter writer = new IndexWriter(indexDir, new SnowballAnalyzer( "English", StopAnalyzer.ENGLISH_STOP_WORDS ), false);
			indexDirectory(writer, dataDir);
			writer.optimize();
			writer.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
	
	private void indexDirectory(IndexWriter writer, File dir) {
		File[] files = dir.listFiles();
		for (int i=0; i < files.length; i++) {
			File f = files[i];
			if (f.isDirectory()) {
indexDirectory(writer, f);  // recurse
			} else if (f.getName().endsWith(".txt")) {
indexFile(writer, f);
			}
		}
	}
	
	private void indexFile(IndexWriter writer, File f) {
		try {
			System.out.println( "Indexing " + f.getName() + ", " + (fileCounter++) );
			String name = f.getCanonicalPath();
			//System.out.println(name);
			Document doc = new Document();
			doc.add( Field.Text( "contents", new FileReader(f), true ) );
			writer.addDocument( doc );
			
			if ( name.matches( dataDir + "/" + target + ".txt" ) ) {
targetDocumentNumber = writer.docCount();
			}
			
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
	
	public Matrix getTFIDFMatrix(File indexDir) throws IOException {
		Directory fsDir = FSDirectory.getDirectory( indexDir, false );
		IndexReader reader = IndexReader.open( fsDir );
		
		int numberOfTerms = 0;
		int numberOfDocuments = reader.numDocs();
		
		TermEnum allTerms = reader.terms();
		for ( ; allTerms.next(); ) {
			allTerms.term();
			numberOfTerms++;
		}
		
		System.out.println( "Total number of terms in index is " + numberOfTerms );
		System.out.println( "Total number of documents in index is " + numberOfDocuments );
		
		double [][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments];
		
		for ( int i = 0; i < numberOfTerms; i++ ) {
			for ( int j = 0; j < numberOfDocuments; j++ ) {
TFIDFMatrix[i][j] = 0.0;
			}
		}
		
		allTerms = reader.terms();
		for ( int i = 0 ; allTerms.next(); i++ ) {
			
			Term term = allTerms.term();
			TermDocs td = reader.termDocs(term);
			for ( ; td.next(); ) {
TFIDFMatrix[i][td.doc()] = td.freq();
			}
			
		}
		
		allTerms = reader.terms();
		for ( int i = 0 ; allTerms.next(); i++ ) {
			for ( int j = 0; j < numberOfDocuments; j++ ) {
double tf = TFIDFMatrix[i][j];
double docFreq = (double)allTerms.docFreq();
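// dividing by ln(10) = 2.30258509299405 turns the natural log into a base-10 logarithm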
double idf = ( Math.log( (double)numberOfDocuments / docFreq ) ) / 2.30258509299405;
//System.out.println( "Term: " + i + " Document " + j + " TF/DocFreq/IDF: " + tf + " " + docFreq + " " + idf);
TFIDFMatrix[i][j] = tf * idf;