Re: incremental update of index

2008-11-10 Thread Donna L Gresh
ChadDavis <[EMAIL PROTECTED]> wrote on 11/10/2008 02:22:45 PM: > In the FAQ's it says that you have to do a manual incremental update: > > How do I update a document or a set of documents that are already indexed? > > > > There is no direct update procedure in Lucene. To update an index > > incr

Re: AW: Parsing MSWord

2008-11-12 Thread Donna L Gresh
Check out POI; that's what I use http://poi.apache.org/ "Sertic Mirko, Bedag" <[EMAIL PROTECTED]> wrote on 11/12/2008 03:25:47 AM: > Hi > > You can also use a tool called "antiword" to extract the text from a > .doc file, and then > give the text to lucene. > > See here : http://en.wikipedia

Re: TopDocs - Get all docs?

2008-12-08 Thread Donna L Gresh
I have a need to get the list of all "empid"s (defined by me) in the index so that I can remove the ones that are "stale" by my definition; in this snippet I'm returning all the "empids" for later processing, but the core is very simple. public Vector getIndexIds() throws Exception {

Re: TopDocs - Get all docs?

2008-12-08 Thread Donna L Gresh
Erick- Thanks for the pointer; in my app the difference is between 30 milliseconds and 45 milliseconds (and this is a once-a-day kind of thing), but hey it's always worth doing something the better way in case my index ever gets a whole lot bigger or the use case changes-- thanks. Do

Re: TopDocs - Get all docs?

2008-12-17 Thread Donna L Gresh
Thanks- Yes in my use-case there are never any deleted documents when the search is run- (deletion takes place in a pre-processing stage) Toke Eskildsen wrote on 12/17/2008 08:16:31 AM: > On Mon, 2008-12-08 at 15:17 +0100, Donna L Gresh wrote: > > public Vector getIndexIds

Re: Testing Precision and Recall on Lucene

2009-01-15 Thread Donna L Gresh
I don't think this question makes a whole lot of sense in isolation-- precision and recall is all about the *query* and that is the art of the developer; what is the appropriate query for your particular application. Lucene does just great telling you which documents had which terms and which t

Is there a way for me to handle a multiword synonym correctly?

2009-08-07 Thread Donna L Gresh
I saw some discussion on the board but I'm not sure I've got quite the same problem. As an example, I have a query that might be a technical skill: SAP EM FIN AM I would like that to match a document that has *either* SAP.EM.FIN.AM or "SAP EM FIN AM" (in that order and all together, not spread

RE: Is there a way for me to handle a multiword synonym correctly?

2009-08-07 Thread Donna L Gresh
> This is the same as it will index "SAP EM FIN AM" as long as you break > on whitespace too. I.E SimpleAnalyzer (runs of letter characters are > tokens) > > Then the query for "SAP EM FIN AM" will match both. > > Carl > > > -Original Message

Re: Lucene Search result (scoring )

2007-06-15 Thread Donna L Gresh
Your examples are a little confusing to read. However, I think one thing that you need to know is that the score (by "default") depends on more than just the number of hits. It also depends on the length of the document the hits are in. For example, matching two words in a two-word-long documen
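The length effect mentioned in this reply can be illustrated with a small standalone sketch. It assumes the classic DefaultSimilarity length norm of 1/sqrt(number of terms in the field) and ignores boosts and the lossy one-byte encoding of stored norms. The class and method names are hypothetical and are not part of Lucene.

```java
public class LengthNormSketch {
    // Assumed form of the classic default length norm: 1 / sqrt(numTerms).
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        // The same hit counts for more in a short field than in a long one.
        System.out.println("2-term field:   " + lengthNorm(2));
        System.out.println("200-term field: " + lengthNorm(200));
    }
}
```

Under this assumption, a match in a two-term field is weighted ten times as heavily as the same match in a 200-term field, which is why hit count alone does not determine the ranking.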

question about flush(), optimize(), and deleted documents

2007-07-19 Thread Donna L Gresh
I have run into problems with an error that I am trying to access a deleted document when doing something along the lines below; my brief question is, what is necessary to avoid "seeing" deleted documents? Is an optimize() necessary? Or will a flush() or close() accomplish the same thing? Ind

Re: question about flush(), optimize(), and deleted documents

2007-07-19 Thread Donna L Gresh
oes >not include deletes, but the document() call will retrieve deletes. You >might try using maxDoc() instead of numDocs(). Donna L. Gresh Services Research, Mathematical Sciences Department IBM T.J. Watson Research Center (914) 945-2472 http://www.research.ibm.com/people/g/donnagresh [EMAIL PROTECTED]

special handling of certain terms with embedded periods

2007-08-09 Thread Donna L Gresh
zer(reader))), StandardAnalyzer.STOP_WORDS), engine),"English" ); return result; }

Re: special handling of certain terms with embedded periods

2007-08-09 Thread Donna L Gresh
that *do* appear in my index. But your point about the StandardAnalyzer being slow is well-taken, and I'll keep that in mind. Also, the straightforward substitution before indexing and searching is a reasonable approach to keep in mind. Thanks- Donna Donna L. Gresh Services Research, Ma

Re: How to implement cut of score ?

2007-08-13 Thread Donna L Gresh
would not expect good matches, and vice versa. So it does seem (again anecdotally) that the score has *some* relevance. What are the experts' thoughts on this?

Question about highlighting returning nothing

2007-08-15 Thread Donna L Gresh
I'm working on refining my stopwords by looking at the highest scoring document returned for each search, and using the highlighter to show which terms were significant in choosing that document. This has been extremely helpful in improving my searches. I've noticed though that sometimes the hi

Re: Question about highlighting returning nothing

2007-08-15 Thread Donna L Gresh
Well, in my case the highlighting was returning nothing because of (my favorite acronym) PBCAK-- I don't store the text in the index, so I have to retrieve it separately (from a database) for the highlighting, and my database was not in sync with the index, so in a few cases the document in the

Re: Question about highlighting returning nothing

2007-08-16 Thread Donna L Gresh
ghlighted text. Once I updated my index to be in synch with the database, I no longer had any empty returns from the highlighter. Donna L. Gresh Services Research, Mathematical Sciences Department IBM T.J. Watson Research Center (914) 945-2472 http://www.research.ibm.com/people/g/donnagresh [EMA

tell snowballfilter not to stem certain words?

2007-08-16 Thread Donna L Gresh
Apologies if this is in the FAQ or elsewhere available but I could not find this. Can I provide a list of words that should *not* be stemmed by the SnowballFilter? My analyzer looks like this: analyzer = new StandardAnalyzer(stopwords) { public TokenStream tokenStream(String fieldName, j

MoreLikeThis and stopword stemming

2007-10-10 Thread Donna L Gresh
h is not what I want. Thanks in advance Donna

sanity check on how stemming, stopwords, and snowball analyzer works together

2007-10-15 Thread Donna L Gresh
Could those "in the know" comment on my current understanding of stemming and stopwords using the snowball analyzer? In my application, I am using the MoreLikeThis class to find similar documents to an input "text blob". There are words in the input text blob which are "uninteresting" for my ap

Re: sanity check on how stemming, stopwords, and snowball analyzer works together

2007-10-15 Thread Donna L Gresh
is, only the one provided will be removed, correct? Donna L. Gresh Services Research, Mathematical Sciences Department IBM T.J. Watson Research Center (914) 945-2472 http://www.research.ibm.com/people/g/donnagresh [EMAIL PROTECTED] Mark Miller <[EMAIL PROTECTED]> wrote on 10/15/2007 10:37:

Miles Efron asked about "IncompatibleClassChangeError" last december

2007-10-22 Thread Donna L Gresh
a on the machine: java -version java version "1.4.2" gcj (GCC) 3.4.6 20060404 (Red Hat 3.4.6-3) If anyone has thoughts on what might be the matter, I'd be most grateful. Thanks, -Miles Donna L. Gresh Services Research, Mathematical Sciences Department IBM T.J. Watson Researc

Re: 2/3 of terms matched + coverage filter

2007-10-31 Thread Donna L Gresh
erlap) Implemented as overlap / maxOverlap. Donna L. Gresh Services Research, Mathematical Sciences Department IBM T.J. Watson Research Center (914) 945-2472 http://www.research.ibm.com/people/g/donnagresh [EMAIL PROTECTED] Tobias Hill <[EMAIL PROTECTED]> wrote on 10/31/2007 09:51:12 AM: > My
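The "overlap / maxOverlap" ratio quoted in this reply can be sketched without any Lucene dependency. In the sketch below, `overlap` is taken to be the number of query terms found in a document and `maxOverlap` the number of terms in the query. The class, the `coversAtLeast` helper, and the two-thirds check are hypothetical illustrations of the thread's subject, not Lucene API.

```java
public class CoordSketch {
    // Coordination factor as described in the reply: overlap / maxOverlap.
    public static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Hypothetical coverage check: at least num/den of the query terms matched.
    // Integer arithmetic avoids rounding surprises at the boundary.
    public static boolean coversAtLeast(int overlap, int maxOverlap, int num, int den) {
        return (long) overlap * den >= (long) maxOverlap * num;
    }

    public static void main(String[] args) {
        System.out.println(coord(2, 3));               // fraction of terms matched
        System.out.println(coversAtLeast(2, 3, 2, 3)); // exactly 2 of 3 terms
        System.out.println(coversAtLeast(1, 3, 2, 3)); // only 1 of 3 terms
    }
}
```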

Re: problem understanding the hits.score

2007-11-02 Thread Donna L Gresh
arch/Similarity.html

Re: Scoring for all the documents in the index relative to a query

2007-11-19 Thread Donna L Gresh
I could be mistaken, but I think the earlier answer was right; a document with no terms matching has a score of 0, so you can assume that all documents NOT returned by the query have a score of 0. If you look at the scoring formula on this page, it is hard to see how you can get a negative scor

MoreLikeThis and setBoost

2007-11-20 Thread Donna L Gresh
I've been stepping through the contrib MoreLikeThis class and was wondering if people can give opinions on why you would or would not use setBoost(true) for the MoreLikeThis object. It seems a bit odd (at least to me) to boost the "good" terms in the query (based on the term's score), since won

Re: FSDirectory Again

2007-11-30 Thread Donna L Gresh
IndexWriter's create flag, instead, to create a new index. Here is the code I use Directory directory = FSDirectory.getDirectory( directoryName); IndexWriter iwriter = new IndexWriter(directory, analyzer, reCreate); Donna L. Gresh Services Research, Mathema

RE: How do i get a text summary

2008-02-28 Thread Donna L Gresh
I think you may want to look into the Highlighter. It allows you to show the "relevant" bits of the document which contributed to the document being matched to the query. It does a pretty good job. Of course it does not create a "summary" but it does give you a good idea of why the document was

Re: MultiFieldQueryParser - BooleanClause.Occur

2008-02-29 Thread Donna L Gresh
I believe something like the following will do what you want: QueryParser parserTitle = new QueryParser("title", analyzer); QueryParser parserAuthor = new QueryParser("author", analyzer); BooleanQuery overallquery = new BooleanQuery(); BooleanQuery firstQuery = new BooleanQuery(); Query q1= pars

Re: MultiFieldQueryParser - BooleanClause.Occur

2008-02-29 Thread Donna L Gresh
ary 2008 18:04:47 schreef Donna L Gresh: > > I believe something like the following will do what you want: > > > > QueryParser parserTitle = new QueryParser("title", analyzer); > > QueryParser parserAuthor = new QueryParser("author", analy

C++ as token in StandardAnalyzer?

2008-03-04 Thread Donna L Gresh
I saw some discussion in the archives some time ago about the fact that C++ is tokenized as C in the StandardAnalyzer; this seems to still be the case; I was wondering if there is a simple way for me to get the behavior I want for C++ (that it is tokenized as C++) in particular, and perhaps for

applying patch in Eclipse to get SpanHighlighter functionality

2008-03-05 Thread Donna L Gresh
I have downloaded the Lucene (core, 2.3.1) code and created a project using Eclipse (pointing to src/java) to use it. That works fine, along with the contrib highlighter jar file from the standard distribution. I have also successfully added an additional Eclipse project for the (standard) High

Re: applying patch in Eclipse to get SpanHighlighter functionality

2008-03-05 Thread Donna L Gresh
though you can still use > them with the ignore stuff it has. > > Donna L Gresh wrote: > > I have downloaded the Lucene (core, 2.3.1) code and created a project > > using Eclipse (pointing to src/java) to use it. That works fine, along > > with the contrib highlighte

intuitive explanation for what seems like odd result?

2008-04-01 Thread Donna L Gresh
than the second, but I was wondering if anyone out there can give a simple explanation for why it would differ for these two queries. I use the DefaultSimilarity class. Many thanks in advance-- Donna

Re: intuitive explanation for what seems like odd result?

2008-04-01 Thread Donna L Gresh
.667441 = idf(docFreq=883, numDocs=34610)\n 0.073388316 = queryNorm\n 0.12762533 = (MATCH) fieldWeight(text:soa in 13588), product of:\n 1.0 = tf(termFreq(text:soa)=1)\n 4.667441 = idf(docFreq=883, numDocs=34610)\n 0.02734375 = fieldNorm(field=text, doc=13588)
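The numbers in this explain() fragment can be checked by hand. The sketch below assumes the classic DefaultSimilarity formulas, tf = sqrt(termFreq) and idf = ln(numDocs / (docFreq + 1)) + 1, and takes the fieldNorm value as given in the output. The class and method names are hypothetical and the code does not call Lucene.

```java
public class ExplainSketch {
    // Assumed classic default: tf = sqrt(termFreq).
    public static double tf(double termFreq) {
        return Math.sqrt(termFreq);
    }

    // Assumed classic default: idf = ln(numDocs / (docFreq + 1)) + 1.
    public static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // fieldWeight is the product of tf, idf and the stored field norm.
    public static double fieldWeight(double termFreq, int docFreq, int numDocs, double fieldNorm) {
        return tf(termFreq) * idf(docFreq, numDocs) * fieldNorm;
    }

    public static void main(String[] args) {
        // Values copied from the explain() output in the message above.
        System.out.println("idf         = " + idf(883, 34610));
        System.out.println("fieldWeight = " + fieldWeight(1.0, 883, 34610, 0.02734375));
    }
}
```

With docFreq=883 and numDocs=34610, these formulas give an idf of about 4.66744 and a fieldWeight of about 0.127625, in agreement with the quoted output up to float rounding.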

Re: Adding attribute to index

2008-04-02 Thread Donna L Gresh
ent String emailSender = doc.get("sender"); String emailText = doc.get("emailText"); Donna L. Gresh Services Research, Mathematical Sciences Department IBM T.J. Watson Research Center (914) 945-2472 http://www.research.ibm.com/people/g/donnagresh [EMAIL PROTECTED] "Nitasha W

Re: WildCardQuery and TooManyClauses

2008-04-10 Thread Donna L Gresh
Doesn't the following do what you want with maxnumhits =200? TopDocs td; td = indexSearcher.search(query, filter, maxnumhits); where filter can be null Donna L. Gresh Services Research, Mathematical Sciences Department IBM T.J. Watson Research Center (914

question about getting all terms in a section of the documents

2007-03-19 Thread Donna L Gresh
I am a very new user of Lucene, and thus far am amazed at its speed and ease of use. I have a question about something in the FAQ though. I have a need to get all terms in a specific section of the document; I want to create a database of term vs an identifier of the document containing the term

Obtaining the (indexed) terms in a field in a particular document

2007-03-20 Thread Donna L Gresh
his makes sense; I'm open to suggestions of better ways to do this) Donna L. Gresh

Re: Obtaining the (indexed) terms in a field in a particular document

2007-03-20 Thread Donna L Gresh
r unique id (my_id in your example). Erick On 3/20/07, Erick Erickson <[EMAIL PROTECTED]> wrote: > > You can do a document.get(field), *assuming* you have stored the data > (Field.Store.YES) at index time, although you may not get > stop words. > > On 3/20/07, Donna

normalized scores

2007-03-29 Thread Donna L Gresh
Recent questions about whether/how scores are normalized got me wondering how my application (happily) seems to be doing what I want. I have two indexes, one which contains text fields which I want to use as queries into text fields in a second index. I create a Boolean query based on all the t

Re: normalized scores

2007-03-30 Thread Donna L Gresh
I'm well aware that some queries will return no results due to my filtering by 0.3. That's the point. I expect that some of my input queries will not be a good match to *any* of the documents in my second index. I'm really doing something much like the "Books Like This" example in Chapter 5 of

Re: normalized scores

2007-03-30 Thread Donna L Gresh
Try out the TopDocs returning ones or use a HitCollector.

Re: IndexReader.deleteDocuement(); How to use it with our code??

2007-04-10 Thread Donna L Gresh
but in the other you have indexDir="E:/eclipse/310307/objtest/crawl-result/index/" Notice the difference between "indexes" and "index". The error message is explicitly saying that the index directory does not exist. Donna L. Gresh Services Research, Mathematical Scie

Re: Newbie needs help "addField"

2007-04-18 Thread Donna L Gresh
Parser("text", analyzer); String term = "searchterm"; Query query = parser.parse(term); Hits hits = isearcher.search(query); for (int i=0; i< hits.length(); i++) { Document hitDoc = hits.doc(i); String id = hitDoc.get("id"); } Donna L. Gresh Services Research, Mathemati

MoreLikeThis?

2007-05-22 Thread Donna L Gresh
e. What do I need to do to try this package out? thanks in advance--

Re: MoreLikeThis?

2007-05-23 Thread Donna L Gresh
Thank you-- Donna L. Gresh Services Research, Mathematical Sciences Department IBM T.J. Watson Research Center (914) 945-2472 http://www.research.ibm.com/people/g/donnagresh [EMAIL PROTECTED] Otis Gospodnetic <[EMAIL PROTECTED]> 05/22/2007 05:33 PM Please respond to jav

restricting hits to a subset of "id"s

2007-05-30 Thread Donna L Gresh
eople who are in the original input list is to simply use Lucene as it is, getting all the hits I need, and then only returning out of the application those on the original input list. Does this seem appropriate? Thanks in advance for any pointers-- Donna L. Gresh Services Research, Mathemati
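The post-filtering approach described in this message (run the search as usual, then return only the hits whose ids were on the original input list) might look roughly like the sketch below. It works on plain id strings, so no Lucene calls are involved. The class name, method name, and sample ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PostFilterSketch {
    // Keep only the hits whose id is in the allowed set, preserving rank order.
    public static List<String> keepAllowed(List<String> rankedHitIds, Set<String> allowedIds) {
        List<String> kept = new ArrayList<>();
        for (String id : rankedHitIds) {
            if (allowedIds.contains(id)) {
                kept.add(id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical ids, in the order the search returned them.
        List<String> hits = Arrays.asList("e17", "e03", "e42", "e08");
        Set<String> allowed = new HashSet<>(Arrays.asList("e42", "e17"));
        System.out.println(keepAllowed(hits, allowed));
    }
}
```

A hash set keeps each membership test cheap, so the cost is dominated by the number of hits retrieved. If most hits are discarded, a search-time filter avoids fetching them in the first place.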

Re: restricting hits to a subset of "id"s

2007-05-31 Thread Donna L Gresh
ids, add them to a BitSet, and make >a Filter with that. >You might want to check out CachingWrapperFilter or QueryFilter too.

Re: boosting different parts of the same field

2007-05-31 Thread Donna L Gresh
something like below would >work maybe this is a silly question but why not create a title field and a description field and boost them separately? Donna L. Gresh Services Research, Mathematical Sciences Department IBM T.J. Watson Research Center (914) 945-2472 http://www.research.ibm.

Re: Indexing MSword Documents

2007-06-08 Thread Donna L Gresh
Index.TOKENIZED); doc.add(textField);

Re: uncorrect results

2010-11-17 Thread Donna L Gresh
use Lucene 3.0.2 and the OpenJDK Runtime Environment (IcedTea6 > 1.8.2) on a 64-bit Linux machine.

Re: please help

2011-07-21 Thread Donna L Gresh
This is really not the forum for questions like this (which are not related to Lucene but rather to Java) but for a very simple checklist of what you need, try this: http://download.oracle.com/javase/tutorial/getStarted/cupojava/win32.html But I ask that any further questions which are purely j

Re: Using Lucene to match document sets to each other

2011-12-16 Thread Donna L Gresh
the stuff that doesn't change much, and use the things that are constantly changing as the query. Donna L. Gresh Business Analytics and Mathematical Sciences IBM T.J. Watson Research Center (914) 945-2472 https://researcher.ibm.com/researcher/view.php?person=us-gresh gr...@us.ibm.com F