ChadDavis <[EMAIL PROTECTED]> wrote on 11/10/2008 02:22:45 PM:
> In the FAQ it says that you have to do a manual incremental update:
>
> > How do I update a document or a set of documents that are already indexed?
> >
> > There is no direct update procedure in Lucene. To update an index incr
Check out POI; that's what I use
http://poi.apache.org/
"Sertic Mirko, Bedag" <[EMAIL PROTECTED]> wrote on 11/12/2008 03:25:47 AM:
> Hi
>
> You can also use a tool called "antiword" to extract the text from a
> .doc file, and then give the text to Lucene.
>
> See here: http://en.wikipedia
I have a need to get the list of all "empid" values (defined by me) in the
index so that I can remove the ones that are "stale" by my definition. In
this snippet I'm returning all the "empid"s for later processing, but the
core is very simple.
public Vector getIndexIds() throws Exception {
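The method above is cut off by the archive; a hedged sketch of what it could look like against the Lucene 2.x API in use on this list, enumerating the terms of the "empid" field (the field name comes from the post; the index path and use of Vector are assumptions):

```java
// Sketch only: enumerate all values of the "empid" field in a Lucene 2.x index.
import java.util.Vector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public Vector getIndexIds() throws Exception {
    Vector ids = new Vector();
    IndexReader reader = IndexReader.open("/path/to/index"); // path is an assumption
    TermEnum terms = reader.terms(new Term("empid", ""));
    try {
        // terms() positions at the first term >= ("empid", ""); walk while still in that field
        while (terms.term() != null && terms.term().field().equals("empid")) {
            ids.add(terms.term().text());
            if (!terms.next()) break;
        }
    } finally {
        terms.close();
        reader.close();
    }
    return ids;
}
```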
Erick-
Thanks for the pointer; in my app the difference is between 30 milliseconds
and 45 milliseconds (and this is a once-a-day kind of thing), but it's
always worth doing something the better way in case my index ever gets a
whole lot bigger or the use case changes. Thanks.
Thanks-
Yes, in my use case there are never any deleted documents when the search
is run (deletion takes place in a pre-processing stage).
Toke Eskildsen wrote on 12/17/2008 08:16:31 AM:
> On Mon, 2008-12-08 at 15:17 +0100, Donna L Gresh wrote:
> > public Vector getIndexIds
I don't think this question makes a whole lot of sense in isolation;
precision and recall are all about the *query*, and that is the art of the
developer: what is the appropriate query for your particular application?
Lucene does just great telling you which documents had which terms and
which t
I saw some discussion on the board but I'm not sure I've got quite the
same problem. As an example, I have a query that might be a technical
skill:
SAP EM FIN AM
I would like that to match a document that has *either* SAP.EM.FIN.AM or
"SAP EM FIN AM" (in that order and all together, not spread
> This is the same as it will index "SAP EM FIN AM" as long as you break
> on whitespace too, i.e. SimpleAnalyzer (runs of letter characters are
> tokens).
>
> Then the query for "SAP EM FIN AM" will match both.
>
> Carl
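A minimal illustration of Carl's suggestion, using SimpleAnalyzer from the Lucene 2.x line (the field name and sample text are made up for the example):

```java
// Sketch: SimpleAnalyzer emits runs of letters as lowercased tokens, so
// "SAP.EM.FIN.AM" and "SAP EM FIN AM" both tokenize to [sap, em, fin, am];
// a phrase query for either form then matches documents containing the other.
import java.io.StringReader;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

TokenStream ts = new SimpleAnalyzer().tokenStream("skill",
        new StringReader("SAP.EM.FIN.AM"));
Token t;
while ((t = ts.next()) != null) {
    System.out.println(t.termText()); // should print sap, em, fin, am, one per line
}
```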
Your examples are a little confusing to read. However, I think one thing
that you need to know is that the score (by "default") depends on more
than just the number of hits. It also depends on the length of the
document the hits are in. For example, matching two words in a
two-word-long documen
I have run into problems with an error that I am trying to access a
deleted document when doing something along the lines below; my brief
question is, what is necessary to avoid "seeing" deleted documents? Is an
optimize() necessary? Or will a flush() or close() accomplish the same
thing?
> ...does not include deletes, but the document() call will retrieve deletes.
> You might try using maxDoc() instead of numDocs().
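Putting the quoted advice together, a sketch for the Lucene 2.x API: iterate up to maxDoc() and skip deleted slots explicitly, rather than relying on optimize() (the directory variable and what you do with each document are assumptions):

```java
// Sketch: walk every document slot, skipping deletions (Lucene 2.x API).
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

IndexReader reader = IndexReader.open(directory);
for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i)) continue; // deleted slots remain until segments merge
    Document doc = reader.document(i); // safe now: slot i is live
    // ... process doc ...
}
reader.close();
```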
Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
[EMAIL PROTECTED]
zer(reader))),
StandardAnalyzer.STOP_WORDS),
engine),"English"
);
return result;
}
that *do* appear in my index.
But your point about the StandardAnalyzer being slow is
well-taken, and I'll keep that in mind. Also, the straightforward
substitution before indexing and searching is a reasonable
approach to keep in mind.
Thanks-
Donna
would not expect good matches, and vice versa. So it does seem
(again anecdotally) that the score has
*some* relevance. What are the experts' thoughts on this?
I'm working on refining my stopwords by looking at the highest scoring
document returned for each search, and using the highlighter to show which
terms were significant in choosing that document. This has been extremely
helpful in improving my searches. I've noticed though that sometimes the
hi
Well, in my case the highlighting was returning nothing because of (my
favorite acronym) PBCAK--
I don't store the text in the index, so I have to retrieve it separately
(from a database) for the highlighting, and my database was not in sync
with the index, so in a few cases the document in the
ghlighted text. Once I updated my index to be in sync with the database,
I no longer had any empty returns from the highlighter.
Apologies if this is in the FAQ or elsewhere available, but I could not
find it.
Can I provide a list of words that should *not* be stemmed by the
SnowballFilter? My analyzer looks like this:
analyzer = new StandardAnalyzer(stopwords) {
public TokenStream tokenStream(String fieldName, j
h is not what I want.
Thanks in advance
Donna
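Lucene 2.x has no built-in "protected words" option for SnowballFilter (much later versions add KeywordMarkerFilter for this). One workaround is a custom filter that stems only unprotected tokens; a hedged sketch using the old Token API and the bundled Snowball stemmer classes (the class name, constructor, and protected-word set are all assumptions):

```java
// Sketch: stem with Snowball unless the term is in a protected set.
// ProtectedSnowballFilter is a made-up name; the SnowballProgram package
// varies by contrib version (net.sf.snowball in the 2.x era).
import java.io.IOException;
import java.util.Set;
import net.sf.snowball.SnowballProgram;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class ProtectedSnowballFilter extends TokenFilter {
    private final SnowballProgram stemmer;
    private final Set protectedWords;

    public ProtectedSnowballFilter(TokenStream in, SnowballProgram stemmer,
                                   Set protectedWords) {
        super(in);
        this.stemmer = stemmer;
        this.protectedWords = protectedWords;
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token == null || protectedWords.contains(token.termText()))
            return token; // pass protected words through unstemmed
        stemmer.setCurrent(token.termText());
        stemmer.stem();
        return new Token(stemmer.getCurrent(), token.startOffset(),
                token.endOffset(), token.type());
    }
}
```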
Could those "in the know" comment on my current understanding of stemming
and stopwords using the snowball analyzer?
In my application, I am using the MoreLikeThis class to find similar
documents to an input "text blob". There are words in the input text blob
which are "uninteresting" for my ap
is, only the one provided will be removed,
correct?
Mark Miller <[EMAIL PROTECTED]> wrote on 10/15/2007 10:37:
a on the machine:
java -version
java version "1.4.2"
gcj (GCC) 3.4.6 20060404 (Red Hat 3.4.6-3)
If anyone has thoughts on what might be the matter, I'd be most grateful.
Thanks,
-Miles
erlap)
Implemented as overlap / maxOverlap.
Tobias Hill <[EMAIL PROTECTED]> wrote on 10/31/2007 09:51:12 AM:
> My
arch/Similarity.html
I could be mistaken, but I think the earlier answer was right; a document
with no terms matching has a score of 0, so you can assume that all
documents NOT returned by the query have a score of 0. If you look at the
scoring formula on this page, it is hard to see how you can get a negative
score.
I've been stepping through the contrib MoreLikeThis class and was
wondering if people can give opinions on why you would or would not use
setBoost(true) for the MoreLikeThis object. It seems a bit odd (at least
to me) to boost the "good" terms in the query (based on the term's score),
since won
IndexWriter's create flag, instead, to create a
new index.
Here is the code I use
    Directory directory = FSDirectory.getDirectory(directoryName);
    IndexWriter iwriter = new IndexWriter(directory, analyzer, reCreate);
I think you may want to look into the Highlighter. It allows you to show
the "relevant" bits of the document which contributed to the document
being matched to the query. It does a pretty good job. Of course it does
not create a "summary" but it does give you a good idea of why the
document was
I believe something like the following will do what you want:
QueryParser parserTitle = new QueryParser("title", analyzer);
QueryParser parserAuthor = new QueryParser("author", analyzer);
BooleanQuery overallquery = new BooleanQuery();
BooleanQuery firstQuery = new BooleanQuery();
Query q1= pars
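The snippet is cut off by the archive; a hedged completion of the pattern it starts, for the Lucene 2.x API (the query strings and the choice of MUST/SHOULD occurrences are assumptions about the intent):

```java
// Sketch: combine per-field parsed queries with BooleanQuery (Lucene 2.x).
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

Query q1 = parserTitle.parse("some title terms");   // example input
Query q2 = parserAuthor.parse("some author terms"); // example input
BooleanQuery firstQuery = new BooleanQuery();
firstQuery.add(q1, BooleanClause.Occur.SHOULD); // match on title...
firstQuery.add(q2, BooleanClause.Occur.SHOULD); // ...or on author
BooleanQuery overallquery = new BooleanQuery();
overallquery.add(firstQuery, BooleanClause.Occur.MUST);
```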
ary 2008 18:04:47, Donna L Gresh wrote:
> > I believe something like the following will do what you want:
> >
> > QueryParser parserTitle = new QueryParser("title", analyzer);
> > QueryParser parserAuthor = new QueryParser("author", analy
I saw some discussion in the archives some time ago about the fact that
C++ is tokenized as C in the StandardAnalyzer; this seems to still be the
case; I was wondering if there is a simple way for me to get the behavior
I want for C++ (that it is tokenized as C++) in particular, and perhaps
for
I have downloaded the Lucene (core, 2.3.1) code and created a project
using Eclipse (pointing to src/java) to use it. That works fine, along
with the contrib highlighter jar file from the standard distribution.
I have also successfully added an additional Eclipse project for the
(standard) High
though you can still use
> them with the ignore stuff it has.
>
> Donna L Gresh wrote:
> > I have downloaded the Lucene (core, 2.3.1) code and created a project
> > using Eclipse (pointing to src/java) to use it. That works fine, along
> > with the contrib highlighte
than the second, but I
was wondering if anyone out there can give a simple explanation for why it
would differ for these two queries. I use the DefaultSimilarity class.
Many thanks in advance--
Donna
    4.667441 = idf(docFreq=883, numDocs=34610)
    0.073388316 = queryNorm
0.12762533 = (MATCH) fieldWeight(text:soa in 13588), product of:
    1.0 = tf(termFreq(text:soa)=1)
    4.667441 = idf(docFreq=883, numDocs=34610)
    0.02734375 = fieldNorm(field=text, doc=13588)
ent
String emailSender = doc.get("sender");
String emailText = doc.get("emailText");
"Nitasha W
Doesn't the following do what you want, with maxnumhits = 200?
    TopDocs td = indexSearcher.search(query, filter, maxnumhits);
where filter can be null.
I am a very new user of Lucene, and thus far am amazed at its speed and
ease of use.
I have a question about something in the FAQ though. I have a need to get
all terms in a specific section of the document; I want to create a
database of term vs. an identifier of the document containing the term
his makes sense; I welcome suggestions of better ways to do this)
r unique id (my_id in
your example).
Erick
On 3/20/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> You can do a document.get(field), *assuming* you have stored the data
> (Field.Store.YES) at index time, although you may not get
> stop words.
>
> On 3/20/07, Donna
Recent questions about whether/how scores are normalized got me wondering
how my application (happily) seems to be doing what I want. I have two
indexes, one of which contains text fields that I want to use as queries
into text fields in a second index.
I create a Boolean query based on all the t
I'm well aware that some queries will return no results due to my
filtering by 0.3. That's the point. I expect that some of my input queries
will not be a good match to *any* of the documents in my second index.
I'm really doing something much like the "Books Like This" example in
Chapter 5 of
Try out the TopDocs returning ones or use a HitCollector.
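For the HitCollector route mentioned above, a minimal Lucene 2.x sketch (the searcher and query variables are assumed; what you do with each hit is application-specific):

```java
// Sketch: receive every matching internal doc id and raw score via a
// HitCollector callback (Lucene 2.x API); no score normalization happens here.
import org.apache.lucene.search.HitCollector;

searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        // called once per matching document
        System.out.println(doc + " " + score);
    }
});
```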
but in the other you have
indexDir="E:/eclipse/310307/objtest/crawl-result/index/"
Notice the difference between "indexes" and "index".
The error message is explicitly saying that the index directory does not
exist.
Parser("text", analyzer);
    String term = "searchterm";
    Query query = parser.parse(term);
    Hits hits = isearcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Document hitDoc = hits.doc(i);
        String id = hitDoc.get("id");
    }
e.
What do I need to do to try this package out?
thanks in advance--
Thank you--
Otis Gospodnetic <[EMAIL PROTECTED]> wrote on 05/22/2007 05:33 PM:
eople who are in the original input list is to simply use Lucene as it is,
getting all the hits I need, and then only returning out of the
application those on the original input list. Does this seem appropriate?
Thanks in advance for any pointers--
ids, add them to a BitSet, and make
>a Filter with that.
>You might want to check out CachingWrapperFilter or QueryFilter too.
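The quoted suggestion, sketched against the Lucene 2.x Filter API (which returned a BitSet per reader); the allowedDocIds array of internal doc ids is an assumption for illustration:

```java
// Sketch: restrict a search to a known set of internal doc ids via a Filter.
// In Lucene 2.x, Filter.bits(reader) returns a BitSet sized to maxDoc().
import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;

Filter idFilter = new Filter() {
    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (int id : allowedDocIds) // allowedDocIds: assumed int[] of internal ids
            bits.set(id);
        return bits;
    }
};
Hits hits = searcher.search(query, idFilter);
```

Wrapping this in a CachingWrapperFilter, as the quote suggests, avoids rebuilding the BitSet on every search against the same reader.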
something like below would
>work
Maybe this is a silly question, but why not create a title field and a
description field and boost them separately?
Index.TOKENIZED);
doc.add(textField);
use Lucene 3.0.2 and the OpenJDK Runtime Environment (IcedTea6
> 1.8.2) on an 64 bit Linux machine.
This is really not the forum for questions like this (which are not
related to Lucene but rather to Java) but for a very simple checklist of
what you need, try this:
http://download.oracle.com/javase/tutorial/getStarted/cupojava/win32.html
But I ask that any further questions which are purely j
the stuff that doesn't change much, and use the things that are
constantly changing as the query.
Donna L. Gresh
Business Analytics and Mathematical Sciences
IBM T.J. Watson Research Center
(914) 945-2472
https://researcher.ibm.com/researcher/view.php?person=us-gresh
gr...@us.ibm.com