Re: Analyzing Advice

2005-02-18 Thread Steven Rowe
Luke Shannon wrote:
But now that I'm looking at the API I'm not sure I can specify a
different analyzer when creating a field.
Is PerFieldAnalyzerWrapper what you're looking for?
URL:http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/PerFieldAnalyzerWrapper.html
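For illustration, here is a minimal sketch of wiring it up against the Lucene 1.4-era API; the field name "category" and the choice of WhitespaceAnalyzer are invented for the example:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerSetup {
    public static Analyzer buildAnalyzer() {
        // StandardAnalyzer handles every field by default...
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        // ...except the hypothetical "category" field, which is
        // only split on whitespace.
        wrapper.addAnalyzer("category", new WhitespaceAnalyzer());
        return wrapper;
    }
}
```

Pass the same wrapper to both IndexWriter and QueryParser so indexing and searching analyze each field identically.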
Steve
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Problem searching Field.Keyword field

2005-02-08 Thread Steven Rowe
Why is there no KeywordAnalyzer?  That is, an analyzer which doesn't 
mess with its input in any way, but just returns it as-is?

I realize that under most circumstances, it would probably be more code 
to use it than just constructing a TermQuery, but having it would 
regularize query handling, and simplify new users' experience.  And for 
the purposes of the PerFieldAnalyzerWrapper, it could be helpful.
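Such an analyzer could look roughly like this against the 1.4-era TokenStream API (an untested sketch; no class by this name shipped with Lucene at the time):

```java
import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Emits the entire field value as one token, untouched.
public class KeywordAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, final Reader reader) {
        return new TokenStream() {
            private boolean done = false;
            public Token next() throws IOException {
                if (done) return null;
                done = true;
                // Read the whole input into a single token.
                StringBuffer text = new StringBuffer();
                char[] buffer = new char[256];
                int length;
                while ((length = reader.read(buffer)) > 0) {
                    text.append(buffer, 0, length);
                }
                return new Token(text.toString(), 0, text.length());
            }
        };
    }
}
```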

Steve
Erik Hatcher wrote:
Kelvin - I respectfully disagree - could you elaborate on why this is 
not an appropriate use of Field.Keyword?

If the category is "How To", Field.Text would split this (depending on 
the Analyzer) into "how" and "to".

If the user is selecting a category from a drop-down, though, you 
shouldn't be using QueryParser on it, but instead aggregating a 
TermQuery(category, "How To") into a BooleanQuery with the rest of 
it.  The rest may be other API-created clauses and likely a piece from 
QueryParser.
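The aggregation Erik describes can be sketched like this (Lucene 1.4-era API, including the old BooleanQuery.add(query, required, prohibited) signature; the field names "category" and "contents" are invented):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CategorySearch {
    public static Query buildQuery(String userInput) throws ParseException {
        BooleanQuery combined = new BooleanQuery();
        // Exact, unanalyzed match on the drop-down value:
        combined.add(new TermQuery(new Term("category", "How To")),
                     true, false);   // required, not prohibited
        // The free-text part still goes through QueryParser:
        combined.add(QueryParser.parse(userInput, "contents",
                                       new StandardAnalyzer()),
                     true, false);
        return combined;
    }
}
```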

Erik
On Feb 8, 2005, at 11:28 AM, Kelvin Tan wrote:
As I posted previously, Field.Keyword is appropriate in only certain 
situations. For your use-case, I believe Field.Text is more suitable.

k
On Tue, 8 Feb 2005 10:02:19 -0600, Mike Miller wrote:
 This may or may not be correct, but I am indexing it as a keyword
 because I provide a (required) radio button on the add screen for
 the user to determine which category the document should be
 assigned.  Then in the search, provide a dropdown that can be used
 in the advanced search so that they can search only for a specific
 category of documents (like HowTo, Troubleshooting, etc).
 -Original Message-
 From: Kelvin Tan [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, February 08, 2005 9:32 AM
 To: Lucene Users List
 Subject: RE: Problem searching Field.Keyword field
 Mike, is there a reason why you're indexing category as keyword
 not text?
 k
 On Tue, 8 Feb 2005 08:26:13 -0600, Mike Miller wrote:
 Thanks for the quick response.
 Sorry for my lack of understanding, but I am learning!  Won't the
 query parser still handle this query?  My limited understanding was
 that the search call provides the 'all' field as default field for
 query terms in the case where fields aren't specified.  Using the
 current code, searches like author:Mike and title:Lucene work fine.
 -Original Message-
 From: Miles Barr [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, February 08, 2005 8:08 AM
 To: Lucene Users List
 Subject: Re: Problem searching Field.Keyword field

 You're using the query parser with the standard analyser. You  
 should construct a term query manually instead.

 --
 Miles Barr [EMAIL PROTECTED] Runtime Collective Ltd.



Re: Optimize not deleting all files

2005-02-04 Thread Steven Rowe
Hi Patricio,
Is it the case that the old index files are not removed from session to
session, or only within the same session?  The discussion below pertains to
the latter case, that is, where the old index files are used in the same
process as the files replacing them.
I was having a similar problem, and tracked the source down to IndexReaders
not being closed in my application.  

As far as I can tell, in order for IndexReaders to present a consistent
view of an index while changes are being made to it, read-only copies
of the index are kept around until all IndexReaders using them are
closed.  If any IndexReaders are open on the index, IndexWriters first
make a copy, then operate on the copy.  If you track down all of these
open IndexReaders and close them before optimization, all of the
old index files should be deleted.  (Lucene Gurus, please correct this
if I have misrepresented the situation).
In my application, I had a bad interaction between IndexReader caching,
garbage collection, and incremental indexing, in which a new IndexReader
was being opened on an index after each indexing increment, without
closing the already-opened IndexReaders.
On Windows, operating-system level file locking caused by IndexReaders
left open was disallowing index re-creation, because the IndexWriter
wasn't allowed to delete the index files opened by the abandoned
IndexReaders.
In short, if you need to write to an index more than once in a single
session, be sure to keep careful track of your IndexReaders.
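A hypothetical sketch of that discipline (1.4-era API; names are invented for the example):

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class IndexMaintenance {
    // Close any cached reader before optimizing, so the old
    // segment files are no longer held open and can be deleted.
    public static void optimizeIndex(String indexDir, IndexReader cachedReader)
            throws IOException {
        if (cachedReader != null) {
            cachedReader.close();
        }
        // false = open the existing index rather than create a new one
        IndexWriter writer =
            new IndexWriter(indexDir, new StandardAnalyzer(), false);
        writer.optimize();
        writer.close();
        // Open a fresh IndexReader afterwards if searching continues.
    }
}
```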
Hope it helps,
Steve
Patricio Keilty wrote:
Hi Otis, tried version 1.4.3 without success, old index files still 
remain in the directory.
Also tried not calling optimize(), and still getting the same behaviour; 
maybe our problem is not related to the optimize() call at all.

--p
Otis Gospodnetic wrote:
Get and try Lucene 1.4.3.  One of the older versions had a bug that was
not deleting old index files.
Otis
--- [EMAIL PROTECTED] wrote:

Hi,
When I run an optimize in our production environment, old index files are
left in the directory and are not deleted.
My understanding is that an optimize will create new index files and all
existing index files should be deleted.  Is this correct?

We are running Lucene 1.4.2 on Windows. 

Any help is appreciated.  Thanks!



Re: Lucene docs

2004-09-15 Thread Steven Rowe
URL:http://wiki.apache.org/jakarta-lucene/IntroductionToLucene
Ian McDonnell wrote:
What is the best resource for beginners looking to understand
Lucene's functionality, i.e. its use of fields, documents, the index
reader and writer, etc.?
Is there any web resource that goes into detail on the exact
workings of it?



Re: word documents search

2004-08-30 Thread Steven Rowe
Hi Lisheng,
You missed a fork in this topic posted on August 24th.  It answers all 
your questions and debunks the "textmining wraps POI" myth:

URL:http://www.mail-archive.com/[EMAIL PROTECTED]/msg09168.html
Steve
Zhang, Lisheng wrote:
Hi Otis,
I looked at the textmining site; it seems to me textmining
is a wrapper on top of POI, so the basic features
should be the same as POI. Is this true?
I have tested POI with Lucene; in general it works fine,
but I found it sometimes cannot process some MS Word DOC files
created by an old version. But if I just save the old
DOC file with a new Word on XP, everything is fine.

Thanks very much for helps, 

Lisheng
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 10:24 AM
To: Lucene Users List
Subject: Re: word documents search
As I just answered in a separate email to Ryan - we used textmining.org
library, too, as an example of something that is easier to use than
POI.  It's been a while since I wrote that chapter, so it slipped my
mind when I replied.  Yes, use textmining.org first, you'll be able to
include it in your code in 2 minutes.  Good stuff.
Otis



Introduction to Lucene [was Re: word documents search]

2004-08-25 Thread Steven Rowe
A collection of links to introductory level Lucene articles (including 
one in simplified Chinese and one in Turkish) is available on the 
Lucene Wiki at:

URL:http://wiki.apache.org/jakarta-lucene/IntroductionToLucene
Steve
Otis Gospodnetic wrote:
That part you have to do yourself.  It is easy: just create a new
Document, create an appropriate Field, give it a name and the string
value you got with the textmining.org library, then add the Field to your
Document, and then add the Document to the index with IndexWriter.
Look at one of the articles about Lucene to get started.  I wrote one
called something like "Introduction to Text Indexing with Lucene".  You
probably want to read that one to get going.
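The steps Otis lists come out to only a few lines (a 1.4-era sketch; "extractedText" stands in for whatever string the textmining.org extractor returned, and the field name "contents" is just an example):

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class WordDocIndexer {
    public static void index(String indexDir, String extractedText)
            throws IOException {
        // true = create a new index at indexDir
        IndexWriter writer =
            new IndexWriter(indexDir, new StandardAnalyzer(), true);
        Document doc = new Document();
        // Field.Text: tokenized, indexed, and stored (1.4 factory method)
        doc.add(Field.Text("contents", extractedText));
        writer.addDocument(doc);
        writer.close();
    }
}
```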
Otis
--- Santosh [EMAIL PROTECTED] wrote:
I have gone through textmining.org; I am able to extract text in
string format. But how can I get it into Lucene Document format?
- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 11:54 PM
Subject: Re: word documents search
As I just answered in a separate email to Ryan - we used textmining.org
library, too, as an example of something that is easier to use than
POI.  It's been a while since I wrote that chapter, so it slipped my
mind when I replied.  Yes, use textmining.org first, you'll be able to
include it in your code in 2 minutes.  Good stuff.
Otis



Re: question on setting boost factor

2004-07-01 Thread Steven Rowe
Repaired URL (was extra space before Similarity.html):
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#coord(int,%20int)
Corresponding Tiny URL:
URL:http://tinyurl.com/3bo8y
Erik Hatcher wrote:
On Jun 22, 2004, at 7:30 AM, Anson Lau wrote:
Hi guys,
Let's say I want to search the term "hello world" over 3 fields with
different boosts:
((field1:hello field1:world)^0.001 (field2:hello field2:world)^100
(field3:hello field3:world)^2)
Note I've given field1 a really low boost, a heavy boost to field2 and a
REALLY heavy boost to field3.
What is happening to me is that a term that matches both field1 and
field2 will have a higher score than a term that matches field3 only,
even though field3's boost is WAY higher.

Can I change this behaviour such that the match in field3 only will
actually have a higher score because of the boost?

First step is to get familiar with the actual factors coming out in the
IndexSearcher.explain() output (just System.out.println the Explanation
object).  The coord() factor -
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html#coord(int,%20int)
- is what you'll want to tweak to change how scores are affected when
multiple terms match, by creating your own DefaultSimilarity subclass
(and probably just returning 1.0).  Read the javadocs for Similarity to
see how to hook in your own implementation (see also section).
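Erik's suggestion amounts to a tiny subclass (a 1.4-era sketch, untested):

```java
import org.apache.lucene.search.DefaultSimilarity;

// Neutralize the coordination factor so a strong match in a single
// heavily-boosted field is not penalized for matching fewer fields.
public class NoCoordSimilarity extends DefaultSimilarity {
    public float coord(int overlap, int maxOverlap) {
        return 1.0f;
    }
}
```

Hook it in before searching, e.g. searcher.setSimilarity(new NoCoordSimilarity()).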

Erik


Re: escaping special characters while doing search doesn't seem to work

2004-06-30 Thread Steven Rowe
Hi Polina,
Try this (jGuru Lucene FAQ item):
URL:http://www.jguru.com/faq/view.jsp?EID=538308
Or, better yet, this (the Lucene Wiki AnalysisParalysis page):
URL:http://wiki.apache.org/jakarta-lucene/AnalysisParalysis
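If the term was indexed untokenized (e.g. with Field.Keyword), one common workaround is to bypass QueryParser and its analyzer entirely; a hypothetical sketch (the field name "code" is invented):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

public class ExactTermSearch {
    // Assumes the exact term (hyphen and all) exists in the index,
    // i.e. the field was never run through a tokenizing analyzer.
    public static TermQuery exactMatch(String value) {
        return new TermQuery(new Term("code", value));
    }
}
```

For example: searcher.search(ExactTermSearch.exactMatch("ABC-DEFG")).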
Steve
Polina Litvak wrote:
I was trying to search my index for a term of the form a*-b* (e.g.
ABC-DEFG). While tracing the code I noticed that Lucene breaks this term
into two terms, ABC and DEFG. To prevent this, I tried escaping the
special character "-" with "\" to form the term ABC\-DEFG, and now
Lucene search can't find this term in the index.

Does anyone know about this already?  Is this a bug, or am I doing
something wrong?
Thanks, 
Polina




Re: [Fwd: PROPOSAL: Lucene external content store for stored fields]

2004-06-15 Thread Steven Rowe
Kevin,
I think that this sort of thing should be built on top of the 
functionality provided by the binary fields proposal, or at least made 
to work with it:

URL:http://issues.apache.org/bugzilla/show_bug.cgi?id=29370
This would take care of the blob-vs.-text aspect of your proposal.
Also:
Kevin Burton wrote:
Supporting full Unicode is important.  Full java.lang.String storage is
used with String.getBytes(), so we should be able to avoid Unicode issues.
If Java has a correct java.lang.String representation, it's possible to
easily add Unicode support just by serializing the byte representation.
(Note that the JDK says that the DEFAULT system char encoding is used, so
if this is ever changed it might break the index.)
It's a bad idea to use the zero-parameter version of String.getBytes() 
(for example, what if you want to share an index between two platforms 
with different DEFAULT system char encodings?).  Fortunately, there's a 
better alternative: for the surprisingly low price of 
String.getBytes(String charsetName), platform independence can be 
yours today.
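Concretely, pinning the charset makes the byte representation identical on every platform (class and method names here are invented for the example):

```java
import java.io.UnsupportedEncodingException;

public class PortableBytes {
    // Encode with an explicit charset instead of the platform default.
    public static byte[] toUtf8(String s) {
        try {
            return s.getBytes("UTF-8");   // same bytes everywhere
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be supported by every JVM.
            throw new RuntimeException(e);
        }
    }

    // Decode with the same explicit charset for a lossless round trip.
    public static String fromUtf8(byte[] bytes) {
        try {
            return new String(bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }
}
```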

Steve


Re: Where does the name lucene come from?

2004-05-05 Thread Steven Rowe
Til Schneider wrote:
Hi,

Working now for a few months with this really great search engine, I was 
wondering where the name "Lucene" comes from. What does it mean? Is 
there any deeper sense?
Doug Cutting's response:
URL:http://tinyurl.com/2hh5c
(full original URL: 
URL:http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=961817
)

Otis, shouldn't this be an FAQ?

Steve
