BTW, what's wrong with the DateFilter solution I mentioned earlier?
I've used it before (before lucene-1.4 though) without memory problems,
thus I always assumed that it avoided the allocation problems with prefix
queries.
sv
On Mon, 4 Oct 2004, Chris Fraschetti wrote:
Surely some folks out
current
problem is in regards to a stand alone a* query giving me
the too many clauses exception
On Mon, 4 Oct 2004 12:47:24 -0400 (EDT), Stephane James Vaucher
[EMAIL PROTECTED] wrote:
BTW, what's wrong with the DateFilter solution I mentioned earlier?
I've used it before
to
prevent the user from doing a* but not comment* wanting comments
or commentary... any suggestions would be warmly welcomed.
On Mon, 4 Oct 2004 14:08:00 -0400 (EDT), Stephane James Vaucher
[EMAIL PROTECTED] wrote:
Ok, got it, got a small comment though.
For large wildcard queries
))
But when a wildcard is used, it is taken as
+ITM_SHRT_DSC:nat* tylenol +ITM_LONG_DSC:nat* Tylenol
ITM_XXX fields are tokenized?
sv
The first returns around 300 records, the second 0.
any help would be appreciated
Thanks
Robin
On Fri, 1 Oct 2004 02:06:04 -0400 (EDT), Stephane James
Can you be a little more precise about how you process your documents?
1) What's your analyser? SimpleAnalyzer?
2) How do you parse the query? Out-of-the-box QueryParser?
Can we not enter a space or do an OR search with two words, one of which
has a wildcard?
Simple answer, yes.
Complicated
How about a DateFilter?
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/DateFilter.html
I don't believe it's got the same restrictions as boolean queries.
HTH,
sv
On Thu, 30 Sep 2004, Chris Fraschetti wrote:
I recently read in regards to my problem that
Hi Tate,
From the commit:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06510.html
I'd say you can use the german WordListLoader (renaming it or using a
nightly cvs version of the refactored class). I think there might be a
versioning issue here as from:
. So you are
saying that I should have it as Field.Text?
Also, if I need to just search on that title but want to retrieve the
value of another field content, then title should be unstored while
content should be stored?
Regards,
Niraj
- Original Message -
From: Stephane James
On Wed, 1 Sep 2004, Niraj Alok wrote
I was also thinking on the same lines.
Actually, the original code was written by someone else who has left, so
I have to own this.
In almost all places it is Field.Text, and in a few places it's
Field.UnIndexed.
I looked at the javadocs and
A description on how to search numerical fields is available on the wiki:
http://wiki.apache.org/jakarta-lucene/SearchNumericalFields
sv
On Thu, 26 Aug 2004, Alex Kiselevski wrote:
Thanks, I'll try it
-Original Message-
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Thursday,
You'll have to give us more information than that...
What is the problem you are seeing? I'll assume that you get no results.
Tell us of the structure of your documents and how you index every field.
Concerning your syntax, if you are using the distributed query parser, you
don't need the +
-
From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 25, 2004 10:07 AM
To: Lucene Users List
Subject: Re: what is wrong with query
You'll have to give us more information than that...
What is the problem you are seeing? I'll assume that you get no results.
Tell us
I don't think that the demo parser is meant as a production
system component. You can look at Tidy or NekoHtml. They cleanup your html
and are probably optimised.
sv
On Wed, 25 Aug 2004, Hetan Shah wrote:
Hello all,
Is there a way to reduce the indexing time taken when the indexer is
/HtmlDocument.java?rev=1.3&view=markup
HTH,
sv
On Wed, 25 Aug 2004, Hetan Shah wrote:
Do you have any pointers for sample code for them?
Would highly appreciate it.
Thanks.
-H
Stephane James Vaucher wrote:
I don't think that the demo parser is meant as a production
system component. You can look
Hetan,
If you are using a corpus with multiple editors, I suggest that you
use a cleaner like tidy as there might be weird stuff appearing in the
html.
sv
On Thu, 26 Aug 2004, Karthik N S wrote:
Hi Hetan
That's the major problem with non-standardized tags in HTML documents
you are
You need to add log4j to your classpath:
http://logging.apache.org/log4j/docs/
sv
On 24 Aug 2004, sivalingam T wrote:
Hi
I have written a file for PDF indexing, as follows.
This is my IndexPDF file.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
Your classpath should point to a directory that contains log4j.properties,
not the file directly, see below.
sv
On Mon, 23 Aug 2004, Santosh wrote:
Hi natarajan,
I kept log4j.properties in the classpath
my new classpath is
C:\j2sdk1.4.1\lib\log4j.properties;
should be C:\j2sdk1.4.1\lib\
Hi Simon,
Does this work? From FSDirectory api:
If the system property 'disableLuceneLocks' has the String value of
true, lock creation will be disabled.
Otherwise, I think there was a Read-Only Directory hack:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg05148.html
HTH,
sv
On Mon, 23
as I tried using it and it's not recognising the
class?
Many Thanks
Simon
- Original Message -
From: Stephane James Vaucher [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, August 23, 2004 2:22 PM
Subject: Re: Lucene Search Applet
Hi Simon,
Does
On Mon, 23 Aug 2004, Erik Hatcher wrote:
On Aug 23, 2004, at 10:48 AM, Stephane James Vaucher wrote:
I haven't used it, and I'm a little confused from the code:
/** ...
* <p>If the system property 'disableLuceneLocks' has the String value
of
* true, lock creation will be disabled
Stupid question:
Are you sure you have the right number of docs in your index? i.e. you're
not adding the same document twice into or via your tmp index.
sv
On Thu, 19 Aug 2004, Rob Jose wrote:
Paul
Thank you for your response. I have appended to the bottom of this message
the field
From: Doug Cutting
http://www.mail-archive.com/[EMAIL PROTECTED]/msg08757.html
An index typically requires around 35% of the plain text size.
I think it's a little big.
sv
On Wed, 18 Aug 2004, Rob Jose wrote:
Hello
I have indexed several thousand (52 to be exact) text files and I keep
On Tue, 17 Aug 2004, Patrick Burleson wrote:
Forward back to list.
-- Forwarded message --
From: Patrick Burleson [EMAIL PROTECTED]
Date: Tue, 17 Aug 2004 11:30:19 -0400
Subject: Re: Swapping Indexes?
To: Stephane James Vaucher [EMAIL PROTECTED]
Stephane,
Thank you
On Tue, 17 Aug 2004, Patrick Burleson wrote:
On Tue, 17 Aug 2004 13:17:10 -0400 (EDT), Stephane James Vaucher
Actually, I use a IndexWriter in overwrite mode on the master dir and
merge the temp dir. This cleans up the old master.
I'm a bit of a Lucene newbie here, and I am trying
I've tried two options that seem to work:
1) Have a singleton that is responsible for controlling your searchers.
This controller can temporarily redirect your searchers to
c:/temp/myindex, allowing you to copy your index to c:/myindex. After that
process completes, your controller can tell
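A rough sketch of what such a controller could look like, assuming it only has to hand out the path of the index that searchers should currently open. The class and method names are hypothetical and nothing here is a Lucene API; opening and closing the actual searchers is left out.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical controller (not a Lucene class): tracks which index
// directory searchers should use while the primary one is being replaced.
final class IndexLocator {
    // Paths taken from the message above.
    private static final IndexLocator INSTANCE = new IndexLocator("c:/myindex");

    private final String primaryPath;
    private final AtomicReference<String> activePath;

    private IndexLocator(String primaryPath) {
        this.primaryPath = primaryPath;
        this.activePath = new AtomicReference<>(primaryPath);
    }

    static IndexLocator getInstance() {
        return INSTANCE;
    }

    // Path that new searchers should be opened on right now.
    String activePath() {
        return activePath.get();
    }

    // Point searchers at a temporary copy while the primary index is rewritten.
    void redirectTo(String temporaryPath) {
        activePath.set(temporaryPath);
    }

    // Switch back once the copy to the primary location has completed.
    void restorePrimary() {
        activePath.set(primaryPath);
    }

    public static void main(String[] args) {
        IndexLocator locator = getInstance();
        locator.redirectTo("c:/temp/myindex");
        System.out.println("searching on: " + locator.activePath());
        locator.restorePrimary();
        System.out.println("searching on: " + locator.activePath());
    }
}
```

Searchers opened before a switch still point at the old directory, so they would need to be closed and reopened when the active path changes.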
Other indexing strategies:
- AFAIK, you could probably cheat by multiplying the number of tokens in
headers thus affecting the scoring.
For example:
<h1>hello world</h1> <p> foo bar </p>
content -> hello world hello world foo bar
This is not very tweakable though.
- As Tate suggests, you can also use
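The token-multiplying idea above can be sketched as a small pre-processing step that runs before the text is handed to the indexer. This is only an illustration: the class name, method name and the `weight` parameter are made up, and it assumes the header and body text have already been extracted from the HTML.

```java
// Hypothetical pre-processing helper (not a Lucene API): repeat the header
// text so that its terms occur more often in the indexed content.
final class HeaderWeighting {
    // Returns the header repeated 'weight' times, followed by the body text.
    static String weightedContent(String header, String body, int weight) {
        if (weight < 1) {
            throw new IllegalArgumentException("weight must be >= 1");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < weight; i++) {
            sb.append(header).append(' ');
        }
        return sb.append(body).toString();
    }

    public static void main(String[] args) {
        // Same input as the example above, with the header counted twice.
        System.out.println(weightedContent("hello world", "foo bar", 2));
    }
}
```

The only knob is the repeat count, which is why this approach is hard to tune.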
What is the exception? Is hits null or the index (i) out of bounds?
sv
On Tue, 3 Aug 2004, xuemei li wrote:
Hi all,
I am using lucene to search. When I use the console to run my code it works
fine. But after I put my code in a servlet, it throws an exception. Here is
my exception code:
Assuming you are using a FSDirectory and have the appropriate permissions,
yup.
sv
On Thu, 29 Jul 2004, Ian McDonnell wrote:
Is this done simply by saying:
String indexDirectory = "/path of directory you want index to be stored in"
Ian
I don't know if this helps, but this is what I do. I believe this is
correct, but I have just finished impl and haven't tested it fully:
- keep reference to valid searcher
- open a reader on the old index
- open a writer to a tmp Directory (RAM or FS)
- find removed/modified files, remove from
IIRC, if you use a searcher, changes are picked up right away. With a
reader, I would expect it should react the same way.
<disclaimer>I'm not a lucene guru, I might be wrong</disclaimer>
Where I'm less sure is with a FSDirectory, as it uses an internal
RAMDirectory. If two separate processes (within
On Thu, 13 May 2004, Matt Quail wrote:
do you know of any method to reduce the memory consumption of lucene
when searching?
It depends on the complexity of the search, I think. Also, I believe
scoring might use more memory than the search itself (can anyone confirm
this?). For example, I
I've seen this:
http://www.jguru.com/faq/view.jsp?EID=538312
I've seen in the code that there is a method to set lowercasing, but I
need to remove accented chars as well. Any suggestions as to which is
preferable, preprocessing the input or subclassing a QueryParser and
redefining
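For the preprocessing option, here is a minimal sketch of stripping accents and lowercasing a string before it reaches the analyzer or the query parser. It relies on `java.text.Normalizer`, which only exists from Java 6 onwards, so it is an assumption that this is available; the class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical pre-processing helper (not a Lucene class).
final class AccentFolding {
    // Decomposes accented characters (NFD), drops the combining marks,
    // then lowercases the result.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "Montr\u00e9al" is "Montreal" with an acute accent on the e.
        System.out.println(fold("Montr\u00e9al"));
    }
}
```

The same folding has to be applied to both the indexed text and the query text, otherwise accented queries will not match. Characters that have no decomposition (such as the German sharp s) pass through unchanged.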
Quick reference:
http://wiki.apache.org/jakarta-lucene/SearchNumericalFields
If you are stuck, you can always encode the long in a string format (the
date formatter in lucene might do this already). Or even, you could also
treat it like a date and use your long like a date filter.
HTH,
sv
On
On Thu, 29 Apr 2004, Tate Avery wrote:
Hello,
I have been reviewing some of the code related to boolean queries and I
wanted to see if my understanding is approximately correct regarding how
they are handled and, more importantly, the limitations.
You can always submit requests for
?
OR is there some way to extend this limit?
OR am I missing something that is clouding my understanding?
Thanks,
Tate
-Original Message-
From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 29, 2004 1:10 PM
To: Lucene Users List; [EMAIL PROTECTED]
Cc: [EMAIL
I'm a bit confused why you want this.
As far as I know, relational db searches will return exact
matches without a measure of relevancy. To measure relevancy, you need a
search engine. For your results to be coherent, you would have to put
everything in the lucene index.
As for memory
I suggest you look at:
http://www.manageability.org/blog/stuff/open-source-web-crawlers-java
From what I know of nutch, it's meant as the basis for a competitor to the
big search engines (i.e. google). For a small web site, it might be
overkill especially if it requires you to build from CVS
I would have to agree with Surya's diagnosis, can you give us details on
your update process?
Please include OS, and if there are some non-java processes (e.g. doing
copies).
cheers,
sv
On Mon, 26 Apr 2004, Nader S. Henein wrote:
Can you give us a bit of background, we've been using Lucene
I don't know what you think of the NewFrontPage, but if you like it, I
could do a switch, renaming old FrontPage to OldFrontPage and the new one
to FrontPage.
Also, if anyone knows how to do this, it would be appreciated. I haven't
figured out yet how to rename/destroy pages (are there
From my experience (that is little experience;)), fields that are not
tokenised, are stored separately. Someone more qualified can surely give
you more details.
You can look at your index with Luke, it might be insightful.
sv
On Thu, 22 Apr 2004, Gerard Sychay wrote:
Hello,
I am wondering
This is not normal behaviour. Normally using a new IndexSearcher should
reflect the modified state of your index. Could you post a more
informative bit of code?
sv
On Wed, 21 Apr 2004 [EMAIL PROTECTED] wrote:
Hi!
My Searcher's instance is not aware of changes to the index. I even create a
.
Quick comment: you should call .close() on your searcher before removing
the reference. If this causes exceptions in future searches, it would
indicate incorrect caching.
HTH,
sv
On Wed, 21 Apr 2004 [EMAIL PROTECTED] wrote:
On Wednesday 21 April 2004 19:20, Stephane James Vaucher wrote
How big is the site?
I mostly use an in-house solution, but I've used HttpUnit for web scraping
small sites (because of its high-level api).
Here is a hello world example:
http://wiki.apache.org/jakarta-lucene/HttpUnitExample
For a small/simple site, small modifications to this class could
on win2k. I
called them with Runtime.exec(). There are still problems: all three hang
up sometimes. Otherwise, it worked. I indexed several hundreds of
thousands files in development mode. I never got into production.
Argyn
On Mon, 19 Apr 2004 16:53:41 -0400 (EDT), Stephane James Vaucher
I'll make a copy of the code available on the wiki before it disappears
off the Web.
Now for some info on using OO on a production system:
http://www.oooforum.org/forum/viewtopic.php?t=2913&highlight=jurt
<summary src="Web, not my experience">OO works well (but is slow), but is
not multi-threaded
Actually, the objective would be to use OO to extract text from MSOffice
formats. If I read your code correctly, your code should only work with OO
as the docs are in xml.
Thanks for the code for OO docs though,
sv
On Mon, 19 Apr 2004, Mario Ivankovits wrote:
Stephane James Vaucher wrote
Since it has been moved to the Sandbox, someone should remove the term
highlighter reference.
http://jakarta.apache.org/lucene/docs/contributions.html
Miscellaneous - Term Highlighter
cheers,
sv
-
To unsubscribe, e-mail:
Anyone try what Joerg suggested here?
http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgNo=6231
sv
totally rocks!
=Matt
Stephane James Vaucher wrote:
Hi everyone,
I did a presentation tonight in Montreal at a java users group
meeting.
I've got to say that there were maybe 4 companies present that use
Lucene
and find it very useful and simple to use. It led to the longest
Hi everyone,
I did a presentation tonight in Montreal at a java users group meeting.
I've got to say that there were maybe 4 companies present that use Lucene
and find it very useful and simple to use. It led to the longest
discussion (positive that is) I've had at the users' group.
So I've got
Wow, discussing Lucene in French for 2 1/2 hours has affected my English.
Please ignore spelling mistakes ;), but don't ignore the spirit of the
message.
sv
On Thu, 15 Apr 2004, Stephane James Vaucher wrote:
Hi everyone,
I did a presentation tonight in Montreal at a java users group meeting
I'm actually pretty lazy about index updates, and haven't had the need for
efficiency, since my requirement is that new documents should be
available on a next working day basis.
I reindex everything from scratch every night (400,000 docs) and store it
in a timestamped index. When the
I'm wondering if there is interest for a simple spider demo.
I've got an example of how to use HttpUnit to spider on a web site and
have it index it on disk (only html page now). I can send it to the list
if anyone is interested (it's one class, 200 loc).
cheers,
sv
Looks cool, but I've got a question:
How do you handle symlinks on *nix? I think it's stuck in a loop.
When indexing my home dir, I see it indexing:
/home/vauchers/.Cirano-gnome/.gnome-desktop/Home directory/.Cirano-gnome/...
cheers,
sv
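One common way to avoid this kind of loop, sketched here without knowing how the tool in question walks directories, is to remember the canonical path of every directory already visited and skip any directory seen before. The class below is a made-up example that only counts files; a real indexer would process each file instead.

```java
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical directory walker (not taken from any existing tool).
final class LoopSafeWalker {
    // Counts regular files under root, entering each real directory only once.
    static int countFiles(File root) throws IOException {
        return walk(root, new HashSet<String>());
    }

    private static int walk(File dir, Set<String> seen) throws IOException {
        // getCanonicalPath() resolves symlinks, so a link pointing back to an
        // ancestor yields a path that is already in the set.
        if (!seen.add(dir.getCanonicalPath())) {
            return 0;
        }
        File[] children = dir.listFiles();
        if (children == null) {
            return 0; // not a directory, or not readable
        }
        int count = 0;
        for (File child : children) {
            if (child.isDirectory()) {
                count += walk(child, seen);
            } else if (child.isFile()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        File root = new File(args.length > 0 ? args[0] : ".");
        System.out.println(countFiles(root) + " files under " + root);
    }
}
```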
On Wed, 14 Apr 2004, Peter Becker wrote:
Hello,
we
I've uploaded it to the wiki:
http://wiki.apache.org/jakarta-lucene/HttpUnitExample
<disclaimer>
It's not anywhere close to production quality, especially since it's based
on a unit test framework.
</disclaimer>
sv
On Tue, 13 Apr 2004, Stephane James Vaucher wrote:
I'm wondering
It could be part of your solution, but I don't think so. Let me explain:
I've done something similar to what you describe a few times. I
often use HttpUnit to get information. How you process it is up
to you. If you want it to be indexed (searchable), you can use Lucene. If
you want
Hello all,
The link to Mark Harwood's highlight package is down, anyone
have any idea where his package would be available?
cheers,
sv
Added to the wiki, can of course be removed if it's transferred to the
FAQs.
sv
On Sun, 4 Apr 2004, Kevin A. Burton wrote:
Stephane James Vaucher wrote:
Hi Tate,
There is a solution by Erik that pads numbers in the index. That would
allow you to search correctly. I'm not sure about
Hi Tate,
There is a solution by Erik that pads numbers in the index. That would
allow you to search correctly. I'm not sure about decimal, but you could
always add a multiplier.
HTH,
sv
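To make the multiplier idea concrete, here is a small sketch that turns a decimal into a fixed-width, zero-padded integer string before indexing. It is an assumption-laden example: the class name, the `scale` and `width` parameters are invented, and it only handles non-negative values.

```java
import java.math.BigDecimal;

// Hypothetical encoder (not a Lucene class).
final class ScaledDecimal {
    // Multiplies the value by 10^scale and left-pads the result with zeros.
    static String encode(String decimal, int scale, int width) {
        BigDecimal value = new BigDecimal(decimal);
        if (value.signum() < 0) {
            throw new IllegalArgumentException("negative values are not handled");
        }
        // toBigIntegerExact() throws ArithmeticException if digits would be lost.
        String digits = value.movePointRight(scale).toBigIntegerExact().toString();
        if (digits.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        // Two decimal places, eight characters wide.
        System.out.println(encode("12.5", 2, 8));
        System.out.println(encode("9.99", 2, 8));
    }
}
```

The scale and width have to be chosen up front and used for every document and every query, since terms encoded with different settings will not compare correctly.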
On Fri, 2 Apr 2004, Tate Avery wrote:
Hello,
Is there a way (direct or indirect) to support a field
Another possibility is to add all combinations in a single field.
addField("category", "/Science/");
addField("category", "/Science/Medicine");
addField("category", "/Science/Foo");
addField("category", "/Biology");
Your wildcard search should work, and you shouldn't have the problem with
a search /Science/*.
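If the categories are hierarchical, the values to add for one document could be generated from its full path rather than listed by hand. The helper below is a sketch under that assumption; its name is made up, and it returns plain strings that would then be passed to whatever field-adding call you use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Lucene class).
final class CategoryPaths {
    // Expands a path into its ancestor prefixes plus the path itself,
    // e.g. "/Science/Medicine" gives "/Science/" and "/Science/Medicine".
    static List<String> expand(String path) {
        List<String> values = new ArrayList<>();
        int slash = path.indexOf('/', 1);
        // Every slash that is not the last character closes an ancestor.
        while (slash >= 0 && slash < path.length() - 1) {
            values.add(path.substring(0, slash + 1));
            slash = path.indexOf('/', slash + 1);
        }
        values.add(path);
        return values;
    }

    public static void main(String[] args) {
        for (String value : expand("/Science/Medicine")) {
            System.out.println(value);
        }
    }
}
```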
I agree with you that a highlight package should be available directly
from the lucene website. To offer this much-desired feature, having a
dependency on a personal web site seems a little weird to me. It would
also force the community to support this functionality, which would seem
Mark,
Thanks for the update, since I contributed the page, I was going to modify
it (I don't want to force work on others).
sv
On Mon, 29 Mar 2004 [EMAIL PROTECTED] wrote:
Hi Doug,
Thanks for the post. BoostingQuery looks to be cleaner, faster and more generally
useful than my
I've added some information contained on this thread on the wiki.
http://wiki.apache.org/jakarta-lucene/DateRangeQueries
If you wish to add more information, go right ahead, but since I added
this info, I believe it's ultimately my responsibility to maintain it.
sv
On Mon, 29 Mar 2004, Kevin
Are the javadocs available on the site?
I'd like to see the javadocs for lucene-1.4 (specifically SpanQuery)
somewhere on the lucene website. I've subscribed to the users mailing
list, but I've never got a feel for the new version. Is there any way
for this to happen, or should I await 1.4-rc1?
Mark,
I've added a section in the wiki called:
http://wiki.apache.org/jakarta-lucene/CommunityContributions
and have added an entry for your message. If you want to edit the
message, go for it. I believe that the wiki can support attached files if
you want to upload there.
cheers,
sv
On Sun,
I'm personally a fan of a release small but often approach, but what are
the new features available in 1.4 (a list would be nice, on the wiki
perhaps)? Will there be interim builds available to try these new features
out soon?
There seem to be no nightly builds on:
go ahead and do it and await criticism ;)
cheers,
sv
On Fri, 26 Mar 2004, Erik Hatcher wrote:
On Mar 26, 2004, at 3:32 PM, Stephane James Vaucher wrote:
I'm personally a fan of a release small but often approach, but what
are
the new features available in 1.4 (a list would be nice
estimates.
Erik
On Mar 26, 2004, at 6:00 PM, Stephane James Vaucher wrote:
Hello lucene community,
I'll be presenting lucene at the GUJM (Java Users Group of Montreal),
mid-April, could you send me references, articles, presentations not
readily available on the lucene site
On the wiki, I've looked up some reference for lucene community releases
to put under News (http://wiki.apache.org/jakarta-lucene/LatestNews), if
I've missed some, you can modify the page yourself (it's a wiki after
all).
sv
-
Just found the rest of the thread. I'll shut up now ;)
sv
On Sun, 14 Mar 2004, Stephane James Vaucher wrote:
Back from a week's vacation, so this reply is a little late, maybe out of
order as well ;). Comment inline:
On Tue, 9 Mar 2004, Kevin A. Burton wrote:
Doug Cutting wrote
Weird idea, how about transforming your long into a Date and using a
DateFilter to use a ranged query?
sv
On Fri, 5 Mar 2004, Erik Hatcher wrote:
Terms in Lucene are text. If you want to deal with number ranges, you
need to pad them.
0001 for example. Be sure all numbers have the
On Fri, 5 Mar 2004 [EMAIL PROTECTED] wrote:
On Friday 05 March 2004 18:01, Erik Hatcher wrote:
0001 for example. Be sure all numbers have the same width
and zero padded.
And what about a range like 100 TO 1000?
You mean 0100 To 1000 or 100 to 0001000 ;)
sv
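The point about widths is easy to check with plain string comparison, which is how text terms are ordered. A small sketch with made-up helper names (this is not the query parser, just the comparison it would depend on):

```java
import java.util.Locale;

// Hypothetical helper (not a Lucene class).
final class PaddedNumbers {
    // Left-pads a non-negative number with zeros to a fixed width.
    static String pad(long n, int width) {
        if (n < 0) {
            throw new IllegalArgumentException("negative numbers are not handled");
        }
        String padded = String.format(Locale.ROOT, "%0" + width + "d", n);
        if (padded.length() > width) {
            throw new IllegalArgumentException(n + " does not fit in " + width + " digits");
        }
        return padded;
    }

    // Inclusive range check using lexicographic (text) ordering.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Unpadded: "200" sorts after "1000", so it falls outside 100 TO 1000.
        System.out.println(inRange("200", "100", "1000"));
        // Padded to the same width, the text order matches the numeric order.
        System.out.println(inRange(pad(200, 4), pad(100, 4), pad(1000, 4)));
    }
}
```

So both bounds of a range such as 100 TO 1000 need the same padding as the indexed terms, whatever width is chosen.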
As I've stated in my earlier mail, I like this change. More importantly,
could this become a standard way of changing configurations at runtime?
For example, the default merge factor could also be set in this manner.
sv
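As an illustration of the idea, a runtime setting like the merge factor could be read from a system property with a fallback default. The property name and the default value below are assumptions made for the example, not something confirmed by this thread, and the class is hypothetical.

```java
// Hypothetical configuration helper (not a Lucene class).
final class RuntimeConfig {
    // Assumed property name and default, for illustration only.
    static final String MERGE_FACTOR_PROPERTY = "org.apache.lucene.mergeFactor";
    static final int DEFAULT_MERGE_FACTOR = 10;

    // Returns the configured value, or the default when the property is
    // missing or is not a valid integer.
    static int mergeFactor() {
        return Integer.getInteger(MERGE_FACTOR_PROPERTY, DEFAULT_MERGE_FACTOR);
    }

    public static void main(String[] args) {
        // Try: java -Dorg.apache.lucene.mergeFactor=50 RuntimeConfig
        System.out.println("merge factor: " + mergeFactor());
    }
}
```

Reading the value on every call, rather than caching it in a static field, means a property set after startup is still picked up.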
On Wed, 3 Mar 2004, Michael Duval wrote:
I agree with both the
at this, I'll see what I can do in 2 weeks, after my vacations.
Cheers,
sv
On Wed, 3 Mar 2004, Doug Cutting wrote:
Stephane James Vaucher wrote:
As I've stated in my earlier mail, I like this change. More importantly,
could this become a standard way of changing configurations at runtime
I've done something similar to configure my merge factor (but it was
outside my code), and am planning on setting the limit on boolean queries
this way as well. I think it's pretty clean especially if you use
org.apache.lucene.xxx properties with decent default values.
Adding this feature could
Slightly off topic to this thread, but how would adding different fields
with the same name deal with boosts? I've looked at the javadoc and FAQ,
but I think it's not a common use of this feature, any insight?
E.G.
Document doc = new Document();
Field f1 = Field.Keyword(fieldName, foo);
, I can't add my confidence as boosts to the terms as I index
them, do you have any suggestions? I'm guessing that I'll probably have to
add multiple copies of my fields to simulate boosting.
sv
On Fri, 27 Feb 2004, Erik Hatcher wrote:
On Feb 27, 2004, at 6:26 PM, Stephane James Vaucher wrote
On Fri, 13 Feb 2004 [EMAIL PROTECTED] wrote:
Hi!
Can Lucene incrementally update its index (i.e. balancing with a list of docs
and removing those that are no longer found)?
Incremental updates (additions and deletions) are possible, but I'm not
sure if I understand your question. Lucene
You should probably take a look at the javadoc:
http://jakarta.apache.org/lucene/docs/api/index.html
As for where to store the index, you'll want to put it somewhere where all
potential users can access it, as well as where there is enough space for
your index. In a nutshell, you need to think