Erik Hatcher [EMAIL PROTECTED] wrote:
How proficient must I be in a language for which I wish to write the
stemmer?
I would venture to say you would need to be an expert in a language to
write a decent stemmer.
I'm sorry for a self-promo ;), but
the stemmer of egothor project can
Zilverline [EMAIL PROTECTED] wrote:
get more out of lucene, such as incremental indexing, to name one. On
Hello,
as far as I know, the incremental indexing
could be a real bottleneck if you implemented
your system without some knowledge
about Lucene internals.
The respective
Could an admin filter out hema's e-mails, please?
THX
Leo
[EMAIL PROTECTED] wrote:
Received your mail we will get back to you shortly
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL
Otis Gospodnetic wrote:
Thus I do not know how it could be O(1).
~ O(1) is what I have observed through experiments with indexing of
several million documents.
What exactly did you measure? Just the time of the insert operation
(incl. merge(), of course)? Was it a test on real
Otis Gospodnetic wrote:
--- Leo Galambos [EMAIL PROTECTED] wrote:
Otis Gospodnetic wrote:
Thus I do not know how it could be O(1).
~ O(1) is what I have observed through experiments with indexing of
several million documents.
What exactly did you measure
Have you tried a special add-on for pgsql -
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
Lucene is faster than tsearch (I hope so), but tsearch need not be
synchronized with the main DB...up to you.
Cheers,
Leo
Ankur Goel wrote:
Hi,
I have to search the documents which are stored
Colin McGuigan wrote:
It creates an index, but when I search using
http://localhost:8000/luceneweb/
The page works but I do not get any replies.
Can it read your index? See indexLocation in configuration.jsp
1. How do you specify which directory is to be searched
snip
I agree with Erik,
You can try Capek (needs JDK1.4, because it uses NIO). It can crawl
whatever you like.
API:
http://www.egothor.org/api/robot/
Console - demo (*.dundee.ac.uk):
http://www.egothor.org/egothor/index.jsp?q=http%3A%2F%2Fwww.compbio.dundee.ac.uk%2F
Leo
Zhou, Oliver wrote:
I think it is common task
Really? And what model is used/implemented by Lucene?
THX
Leo
Otis Gospodnetic wrote:
Lucene does not implement vector space model.
Otis
--- [EMAIL PROTECTED] wrote:
Hi,
does Lucene implement a Vector Space Model? If yes, does anybody have
an
example of how using it?
Cheers,
Ralf
The model implies the quality, thus it does matter.
ad several important models) Are any of them implemented in Lucene?
Chong, Herb wrote:
does it matter? vector space is only one of several important ones.
Herb
-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED]
Sent
Marcel Stör wrote:
Hi
As everybody seems to be so excited about it, would someone please be so kind as to explain
what document based clustering is?
Hi
they are trying to implement what you can see in the right panel here:
http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein
They may also
Doug Cutting wrote:
Erik Hatcher wrote:
Yes, you're right. Getting the scores of a second query based on the
scores of the first query is probably not trivial, but probably
possible with Lucene. And that combined with a QueryFilter would do
the trick I suspect. Somehow the scores of the
Doug Cutting wrote:
I have some extensions to Lucene that I've not yet commited which make
it possible to easily define synthetic IndexReaders (not currently
supported). So you could do things that way, once I check these in.
But is this really better than just ANDing the clauses together?
Erik Hatcher wrote:
On Friday, September 5, 2003, at 07:45 PM, Leo Galambos wrote:
And for the second time today QueryFilter. It allows narrowing
the documents queried to only the documents from a previous Query.
I guess, it would not be an ideal solution - the first query does two
But Drill Down searching is very desirable. It's where you're able to
search
within the results of a previous search. I'm assuming that I'll have to
implement that myself, by keeping a copy of the previous Hits list,
and only
returning results that are in both lists.
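
The "keep the previous Hits list and return only what is in both" idea quoted above can be sketched in a few lines of Python. This is only an illustration on plain lists of (doc_id, score) pairs; the function name `drill_down` is hypothetical and nothing here uses Lucene's own classes.

```python
def drill_down(previous_hits, current_hits):
    """Restrict current_hits to documents that also appeared in previous_hits.

    Both arguments are lists of (doc_id, score) pairs. The order and the
    scores of current_hits are preserved.
    """
    # A set gives constant-time membership checks for the earlier result.
    previous_ids = {doc_id for doc_id, _ in previous_hits}
    return [(doc_id, score) for doc_id, score in current_hits
            if doc_id in previous_ids]


if __name__ == "__main__":
    first = [(1, 0.9), (2, 0.5), (3, 0.1)]
    second = [(3, 0.8), (4, 0.7), (1, 0.2)]
    print(drill_down(first, second))
```

Keeping the scores of the second query, rather than combining them with the first, is a design choice made here for simplicity; the thread itself leaves open how the two scores should interact.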
And for the second time
Isn't it better for Dan to skip the optimization phase before merging? I
am not sure, but he could save some time on this (if he has enough file
handles for that, of course). What strategy do you use in nutch?
THX
-g-
Doug Cutting wrote:
As the index grows, disk i/o becomes the bottleneck.
If I understand the Enigma code well, they say, that you must write a
crawler ;-)
-g-
To index the content of JSPs that a user would see using a Web browser,
you would need to write an application that acts as a Web client, in
order to mimic the Web browser behaviour. Once you have such an
Otis Gospodnetic wrote:
What interface do you need for Lucene? Will you use PUSH (=the robot
will modify Lucene's index) or PULL (=the engine will get deltas from
the robot) mode? Tell me what you need and I will try to do all my
best.
I'd imagine one would want to use it in the PUSH mode
? Where is it hosted?
It would be nice to see a few alternative implementations of a robust
and scalable java web crawler with the ability to index whatever it
fetches.
Thanks,
Otis
--- Leo Galambos [EMAIL PROTECTED] wrote:
Hi.
I would like to write $SUBJ (HCDC), because LARM does not offer many
I see. Are you looking for this:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html
On the other hand, if n is not fixed, you still have a problem. As far
as I read this list it seems, that Lucene reads a dictionary (of terms)
into memory, and it also allocates
know if it ever left the lab and made it into the mainstream. If I have time I will explore this a bit.
Frank Burough
-Original Message-
From: Leo Galambos [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 05, 2003 5:55 PM
To: Lucene Users List
Subject: Re: String similarity search vs
Ulrich Mayring wrote:
Hello,
does anyone know of good stopword lists for use with Lucene? I'm
interested in English and German lists.
What does ``good'' mean? It depends on your corpus IMHO. The best way
to get a ``good'' stop-list is an analysis that's based on
idf. Thus, index
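
As a rough illustration of the idf-based analysis suggested above, the following Python sketch picks stop-word candidates from a small corpus. It assumes the textbook definition idf = log(N / df) and naive whitespace tokenization; the function name `idf_stop_list` and the threshold parameter `max_idf` are hypothetical, and Lucene's own scoring may define idf differently.

```python
import math


def idf_stop_list(documents, max_idf):
    """Return terms whose idf is at or below max_idf, sorted alphabetically.

    Assumes idf = log(N / df), where N is the number of documents and df is
    the number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document.
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # A low idf means the term occurs in most documents: a stop-word candidate.
    return sorted(term for term, df in doc_freq.items()
                  if math.log(n_docs / df) <= max_idf)


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog ran", "the bird flew", "a cat and the dog"]
    print(idf_stop_list(corpus, 0.1))
```

The right threshold depends on the corpus, which is the point made in the reply above.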
I'm sorry, I did not read the complete thread. Do you mean - analyzer ==
stemmer? Does it really work? If I were a stemmer, I would leave searche
intact. ;-)
-g-
[EMAIL PROTECTED] wrote:
Hi Les,
We ended up modifying the QueryParser to pass prefix and suffix queries
through the Analyzer. For
Leo Galambos [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED
.
Thanks,
Dario
- Original Message -
From: Leo Galambos [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, May 30, 2003 4:25 PM
Subject: Re: Search for similar terms
You need DASG+Lev over the dictionary. The boundary could be the highest
idf of the terms. It was solved
Adding a new document does not immediately modify an index, so the time
it takes to add a new document to an existing index is not proportional
to the index size. It is constant. The execution time of optimize()
is proportional to the index size, so you want to do that only if you
really
1. 2 threads per request may improve speed up to 50%
Hmm? Could you clarify? During indexing, multithreading may speed things
up (splitting docs to index in 2 or more sets, indexing separately, combining
indexing). But... isn't that a good thing? Or are you saying that it'd be good
to have
If I understand you correctly, then maybe you are not aware of
RemoteSearchable in Lucene.
That class cannot be used in Merger. RemoteSearchable is a class that
allows you to pass a query to another node, nothing less and nothing more
AFAIK.
This is the point that's more clear to me now.
On Tue, 4 Mar 2003, Otis Gospodnetic wrote:
Even if you could replace C:\. with http:// it wouldn't be a
good solution, as directory structures and file paths do not always map
directly to URLs.
Yes, but it is not the case of Samuel's configuration and 99.99% of
others.
The fact is,
org.apache.lucene.demo.IndexHTML which was provided with the
documentation. Is there any problem using this demo class for a web
production site? I'm an application developer and it would be hard to
understand the whole lucene code to use it. It would be almost impossible
You can use it, but: if
Hi,
I was away and when I read what I missed, well...ehm... have you read
http://sustainability.open.ac.uk/gary/papers/netique.htm?
i.e., see Caution when quoting other messages while replying to them.
BTW: I would also vote for a strict standard, when Re: prefix must be
used in replies.
Just
On Sat, 1 Feb 2003, Rishabh Bajpai wrote:
also, i remember reading somewhere that one had to build the index in
some special way, but since you say no; i will take that. i anyways dont
remember where I read it, so no point asking about something if I am
myself not sure
I remember only one
Hi.
In this phrase word 'and' occurs which is a stop-word.
they may take AND as a keyword in a query. IMHO your query is taken as
boolean query.
I hope this helps.
-g-
On Fri, 20 Dec 2002, Doug Cutting wrote:
The max a reader will keep open is:
mergeFactor * log_base_mergeFactor(N) * files_per_segment
A writer will open:
(1 + mergeFactor) * files_per_segment
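
To get a feel for the two bounds quoted above, here is a minimal Python sketch that simply evaluates them. The function names are hypothetical, and it assumes that N is the number of indexed documents and that the logarithm is taken to base mergeFactor, as the formula's naming suggests. It does not call Lucene.

```python
import math


def reader_max_open_files(merge_factor, n, files_per_segment):
    """Reader bound: mergeFactor * log_mergeFactor(N) * files_per_segment."""
    # Assumption: n is the number of indexed documents, with n >= 1.
    return merge_factor * math.log(n, merge_factor) * files_per_segment


def writer_open_files(merge_factor, files_per_segment):
    """Writer count: (1 + mergeFactor) * files_per_segment."""
    return (1 + merge_factor) * files_per_segment


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the thread.
    print(reader_max_open_files(10, 1_000_000, 7))
    print(writer_open_files(10, 7))
```

The reader bound grows only logarithmically with N, while both figures grow linearly with mergeFactor and with the number of files per segment.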
I am not sure if you must open all files (i.e. writer would need just
2*f_p_s if you
So, I have tried this with Lucene:
1) original JavaCC LL(k) HTML parser
2) SWING's HTML parser
In case of (1) I could process about 300K of HTML documents. In case of
(2) more than 400K.
But I cannot process the complete collection (5M) and finish my hard stress
tests of Lucene.
Is there anyone
I'm not sure this is a solution to your problem. However, it seems that the
HTMLParser used by the IndexHTML class has problems parsing the document
(there is a test class included in the jar):
java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
On Thu, 5 Dec 2002, Armbrust, Daniel C. wrote:
I'm using the class that Otis wrote (see message from about 3 weeks ago)
for testing the scalability of lucene (more results on that later) and I
May I ask you where one can get the source code? I cannot find it in
archive. Thank you
-g-
--
.ms.mff.cuni.cz/draw.png
Absolute values
If someone is able to say how often I would call optimize(), I can
recalculate the results. Now the 2nd round of tests is running (without
optimize()).
-g-
BTW: All figures, (C) 2002 Leo Galambos. Do not copy until I am sure that
the testsvalues
How does it affect overall performance, when I do not call optimize()?
THX
-g-
2002, Otis Gospodnetic wrote:
This was just mentioned a few days ago. Check the archives.
Not needed for indexing, good to do after you are done indexing, as the
index reader needs to open and search through less files.
Otis
--- Leo Galambos [EMAIL PROTECTED] wrote:
How does it affect