Re: Writing a stemmer

2004-06-03 Thread Leo Galambos
Erik Hatcher <[EMAIL PROTECTED]> wrote:

>> How proficient must I be in a language for which I wish to write the 
>> stemmer?
>I would venture to say you would need to be an expert in a language to 
>write a decent stemmer.

I'm sorry for the self-promotion ;), but
the stemmer of the egothor project can be
adapted to any language, and you need not be
a language expert. Moreover, the stemmer
achieves a better F-measure than Porter's stemmers.

Cheers,
Leo






Re: Tool for analyzing analyzers

2004-06-02 Thread Leo Galambos


Zilverline <[EMAIL PROTECTED]> wrote:

>get more out of  lucene, such as incremental indexing, to name one. On 

Hello,

as far as I know, incremental indexing
can become a real bottleneck if you implement
your system without some knowledge
of Lucene internals.

The respective test is here:
http://www.egothor.org/twiki/bin/view/Know/LuceneIssue

Cheers,
Leo






Re: thanks for your mail

2004-02-16 Thread Leo Galambos
Could an admin filter out hema's e-mails, please?

THX
Leo
[EMAIL PROTECTED] wrote:

> Received your mail we will get back to you shortly



Re: Index advice...

2004-02-10 Thread Leo Galambos
Otis Gospodnetic wrote:

> --- Leo Galambos <[EMAIL PROTECTED]> wrote:
>> Otis Gospodnetic wrote:
>>>> Thus I do not know how it could be O(1).
>>> ~ O(1) is what I have observed through experiments with indexing of
>>> several million documents.
>> What exactly did you measure? Just the time of the insert operation
>> (incl. merge(), of course)? Was it a test on real documents?
>
> I didn't really measure anything; I only observed this, as my focus was
> something else, not performance measurements.
> It is true that every time an insert/add triggers a merge operation,
> things will slow down, but from what I recall (and this was about a
> year ago), the overall performance was steady as the index grew.

Try the same test with mergeFactor=2; you will see the difference.
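
For illustration, a minimal sketch of such a test, assuming the Lucene
1.x API where mergeFactor is a public field on IndexWriter; the path and
document contents are made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class MergeFactorTest {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/tmp/mf-test", new StandardAnalyzer(), true);
        writer.mergeFactor = 2;   // default is 10; 2 merges far more aggressively
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            Document doc = new Document();
            doc.add(Field.Text("body", "sample text for document " + i));
            writer.addDocument(doc);
            if (i % 10000 == 0)   // watch whether the per-document cost grows
                System.out.println(i + " docs: " + (System.currentTimeMillis() - start) + " ms");
        }
        writer.close();
    }
}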

Leo



Re: Index advice...

2004-02-10 Thread Leo Galambos
Otis Gospodnetic wrote:

>> Thus I do not know how it could be O(1).
> ~ O(1) is what I have observed through experiments with indexing of
> several million documents.

What exactly did you measure? Just the time of the insert operation
(incl. merge(), of course)? Was it a test on real documents?

THX
Leo


Re: Index advice...

2004-02-10 Thread Leo Galambos
Otis Gospodnetic wrote:

> Without seeing more information/code, I can't tell which part of your
> system slows down with time, but I can tell you that Lucene's 'add'
> does not slow over time (i.e. as the index gets larger).  Therefore, I
> would look elsewhere for causes of the slowdown.

Otis, can you point me to some proof that the time of the "insert"
operation does not depend on the index size, please? The amortized time
of "insert" is O(log(docsIndexed/mergeFac)), I think. Thus I do not know
how it could be O(1).

AFAIK the issue with PDF files can be caused by the PDF parser (I have
already encountered this with PDFBox).

Thank you.
Leo

> The easiest thing to do is add logging to suspicious portions of the
> code.  That will narrow the scope of the code you need to analyze.
> Otis

> --- [EMAIL PROTECTED] wrote:
>
> Hey Lucene-users,
>
> I'm setting up a Lucene index on 5G of PDF files (full-text search).
> I've been really happy with Lucene so far, but I'm curious what tips and
> strategies I can use to optimize my performance at this large size.
>
> So far I am using pretty much all of the defaults (I'm new to Lucene).
> I am using PDFBox to add the documents to the index.
> I can usually add about 800 or so PDF files and then the add loop:
>
> for (int i = 0; i < fileNames.length; i++) {
>     // index one file and add the resulting document to the index
>     Document doc = IndexFile.index(baseDirectory + documentRoot + fileNames[i]);
>     writer.addDocument(doc);
> }
>
> really starts to slow down.  Doesn't seem to be memory related.
> Thoughts anyone?
> Thanks in advance,
> CK Hill




Re: Lucene with Postgres db

2004-02-01 Thread Leo Galambos
Have you tried a special add-on for pgsql -
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

Lucene is faster than tsearch (I hope so), but tsearch need not be
synchronized with the main DB... up to you.

Cheers,
Leo
Ankur Goel wrote:

Hi,

I have to search the documents which are stored in postgres db. 

Can someone give a clue how to go about it?

Thanks

Ankur Goel
Brickred Technologies
B-2 IInd Floor, Sector-31
Noida,India
P:+91-1202456361
C:+91-9810161323
E:[EMAIL PROTECTED]
http://www.brickred.com




Re: IndexHTML example on Jakarta Site

2004-01-02 Thread Leo Galambos
Colin McGuigan wrote:

> It creates an index, but when I search using
> http://localhost:8000/luceneweb/
> the page works but I do not get any replies.

Can it read your index? See indexLocation in configuration.jsp.

> 1. How do you specify which directory is to be searched?

I agree with Erik that you would rather use an application which is
ready for use in a minute. IMHO Lucene is a library/API, and unless you
are a Java developer it does not fit your needs. Some applications are
listed here:
http://dmoz.org/Computers/Programming/Languages/Java/Server-Side/Search_Engines/
Omit the Lucene link, else you will be in an endless loop... ;-)

If you must use Lucene, try to find something for you here:
http://jakarta.apache.org/lucene/docs/powered.html
You may be interested in i2a, but their demo (@24.9.177.111) is dead 
right now.

Cheers,
Leo


Re: What about Spindle

2003-12-03 Thread Leo Galambos
You can try Capek (needs JDK1.4, because it uses NIO). It can crawl 
whatever you like.

API:
http://www.egothor.org/api/robot/
Console - demo (*.dundee.ac.uk):
http://www.egothor.org/egothor/index.jsp?q=http%3A%2F%2Fwww.compbio.dundee.ac.uk%2F
Leo

Zhou, Oliver wrote:

> I think it is a common task to index a JSP-based web site.  A lot of
> people ask how to do so on this mailing list.  However, Lucene does not
> have a ready-to-use web crawler.  My question is: has anybody used
> Spindle to index a JSP-based web site, or are there any other tools out
> there?
> Thanks,
> Oliver
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, December 03, 2003 11:25 AM
> To: Lucene Users List
> Subject: Re: What about Spindle
>
> You should ask the Spindle author(s).  The error doesn't look like
> something that is related to Lucene, really.
> Otis
>
> --- "Zhou, Oliver" <[EMAIL PROTECTED]> wrote:
>> What about Spindle? Has anybody used it to crawl a JSP-based web site?
>> Do I need to install listlib.jar to do so?
>>
>> I got the error message "Jsp Translate:Unable to find setter method for
>> attribue:class" when I tried to run listlib-example.jsp in wsad.
>> Thanks,
>> Oliver






Re: Vector Space Model in Lucene?

2003-11-14 Thread Leo Galambos
The model determines the quality of the results, thus it does matter.

Regarding the "several important models": are any of them implemented in
Lucene?

Chong, Herb wrote:

> does it matter? vector space is only one of several important ones.
>
> Herb
>
> -----Original Message-----
> From: Leo Galambos [mailto:[EMAIL PROTECTED]]
> Sent: Friday, November 14, 2003 4:00 AM
> To: Lucene Users List
> Subject: Re: Vector Space Model in Lucene?
>
> Really? And what model is used/implemented by Lucene?
>
> THX
> Leo


Re: Vector Space Model in Lucene?

2003-11-14 Thread Leo Galambos
Really? And what model is used/implemented by Lucene?

THX
Leo
Otis Gospodnetic wrote:

> Lucene does not implement the vector space model.
>
> Otis
>
> --- [EMAIL PROTECTED] wrote:
>> Hi,
>>
>> does Lucene implement a Vector Space Model? If yes, does anybody have
>> an example of how to use it?
>> Cheers,
>> Ralf


Re: Document Clustering

2003-11-11 Thread Leo Galambos
Marcel Stör wrote:

> Hi
>
> As everybody seems to be so excited about it, would someone please be so
> kind as to explain what "document based clustering" is?

Hi

they are trying to implement what you can see in the right panel here:
http://www.egothor.dundee.ac.uk/egothor/q2c.jsp?q=protein
They may also analyze identical pages (hits #9 and #10) - this could
also be taken as "clustering" AFAIK.

For instance, Doug wrote some papers about clustering (if I remember
correctly) - see his bibliography.

Leo



Re: derive tokens from single token

2003-09-29 Thread Leo Galambos
What about bi/tri-grams + some sort of hit filtering? It would do the
job. I just saw a rather inefficient implementation of 1-grams for CJK
on [EMAIL PROTECTED] It could be a good starting point for full n-gram
support... Just a thought.
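
To make the idea concrete, a tiny sketch in plain Java (the class name
is made up) of splitting a token into character n-grams, which would
then be indexed as ordinary terms:

import java.util.ArrayList;
import java.util.List;

public class NGramSplitter {
    // Character n-grams of a token, e.g. ngrams("lucene", 2) ->
    // [lu, uc, ce, en, ne]; each gram becomes a separate term.
    public static List ngrams(String token, int n) {
        List grams = new ArrayList();
        for (int i = 0; i + n <= token.length(); i++)
            grams.add(token.substring(i, i + n));
        return grams;
    }
}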

Leo

Erik Hatcher wrote:

> On Monday, September 29, 2003, at 11:01 AM, Hackl, Rene wrote:
>
>>> except that you'll be indexing a ton of terms I'd guess.  If there is
>>> some other way to split these words by separating by prefix ("hexa",
>>> "hepta") and suffix ("alene", "alin") it would likely be better.  But
>>> maybe it's not practical to do so.
>>
>> There'll be at least two indexes, one "normal" one and another bloated
>> one. Dan suggested splitting, too, but, unfortunately, if users search
>> for e.g.
>>
>> "9-Oxabicyclo[3.3.1]nona-2,6-diene"
>>
>> they don't want anything else than that substance, as opposed to
>>
>> "*-Oxabicyclo[3.3.1]nona*"
>>
>> where they'd be interested in substances from that family - whatever
>> the numbers are.
>
> But consider the same type of thing like a phrase query.  If two
> documents are indexed with a field containing "a b c" and "x b y", and
> a search for "b" is done, both documents are returned.  If a search
> for "a b" is done, then only the first document is returned.  So I
> think with a domain-aware analyzer, you might be able to split things
> up into separate terms, and then on the querying side the same type of
> analysis would be done.  Certainly not a trivial thing, and maybe not
> even the right approach, but it seems that intelligent analysis can
> make things a lot easier on users and on search performance.  Maybe?
> Just food for thought.
>
>> If you're interested, once I've some hard performance results at hand,
>> I could post them around.
>
> Definitely interested!
>
> Erik



Re: Lucene features

2003-09-11 Thread Leo Galambos
Doug Cutting wrote:

> I have some extensions to Lucene that I've not yet committed which make
> it possible to easily define synthetic IndexReaders (not currently
> supported).  So you could do things that way, once I check these in.
> But is this really better than just ANDing the clauses together?  It
> would take some big experiments to know, but my guess is that it
> doesn't make much difference to compute a "local" IDF for such things.


In this case, I think that the operator would be evaluated as "an
implication" and not as "AND" (= 1 - (((1-q1)^p + (1-q2)^p)/2)^(1/p)).
Obviously, you have to use a filter to filter out false hits (in the
case of q1->q2, the formula is true when q1 is false, so it is not what
you really need), but that is not an issue with the auxiliary index. On
the other hand, it is a feeling and it needs a test, you are right.
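
For reference, a tiny sketch of the extended Boolean (p-norm) operators
behind the formula above; q1 and q2 are assumed to be per-document
scores in [0,1], and wiring this into Lucene's scoring is left out:

public class PNorm {
    // Extended Boolean (p-norm) OR of two scores in [0,1]:
    //   or(a, b) = 1 - (((1-a)^p + (1-b)^p) / 2)^(1/p)
    public static double or(double a, double b, double p) {
        double x = (Math.pow(1.0 - a, p) + Math.pow(1.0 - b, p)) / 2.0;
        return 1.0 - Math.pow(x, 1.0 / p);
    }

    // NOT in the extended Boolean model is simply 1 - q, so an
    // implication q1 -> q2 == NOT q1 OR q2 can be built from the two.
    public static double not(double q) {
        return 1.0 - q;
    }
}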

Leo





Re: Lucene features

2003-09-11 Thread Leo Galambos
Doug Cutting wrote:

> Erik Hatcher wrote:
>
>> Yes, you're right.  Getting the scores of a second query based on the
>> scores of the first query is probably not trivial, but probably
>> possible with Lucene.  And that combined with a QueryFilter would do
>> the trick I suspect.  Somehow the scores of the first query could be
>> remembered and used as a boost (or other type of factor) for the
>> scores of the second query.
>
> Why not just AND together the first and second query?  That way
> they're both incorporated in the ranking.  Filters are good when you
> don't want it to affect the ranking, and also when the first query is
> a criterion that you'll reuse for many queries (e.g.,
> language=french), since the bit vectors can be cached (as by
> QueryFilter).


You probably missed the start of our discussion - we are talking about
this: "q1 -> q2", which means "NOT q1 OR q2", versus "q2 -> q1", which
means "q1 OR NOT q2". That causes the issue, and it also shows why you
cannot use a simple "AND", because "q1 AND q2" != "NOT q1 OR q2" !=
"q1 OR NOT q2".

Leo

BTW: I haven't seen these logic formulas for many years, so this is
without any guarantee ;-)





Re: Lucene features

2003-09-07 Thread Leo Galambos
Erik Hatcher wrote:

> On Friday, September 5, 2003, at 07:45 PM, Leo Galambos wrote:
>
>>> And for the second time today QueryFilter.  It allows narrowing
>>> the documents queried to only the documents from a previous Query.
>>
>> I guess it would not be an ideal solution - the first query does two
>> things: a) it selects a subset of the corpus; b) it assigns a
>> relevance to each document of this subset. Your solution omits the
>> second point. It implies that the solution will not return "good hit
>> lists", because you will not consider the information value of the
>> first query which was given to you by a user.
>
> Yes, you're right.  Getting the scores of a second query based on the
> scores of the first query is probably not trivial, but probably
> possible with Lucene.  And that combined with a QueryFilter would do
> the trick I suspect.  Somehow the scores of the first query could be
> remembered and used as a boost (or other type of factor) for the
> scores of the second query.


Well, I do not want to be a pessimist, but the boost vector is not a
good solution due to the CWI statistics. On the other hand, it is much
better than the simple QueryFilter, which, in fact, works as a 0/1
boost.

Example: I use this notation: inverted_list_term:{list of W values, "-"
denotes W=0, for 12 documents in a collection; brackets mark the
selected subset}
A:{23[16]--27}
B:{--[38]}
C:{18[2-]45239812}
If your first query is B, the subset of documents (denoted by the
brackets - namely, the 3rd and 4th docs) is selected, and if your second
query is "A C", then you cannot use global IDFs, because in the subset
the IDF factors are different. Globally, A is the better discriminator,
but in the subset, C is better. This fact is then reflected in the hit
list you generate, and I guess the quality will also be affected by it.

The example shows that you would rather export the subset to an
auxiliary index (RAMDirectory?) and then use this structure instead of
the original index. Obviously, it would also solve the issue of speed
you mentioned.
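
A rough sketch of that export, assuming the Lucene 1.x API and,
crucially, that every field you need was stored in the index (re-adding
a retrieved Document preserves only its stored fields):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class SubsetExport {
    // Re-index the documents matched by the first query into a small
    // RAM-based index, so the second query is ranked with "local" IDFs.
    public static RAMDirectory export(IndexSearcher searcher, Query first)
            throws Exception {
        RAMDirectory aux = new RAMDirectory();
        IndexWriter writer = new IndexWriter(aux, new StandardAnalyzer(), true);
        Hits hits = searcher.search(first);
        for (int i = 0; i < hits.length(); i++)
            writer.addDocument(hits.doc(i));   // stored fields are re-analyzed
        writer.close();
        return aux;
    }
}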

Unfortunately, I am not sure whether you can export the inverted lists
as you read them. In egothor I would use a listener in the Rider class;
in Lucene I would have to rewrite some classes, and that could be a real
problem. Maybe there is a solution I do not see...

Your turn ;-)
Cheers,
Leo
> Am I off base here?
>
>> Thus I think Chris would implement something more complex than
>> QueryFilter. If not, the results will be poorer than with the
>> commercial packages he may get. He could use a different model where
>> "AND" is not an associative operator (i.e. some modification of the
>> extended Boolean model). It implies he would implement it in
>> Similarity.java (if I remember that class name correctly).
>
> Right... but you'd still need the filtering capability as well, I
> would think - at least for performance reasons.
>
> Erik



Re: Lucene features

2003-09-05 Thread Leo Galambos

>> But Drill Down searching is very desirable. It's where you're able to
>> search within the results of a previous search. I'm assuming that I'll
>> have to implement that myself, by keeping a copy of the previous Hits
>> list, and only returning results that are in both lists.
>
> And for the second time today QueryFilter.  It allows narrowing
> the documents queried to only the documents from a previous Query.


I guess it would not be an ideal solution - the first query does two
things: a) it selects a subset of the corpus; b) it assigns a relevance
to each document of this subset. Your solution omits the second point.
It implies that the solution will not return "good hit lists", because
you will not consider the information value of the first query which was
given to you by a user.

For instance, "neologism" > "George Bush" (1st > 2nd query) would return
a different order of hits than "George Bush" > "neologism". Other
examples: "Prague Berlin" > "flight" (I must go there, and I prefer an
airplane) versus "flight" > "Prague Berlin" (I must fly, and I prefer
Berlin).

Thus I think Chris would have to implement something more complex than
QueryFilter. If not, the results will be poorer than with the commercial
packages he may get. He could use a different model where "AND" is not
an associative operator (i.e. some modification of the extended Boolean
model). It implies he would implement it in Similarity.java (if I
remember that class name correctly).

Leo





Re: Fastest batch indexing with 1.3-rc1

2003-08-20 Thread Leo Galambos
Isn't it better for Dan to skip the optimization phase before merging? I
am not sure, but he could save some time this way (if he has enough file
handles for that, of course). What strategy do you use in "nutch"?

THX

-g-

Doug Cutting wrote:

> As the index grows, disk i/o becomes the bottleneck.  The default
> indexing parameters do a pretty good job of optimizing this.  But if
> you have lots of CPUs and lots of disks, you might try building
> several indexes in parallel, each containing a subset of the
> documents, optimize each index and finally merge them all into a
> single index at the end.  But you need lots of i/o capacity for this
> to pay off.
>
> Doug
>
> Dan Quaroni wrote:
>
>> Looks like I spoke too soon... As the index gets larger, time to merge
>> becomes prohibitively high.  It appears to increase linearly.
>> Oh well.  I guess I'll just have to go with about 3ms/doc.



Re: How can I index JSP files?

2003-07-27 Thread Leo Galambos
If I understand the Enigma code well, they say that you must write a
crawler ;-)

-g-

"To index the content of JSPs that a user would see using a Web browser,
you would need to write an application that acts as a Web client, in
order to mimic the Web browser behaviour. Once you have such an
application, you should be able to point it to the desired JSP, retrieve
the contents that the JSP generates, parse it, and feed it to Lucene."
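
A bare-bones sketch of that "web client" idea in plain Java; the URL is
a placeholder, and a real crawler would also follow links, respect
robots.txt, and so on:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class FetchPage {
    // Fetch the HTML that a JSP produces, exactly as a browser would
    // see it; the result can then be parsed and fed to Lucene.
    public static String fetch(String address) throws Exception {
        URL url = new URL(address);
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        StringBuffer page = new StringBuffer();
        String line;
        while ((line = in.readLine()) != null)
            page.append(line).append('\n');
        in.close();
        return page.toString();   // hand this to an HTML parser, then to Lucene
    }
}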






> I am a newbie to Lucene and I would like to add search capability
> to my website, which is written entirely with JSP and servlets.  Does
> anyone have any experience parsing JSP files in order to create an
> index for/by Lucene?  I would greatly appreciate any help with the
> matter.
> Thanx
>
> Russ

 





Re: commercial websites powered by Lucene?

2003-06-25 Thread Leo Galambos


> BUT, looking at the full-text indexing/searching part... it's not up to
> snuff.
>
> Currently, I'm using mysql's full text search support. I have a
> database of 3-5 million rows. Each row is unique, let's say a product.
> Each row has several columns, but the two I search on are title and
> description. I created a full text index on title and description.
> Title has approximately 100 characters, and description has 255
> characters.

Store the two columns in an extra table; it would help you.

> At the moment, mysql is taking 50 seconds plus to return results on
> simple one word searches. My dedicated server is a P4, 2.0 Gigahertz,
> 1.5 Gig RAM, RedHat Linux 7.3 platform, with nothing else running on
> it, i.e. another server is handling HTTP requests. It is a dedicated
> mysql box.  In addition, I'm the only person making queries.

Did you report this to the mysql team?

-g-





Re: High Capacity (Distributed) Crawler

2003-06-10 Thread Leo Galambos
Otis Gospodnetic wrote:

>> What interface do you need for Lucene? Will you use PUSH (=the robot
>> will modify Lucene's index) or PULL (=the engine will get deltas from
>> the robot) mode? Tell me what you need and I will try to do all my
>> best.
>
> I'd imagine one would want to use it in the PUSH mode (e.g. the crawler
> fetches a web page and adds it to the searchable index).
> How does PULL mode work?  I've never heard of web crawlers being used
> in the PULL mode.  What exactly does that mean, could you please
> describe it?

It is a long story, so I will assume that everything runs on a single
box - it is the simplest case.
"[x]" will denote points where Lucene may have problems with a fast
implementation, I guess.

Crawler: the crawler stores the meta and body of all documents. If you
want to retrieve a document's meta or body (knowing its URI), it costs
O(1) (2 seeks and 2 read requests in auxiliary data structures). After
this retrieval you also get a direct handle to the meta and body - then
the price of retrieval is still O(1), but with no extra seeks in any
structures. The handle is persistent and is tied to the URI. The meta
and body are updated as soon as the crawler fetches a fresh copy.

Engine: the engine stores the handle for each document. Moreover, it
knows the last (highest) handle which is stored in the main index. So
the trick is this:
1) build up an auxiliary index from new documents. The new documents are
documents whose handle is greater than the last handle known to the
engine, thus you can iterate over them easily - this process can run in
a separate thread.
2) consult the changes. You read the meta which are stored in the index
and test whether they are obsolete (note: you have already got the
handle, so it smokes). If so, you mark the respective document as
"deleted" and its new version (if any) is appended to another index -
the index of changes. The insertion into that index runs in a separate
thread, so the main thread is not blocked. BTW: [x] the documents which
are not modified may still have their ranks (depthrank, pagerank,
frequencyrank, etc.) modified in this round.

[x] The two auxiliary indices are then merged with the main index.
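
To make the flow concrete, a toy sketch of the PULL loop; every
interface here is hypothetical (nothing like it exists in Lucene, it
only restates the steps above):

// Hypothetical crawler interface, only to illustrate the PULL flow.
interface Crawler {
    long lastHandle();            // highest handle the crawler has assigned
    String meta(long handle);     // O(1) retrieval by handle
    String body(long handle);
}

class PullEngine {
    long lastIndexedHandle;       // highest handle in the main index

    void pull(Crawler crawler) {
        // 1) auxiliary index of brand-new documents
        for (long h = lastIndexedHandle + 1; h <= crawler.lastHandle(); h++)
            addToAuxIndex(h, crawler.meta(h), crawler.body(h));
        // 2) re-check already indexed documents: mark obsolete ones as
        //    deleted and append fresh versions to an index of changes;
        // 3) merge both auxiliary indices with the main index.
        lastIndexedHandle = crawler.lastHandle();
    }

    void addToAuxIndex(long handle, String meta, String body) {
        // build a document and add it to the auxiliary index (omitted)
    }
}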

Obviously, the weak point is the test of whether anything has changed.
This can be easily solved with the index dynamization I use. Unlike
Lucene, I order barrels (segments, in your terminology) by their size. I
do not want to describe all the details - I hate long e-mails ;-) - but
the dynamization guarantees that:
a) the query time is never worse than 8x compared with a fully optimized
index (if you buy 8x faster HW, you overcome this easily);
b) the documents which are often modified are stored in the small
barrels of the main index, which means that updating them is fast.

So I process only the small barrels once a day, and the larger ones less
often. If we say that 5M docs are updated daily, PULL mode can handle
this load in a few minutes. Unfortunately, the slowest point is the HTML
parser, which may run for a few hours :-(.

If you want to refresh another 10^10 crap pages once a month, it can be
done too, but that is outside my first assumption above ;-).

-g-



Re: High Capacity (Distributed) Crawler

2003-06-09 Thread Leo Galambos
Hi Otis.

The first beta is done (without NIO). It needs, however, further
testing. Unfortunately, I could not find enough servers that I may hit.

I wanted to commit the robot as a part of egothor (which will use it in
PULL mode), but we have nice weather here, so I lost any motivation to
play with the PC ;-).

What interface do you need for Lucene? Will you use PUSH (=the robot 
will modify Lucene's index) or PULL (=the engine will get deltas from 
the robot) mode? Tell me what you need and I will try to do all my best.

-g-

Otis Gospodnetic wrote:

> Leo,
>
> Have you started this project?  Where is it hosted?
> It would be nice to see a few alternative implementations of a robust
> and scalable Java web crawler with the ability to index whatever it
> fetches.
> Thanks,
> Otis
>
> --- Leo Galambos <[EMAIL PROTECTED]> wrote:
>> Hi.
>>
>> I would like to write $SUBJ (HCDC), because LARM does not offer many
>> options which are required by web/http crawling IMHO. Here is my list:
>>
>> 1. I would like to manage the decision of what will be gathered first
>> - this would be based on pageRank, number of errors, connection speed,
>> etc. etc.
>> 2. pure JAVA solution without any DBMS/JDBC
>> 3. better configuration in case of an error
>> 4. NIO style as it is suggested by the LARM specification
>> 5. egothor's filters for automatic processing of various data formats
>> 6. management of "Expires" HTTP-meta headers, heuristic rules which
>> describe how fast a page can expire (.php often expires faster than
>> .html)
>> 7. reindexing without any data exports from a full-text index
>> 8. an open protocol between the crawler and a full-text engine
>>
>> If anyone wants to join (or just extend the wish list), let me know,
>> please.
>> -g-



Re: String similarity search vs. typcial IR application...

2003-06-06 Thread Leo Galambos
I see. Are you looking for this:
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/Similarity.html

On the other hand, if n is not fixed, you still have a problem. As far
as I read this list, it seems that Lucene reads the dictionary (of
terms) into memory, and it also allocates one file handle for each of
the acting terms. It implies you would not break the terms up into
n-grams and, as a result, you would use a slow look-up over the
dictionary. I do not know if I am expressing it correctly, but my
personal feeling is that you would rather write your application from
scratch.

BTW: If you have "nice terms", you could find all their n-gram
occurrences in the dictionary and compute a boost factor for each of the
inverted lists. I.e., "bbc" is a term in a query; for the i-list of
"abba" the factor is 1 (the bigram "bb" is there once), and for the
i-list of "bbb" the factor is 2 ("bb" twice). Then you use the
Similarity class, and it is solved. Nevertheless, if the n-grams are not
nice and the query is long, you will lose a lot of time in the
dictionary look-up phase.
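
A small sketch of both counts in plain Java (no Lucene API; the class
name is made up): the occurrence count described above, plus Dice's
coefficient 2C/(Q+D) from the original question:

import java.util.HashSet;
import java.util.Iterator;

public class NGramScores {
    // Unique character n-grams of a string.
    static HashSet grams(String s, int n) {
        HashSet set = new HashSet();
        for (int i = 0; i + n <= s.length(); i++)
            set.add(s.substring(i, i + n));
        return set;
    }

    // How many times the query's n-grams occur in the term, e.g.
    // overlap("bbc", "abba", 2) == 1 and overlap("bbc", "bbb", 2) == 2.
    public static int overlap(String query, String term, int n) {
        HashSet q = grams(query, n);
        int count = 0;
        for (int i = 0; i + n <= term.length(); i++)
            if (q.contains(term.substring(i, i + n)))
                count++;
        return count;
    }

    // Dice's coefficient 2C/(Q+D) over unique n-grams.
    public static double dice(String query, String doc, int n) {
        HashSet q = grams(query, n);
        HashSet d = grams(doc, n);
        int c = 0;
        for (Iterator it = q.iterator(); it.hasNext(); )
            if (d.contains(it.next()))
                c++;
        return 2.0 * c / (q.size() + d.size());
    }
}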

-g-

PS: I'm sorry for my English, just learning...

Jim Hargrave wrote:

> Probably shouldn't have added that last bit. Our app isn't a DNA
> searcher. But DASG+Lev does look interesting.
>
> Our app is a linguistic application. We want to search for sentences
> which have many n-grams in common and rank them based on the score
> below. Similar to the TELLTALE system (do a Google search for TELLTALE
> + ngrams) - but we are not interested in IR per se - we want to compute
> a score based on pure string similarity. Sentences are docs, n-grams
> are terms.
>
> Jim
>
>> [EMAIL PROTECTED] 06/05/03 03:55PM >>>
>> AFAIK Lucene is not able to look up DNA strings efficiently. You would
>> use DASG+Lev (see my previous post - 05/30/2003 1916CEST).
>>
>> -g-
>>
>> Jim Hargrave wrote:
>>
>>> Our application is a string similarity searcher where the query is an
>>> input string and we want to find all "fuzzy" variants of the input
>>> string in the DB.  The Score is basically Dice's coefficient:
>>> 2C/(Q+D), where C is the number of terms (n-grams) in common, Q is
>>> the number of unique query terms and D is the number of unique
>>> document terms. Our documents will be sentences.
>>>
>>> I know Lucene has a fuzzy search capability - but I assume this would
>>> be very slow since it must search through the entire term list to
>>> find candidates.
>>>
>>> In order to do the calculation I will need to have 'C' - the number
>>> of terms in common between query and document. Is there an API that I
>>> can call to get this info? Any hints on what it will take to modify
>>> Lucene to handle these kinds of queries?





Re: Where to get stopword lists?

2003-06-06 Thread Leo Galambos
Ulrich Mayring wrote:

> Hello,
>
> does anyone know of good stopword lists for use with Lucene? I'm
> interested in English and German lists.

What does ``good'' mean? It depends on your corpus IMHO. The best way to
get a ``good'' stop-list is an analysis based on idf. Thus: index your
documents, list all the terms with a low idf, save them in a file, and
use them in the next indexing round.
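
A minimal sketch of that analysis, assuming the Lucene 1.x API; the 50%
threshold is an arbitrary example (a high document frequency means a low
idf):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class StopWordCandidates {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);
        int threshold = reader.numDocs() / 2;   // appears in >50% of docs
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term t = terms.term();
            if (terms.docFreq() > threshold)    // low-idf term: stop-word candidate
                System.out.println(t.text());
        }
        terms.close();
        reader.close();
    }
}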

Just a thought...

-g-





Re: String similarity search vs. typcial IR application...

2003-06-06 Thread Leo Galambos
AFAIK Lucene is not able to look up DNA strings efficiently. You would
use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:

> Our application is a string similarity searcher where the query is an
> input string and we want to find all "fuzzy" variants of the input
> string in the DB.  The Score is basically Dice's coefficient: 2C/(Q+D),
> where C is the number of terms (n-grams) in common, Q is the number of
> unique query terms and D is the number of unique document terms. Our
> documents will be sentences.
>
> I know Lucene has a fuzzy search capability - but I assume this would
> be very slow since it must search through the entire term list to find
> candidates.
>
> In order to do the calculation I will need to have 'C' - the number of
> terms in common between query and document. Is there an API that I can
> call to get this info? Any hints on what it will take to modify Lucene
> to handle these kinds of queries?





Re: String similarity search vs. typcial IR application...

2003-06-06 Thread Leo Galambos
Exact matches are not ideal for DNA applications, I guess. I am not a
DNA expert, but those guys often need the feature that is termed
``fuzzy''[*] in Lucene. They need Levenshtein's and Hamming's metrics,
and I think that Lucene has many drawbacks which prevent an efficient
implementation. On the other hand, I am very interested in the method
you mentioned. Could you give me a reference, please? Thank you.

-g-

[*] why do you use the label ``fuzzy''? It has nothing to do with fuzzy
logic or fuzzy IR, I guess.

Frank Burough wrote:

> I have seen some interesting work done on storing DNA sequence as a set
> of common patterns with unique sequence between them. If one uses an
> analyzer to break a sequence into its set of patterns and unique
> sequence, then Lucene could be used to search for exact pattern
> matches. I know of only one sequence search tool that was based on this
> approach. I don't know if it ever left the lab and made it into the
> mainstream. If I have time I will explore this a bit.
>
> Frank Burough



 

> -----Original Message-----
> From: Leo Galambos [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, June 05, 2003 5:55 PM
> To: Lucene Users List
> Subject: Re: String similarity search vs. typcial IR application...
>
> AFAIK Lucene is not able to look up DNA strings efficiently. You would
> use DASG+Lev (see my previous post - 05/30/2003 1916CEST).
>
> -g-
>
> Jim Hargrave wrote:
>
>> Our application is a string similarity searcher where the query is an
>> input string and we want to find all "fuzzy" variants of the input
>> string in the DB.  The Score is basically Dice's coefficient:
>> 2C/(Q+D), where C is the number of terms (n-grams) in common, Q is the
>> number of unique query terms and D is the number of unique document
>> terms. Our documents will be sentences.
>>
>> I know Lucene has a fuzzy search capability - but I assume this would
>> be very slow since it must search through the entire term list to find
>> candidates.
>>
>> In order to do the calculation I will need to have 'C' - the number of
>> terms in common between query and document. Is there an API that I can
>> call to get this info? Any hints on what it will take to modify Lucene
>> to handle these kinds of queries?


 







Re: Search for similar terms

2003-05-31 Thread Leo Galambos
http://cs.felk.cvut.cz/psc/members.html
http://cs.felk.cvut.cz/psc/event/1998/p13.html
or contact prof. Melichar for more details:
http://webis.felk.cvut.cz/people/melichar.html
-g-

Dario Dentale wrote:

> Hi,
> can you suggest a link to an overview document of this method?
> I couldn't find one.
> Thanks,
> Dario
>
> ----- Original Message -----
> From: "Leo Galambos" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Friday, May 30, 2003 4:25 PM
> Subject: Re: Search for similar terms
>
>> You need DASG+Lev over the dictionary. The boundary could be the
>> highest idf of the terms. It was solved by prof. Melichar; you can
>> find the construction of the automaton in his papers.
>> -g-
>>
>> Dario Dentale wrote:
>>
>>> Hi,
>>> anybody know which is the best way to implement in Lucene a
>>> functionality (that Google has) like this:
>>>
>>> Search text -> notebok
>>>
>>> Answer -> Did you mean: notebook ?
>>>
>>> Thanks,
>>> Dario



Re: Lowercasing wildcards - why?

2003-05-31 Thread Leo Galambos
Ah, I got it. THX. In the good old days, wildcards were used as a fix
for a missing stemming module. I am not sure you can combine these two
opposite approaches successfully. I see the following drawbacks in your
solution.

Example:
built* (->built) would be changed to build* (no "built", but ->builder,
building, etc.), and precision will drop drastically.

You probably use a stemmer with one important bug (a.k.a. feature) -
overstemming - so here is another example:
political* (->political, politically) is transformed to polic*
(->policer, policy, policies, policement etc.) by the Porter algorithm,
and precision is again affected drastically.
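
For reference, a rough sketch of the workaround DaveB describes (strip
the wildcard, analyze the bare term, build a PrefixQuery on the result),
assuming a Lucene 1.x analyzer that provides the field-aware
tokenStream(String, Reader) method; the helper names are made up:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;

public class AnalyzedPrefix {
    // Run a single term through the analyzer and return the first
    // token, e.g. "searche" -> "search" with a stemming analyzer.
    static String analyzeTerm(Analyzer analyzer, String field, String text)
            throws Exception {
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        Token token = ts.next();
        ts.close();
        return token == null ? text : token.termText();
    }

    // "searche*" -> analyze "searche" -> PrefixQuery on "search*".
    public static PrefixQuery prefixQuery(Analyzer analyzer, String field,
            String wildcardTerm) throws Exception {
        String base = wildcardTerm.substring(0, wildcardTerm.length() - 1);
        return new PrefixQuery(new Term(field, analyzeTerm(analyzer, field, base)));
    }
}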

-g-

[EMAIL PROTECTED] wrote:

> Your analyzers can optionally incorporate stemming, along with the
> other things that analyzers do (lowercasing, etc...).  The stemming
> algorithms are all different.  This "searcher" example was made up,
> but, there are instances where stemming at index time and not stemming
> wildcard searches will result in lost hits.  Specifically, we
> encountered this situation using the optional Snowball analyzers (which
> work great, by the way).
>
> DaveB
>
> Leo Galambos <[EMAIL PROTECTED]> wrote on 05/30/03 10:26 AM:
>
>> I'm sorry, I did not read the complete thread. Do you mean - analyzer
>> == stemmer? Does it really work? If I were a stemmer, I would leave
>> "searche" intact. ;-)
>> -g-
>>
>> [EMAIL PROTECTED] wrote:
>>
>>> Hi Les,
>>>
>>> We ended up modifying the QueryParser to pass prefix and suffix
>>> queries through the Analyzer.  For us, it was about stemming.  If you
>>> decide to use an analyzer that incorporates stemming, there are cases
>>> where wildcard queries will not return the expected results.
>>> Example:  "searcher" will probably get stemmed to "search".  A search
>>> on "searche*" should hit the term "searcher", but, it won't, all
>>> instances of "searcher" having been stemmed to "search" at index
>>> time.  Our solution was to remove the trailing wildcard and send
>>> "searche" to the analyzer, then tack the wildcard character back on
>>> there and create the PrefixQuery object with the new search string
>>> "search*".
>>>
>>> DaveB
>>>
>>> Leslie Hughes <[EMAIL PROTECTED]> wrote on 05/30/03 01:09 AM:
>>>
>>>> Hi,
>>>>
>>>> I was just wondering what the rationale is behind lowercasing
>>>> wildcard queries produced by QueryParser? It's just that my data is
>>>> all upper case and my analyser doesn't lowercase, so it seems a bit
>>>> odd that I have to call setLowercaseWildcardTerms(false). Couldn't
>>>> QueryParser leave the terms unnormalised or, better still, pass them
>>>> through the analyser?
>>>> I'm sure there's a good reason for it though.
>>>>
>>>> Les


Re: Lowercasing wildcards - why?

2003-05-31 Thread Leo Galambos
I'm sorry, I did not read the complete thread. Do you mean - analyzer ==
stemmer? Does it really work? If I were a stemmer, I would leave
"searche" intact. ;-)

-g-

[EMAIL PROTECTED] wrote:

> Hi Les,
>
> We ended up modifying the QueryParser to pass prefix and suffix queries
> through the Analyzer.  For us, it was about stemming.  If you decide to
> use an analyzer that incorporates stemming, there are cases where
> wildcard queries will not return the expected results.
> Example:  "searcher" will probably get stemmed to "search".  A search
> on "searche*" should hit the term "searcher", but, it won't, all
> instances of "searcher" having been stemmed to "search" at index time.
> Our solution was to remove the trailing wildcard and send "searche" to
> the analyzer, then tack the wildcard character back on there and create
> the PrefixQuery object with the new search string "search*".
>
> DaveB




> Leslie Hughes <[EMAIL PROTECTED]> wrote on 05/30/03 01:09 AM:
>
>> Hi,
>>
>> I was just wondering what the rationale is behind lowercasing wildcard
>> queries produced by QueryParser? It's just that my data is all upper
>> case and my analyser doesn't lowercase, so it seems a bit odd that I
>> have to call setLowercaseWildcardTerms(false). Couldn't QueryParser
>> leave the terms unnormalised or, better still, pass them through the
>> analyser?
>> I'm sure there's a good reason for it though.
>>
>> Les





Re: Search for similar terms

2003-05-31 Thread Leo Galambos
You need DASG+Lev over the dictionary. The boundary could be the highest
idf of the terms. It was solved by prof. Melichar; you can find the
construction of the automaton in his papers.
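
Without the automaton, a brute-force sketch of the same ``did you mean''
idea, assuming the Lucene 1.x TermEnum API; the DASG makes the
dictionary scan far cheaper, this only shows the intent:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

public class DidYouMean {
    // Plain dynamic-programming Levenshtein distance.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Scan the dictionary for the closest term ("notebok" -> "notebook").
    public static String suggest(IndexReader reader, String word) throws Exception {
        String best = null;
        int bestDist = 3;                 // ignore anything further than 2 edits
        TermEnum terms = reader.terms();
        while (terms.next()) {
            int dist = distance(word, terms.term().text());
            if (dist < bestDist) { bestDist = dist; best = terms.term().text(); }
        }
        terms.close();
        return best;
    }
}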

-g-

Dario Dentale wrote:

> Hi,
> anybody know which is the best way to implement in Lucene a
> functionality (that Google has) like this:
>
> Search text -> notebok
>
> Answer -> Did you mean: notebook ?
>
> Thanks,
> Dario


Re: I: incremental index

2003-03-28 Thread Leo Galambos
> Adding a new document does not immediately modify an index, so the time
> it takes to add a new document to an existing index is not proportional
> to the index size.  It is constant.  The execution time of optimize()
> is proportional to the index size, so you want to do that only if you
> really need it.  The Lucene article on http://www.onjava.com/ from
> March 5th describes this in more detail.

Otis,

I am not sure if anything about constants is constant in non-constant IR
systems :-)

I think that the correct answer is O(t/k*(1+log_m(k))), where t is the
time you need to create and write one monolithic segment of k documents,
m is the merge factor you use, and k is the number of documents which
are already in the index. As you can see, the function grows with k.

Can you explain to me why the addition of one document takes constant
time?

Thank you

-g-






Re: Potential Lucene drawbacks

2003-03-08 Thread Leo Galambos
On Fri, 7 Mar 2003, Andrzej Bialecki wrote:

> In my experience, for creating class diagrams, tools like TogetherJ do
> an acceptable job when used to automatically reverse-engineer existing
> source code. But in the case of sequence diagrams they are just
> pathetic... You'll have a chance to see two of them in the package I
> sent to Otis. :-)

I have not seen your diagrams yet (I missed the URL IMHO), but I think
that collaboration, activity and sequence diagrams would be better. Can
they be produced by the tool you use?

Thank you.

-g-





Re: Potential Lucene drawbacks

2003-03-07 Thread Leo Galambos
> I believe there are tools out there that will analyze Java sources and
> create UML class diagrams from that.  I believe TogetherJ or one of
> those 'all in one' tools can do that.

It is not a good way, because such diagrams contain a lot of
dependencies which are not in the ``original'' diagrams. Moreover, the
tool cannot recognize which objects are important and which objects
should be excluded from the diagrams.

-g-






Re: Potential Lucene drawbacks

2003-03-06 Thread Leo Galambos
>> That class cannot be used in Merger. RemoteSearchable is a class that
>> allows you to pass a query to another node, nothing less and nothing
>> more AFAIK.
>
> What is Merger?  Verb, noun, an IR concept, a name of the product or
> project?  Merging of results from multiple searchers from multiple
> indices?

Ooops. SegmentMerger, the central class in org.apache.lucene.index.

> That is the difference between a simple library and a targeted
> application.

Right. On the other hand, when you want to use the library for such an
application, it must allow you these things.

>> Moreover, I think that Lucene can do much more than you think, Otis
>> :). Egothor can do that, so why not Lucene?
>
> Yes, Lucene can do more than I think it can, why not.
> Maybe this is being done already... with Lucene... ;)

...and that is why I would like to see the object model (UML+notes). In
the model we can find the answer whether Lucene can do more than we
think :). The point where I am lost is Searchable (and its subclasses).
Have you not already written a paper about it?

-g-






Re: Potential Lucene drawbacks

2003-03-06 Thread Leo Galambos
> If I understand you correctly, then maybe you are not aware of
> RemoteSearchable in Lucene.

That class cannot be used in Merger. RemoteSearchable is a class that
allows you to pass a query to another node, nothing less and nothing
more AFAIK.

> This is the point that's more clear to me now.  There is confusion
> about what Lucene is and what it is not.  Lucene does not even try to
> be what those services you mentioned are.  Their goals are different,
> they are a different set of tools.  Lucene's focus is on indexing text
> and searching it.  It is not a tool to query other existing search

I do not think so. It is all about the object model you use. If you are
not able to solve the simplest case, how can you distribute the engine
across the network? I do not mean the simple RMI gateways which marshal
parameters and send them through a network pipe; I mean a true system
that could beat Google (and that is another topic...).

Moreover, I think that Lucene can do much more than you think, Otis :).
Egothor can do that, so why not Lucene?

-g-





Re: Regarding Setup Lucine for my site

2003-03-06 Thread Leo Galambos
>> 1. 2 threads per request may improve speed up to 50%
>
> Hmm? Could you clarify? During indexing, multithreading may speed
> things up (splitting docs to index in 2 or more sets, indexing
> separately, combining indexing). But... isn't that a good thing? Or are
> you saying that it'd be good to have multi-threaded search
> functionality for a single search? (in my experience searching is
> seldom the slow part)

You may improve both indexing and searching. Indexing, because the merge
operation will lock just one thread and a smaller part of the index
while the other threads are still working; searching, because you can
distribute the query to more barrels. In both cases you save up to 50%
of the time (I assume mergeFactor=2).

>> 2. Merger is hard coded
>
> In a way that is bad because... ?
> (ie. what is the specific problem... I assume you mean index merging
> functionality?)

Because you cannot process local and/or remote barrels -- all must be
local in Lucene's object model. That is a serious bug IMHO.

>> 4. you cannot implement dissemination + wrappers for internet servers
>> which would serve as static barrels.
>
> Could you explain this bit more thoroughly (or pointers on longer
> explanation)?

Read more about dissemination, metasearch engines (i.e. SavvySearch),
and dDIRs (i.e. Harvest). BTW, let's go to a pub and we can talk till
morning :) (it is a serious offer, because I would like to know more
about IR).

This example is about metasearch (the simplest case of dDIRs): can
Lucene allow that a barrel (index segment?) is static and a query is
solved via a wrapper that sends the query ${QUERY} to
http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=${QUERY} and then
reads the HTML output as the result?

>> 5. Document metadata cannot be stored as a programmer wants, he must
>> translate the object to a set of fields
>
> Yes? I'd think that the possibility of doing separate fields is a good
> thing; after all, all a plain text search engine needs to provide (to
> be considered one) is indexing of plain text data, right?

I talked about metadata. When the metadata object knows how to achieve
its persistence, why would one translate anything to fields and then
back? Why would you touch the user's metadata at all? You need flat
fields for indexing, and what's around them is not your problem :).
Lucene is something between a CMS and a CIS; you say that it's closer to
a CIS, but when you need metadata in fields, you are closer to a CMS
IMHO.

>> 6. Lucene cannot implement your own dynamization
>
> (sorry, I must sound real thick here).
> Could you elaborate on this... what do you mean by dynamization?

Read more about "Dynamization of Decomposable Searching Problems".

-g-





Re: Regarding Setup Lucine for my site

2003-03-05 Thread Leo Galambos
>> On the other hand, if you extend Lucene with your hacks, you will
>> find out that the model of Lucene is unknown and many parts are
>> hard-coded. It boosts speed, but it disallows future enhancements (I
>> could name the parts, I hope we do not start a flamewar here).
>
> I'm all eyes and I'm a serious grown-up with good manners :)
> Constructive suggestions for improvement are always welcome.

1. 2 threads per request may improve speed up to 50%

2. Merger is hard coded

3. you cannot use different inverted lists in one index (i.e. pagerank
and doc_id instead of doc_id/prox_handle/freq/...), and inverted lists
do not support multilevel skips (see the MoZo papers about this topic)

4. you cannot implement dissemination + wrappers for internet servers
which would serve as static barrels

5. document metadata cannot be stored as a programmer wants; he must
translate the object to a set of fields

6. Lucene cannot implement your own dynamization

etc.

-g-





Re: Regarding Setup Lucine for my site

2003-03-05 Thread Leo Galambos
> org.apache.lucene.demo.IndexHTML which was provided with the
> documentation. Is there any problem using this demo class for a web
> production site? I'm an application developer and it would be hard to
> understand the whole Lucene code to use it. It would be almost
> impossible

You can use it, but: if you need something special (snippets, coloring,
different URL mapping, handling of your local charset, etc. etc.) you
must include code from the sandbox or write it from scratch AFAIK.

> for my develop phase timings to try to do this. * Regarding your
> comment: Lucene does not index web pages. I thought Lucene's main goal
> was to index web pages ¿? and as an afterthought it should be able to
> index text files or some other information (for example mail
> databases). Regards

Lucene *can* index HTML pages, if you use programs which build a Lucene
index from HTML documents. The programs exist.

On the other hand, if you extend Lucene with your hacks, you will find
out that the model of Lucene is unknown and many parts are hard-coded.
It boosts speed, but it disallows future enhancements (I could name the
parts, I hope we do not start a flamewar here).

> and thanks for your comments!!! I'm considering the egothor search
> engine. I successfully set up a web application for searching my web
> site but I didn't see a mailing list or a forum with the level of

I had my PhD exam, and many questions went through ICQ; you know, it is
faster for me than e-mail...

-g-





Re: Regarding Setup Lucine for my site

2003-03-05 Thread Leo Galambos
On Tue, 4 Mar 2003, Otis Gospodnetic wrote:

> Even if you could replace C:\. with http:// it wouldn't be a
> good solution, as directory structures and file paths do not always map
> directly to URLs.

Yes, but it is not the case with Samuel's configuration, nor with 99.99%
of others.

The fact is that Lucene is only a library, plus sandbox utilities which
are of varying quality. :-)

-g-






A thought: netique

2003-02-28 Thread Leo Galambos
Hi,

I was away, and when I read what I missed, well... ehm... have you read
http://sustainability.open.ac.uk/gary/papers/netique.htm?

I.e., see "Caution when quoting other messages while replying to them".

BTW: I would also vote for a strict standard where the "Re:" prefix must
be used in replies.

Just a thought.

-g-






Re: Wildchar based search?? |

2003-02-02 Thread Leo Galambos
On Sat, 1 Feb 2003, Rishabh Bajpai wrote:

> also, I remember reading somewhere that one had to build the index in
> some special way, but since you say no, I will take that. I anyway
> don't remember where I read it, so no point asking about something I am
> myself not sure of

I remember only one problem related to the indexing phase - the
``optimize'' function. If you update your index, one cannot tell you
whether you must also call optimize() or not.

If you do not call it, it may slow down queries (I do not know by how
much, but Otis mentioned it). If you call it, it slows down the indexing
phase (I have tested it, and the effect is significant).

AFAIK Lucene cannot tell you when the index becomes dirty enough that
you must call optimize(). On the other hand, this does not affect small
indexes, where optimize() costs nothing.
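
An illustrative sketch of the trade-off, assuming the Lucene 1.x API:
optimize once per batch, never per document, because optimize() rewrites
the whole index:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class BatchUpdate {
    // Add a batch of documents, then optimize once at the end; calling
    // optimize() per document would dominate the indexing time.
    public static void addBatch(String indexPath, Document[] docs) throws Exception {
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
        for (int i = 0; i < docs.length; i++)
            writer.addDocument(docs[i]);
        writer.optimize();   // costly: proportional to the index size
        writer.close();
    }
}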

Otis, I think that this still holds. Right?

-g-







Re: Stop-word in phrase (BUG?)

2003-01-27 Thread Leo Galambos
Hi.

> In this phrase the word 'and' occurs, which is a stop-word.

The query parser may take AND as a keyword. IMHO your query is parsed as
a boolean query.

I hope this helps.

-g-






Re: Computing Relevancy Differently

2003-01-27 Thread Leo Galambos
> What's next? Seems that I'm getting a message: "Figure it out on your own,
> you dummy."

And from your letter I understood that you want someone to do your 
homework (for nothing). :) Right?

One would say: ask not what others can do for you, ask what you can do for
them.

> >you must understand what you are doing
> 
> ==>which I don't, as I've already stated several times.

It is hard for me to tell anything. AFAIK (friendly speaking) Lucene does
not offer click-click interface...

> >and you must change similarity calculations.
> 
> ==>which means what? Is that part of Lucene?

Doug told it few days ago, I hope it is still in Similarity.java file.

> >AFAIK you would set the normalization factor to a constant value (1.0 or
> so).
> 
> ==>Does this mean not to use boost?

I am not God. The final decision is yours.

> ==>I didn't know I was excluding one for the other.  Do I interpret all this
> to mean Lucene can't be adjusted to do what I was asking?  That it's too
> complicated?

It means Lucene offers much more than you want => you can use a simpler
package that can be configured faster. E.g., UdmSearch uses a simple SQL
query...

-g-



--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Computing Relevancy Differently

2003-01-26 Thread Leo Galambos
> What I'd like to do is get a relevancy-based order in which (a) longer
> documents tend to get more weight than shorter ones, (b) a document body
> with 'X' instances of a query term gets a higher ranking than one with fewer
> than 'X' instances. and (c) a term found in the headline (usually in
> addition to finding the same term in the body) is more highly ranked than
> one with the term only in the body.
> 
> But that's not what happens with the default scoring, and I'd like to change
> that.

I am not a Lucene developer, but:

1) Lucene uses the vector model; if you want to use a different model, you
must understand what you are doing and you must change the similarity
calculations. AFAIK you would set the normalization factor to a constant
value (1.0 or so); see the sketch after this list.

2) you are trying to search for DATA, not INFORMATION. That is a big
difference. For your task, you could instead use a simpler engine
based on an RDBMS and B+ trees.
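
Re (1), a minimal sketch of the constant normalization, assuming a Lucene
where Similarity can be subclassed (DefaultSimilarity, lengthNorm() and
setSimilarity() are names from the newer 1.4-era API, so take this only as
an illustration):

import org.apache.lucene.search.DefaultSimilarity;

public class ConstantNormSimilarity extends DefaultSimilarity {
    // constant norm: documents are no longer normalized by length,
    // so raw term frequency can favor longer documents
    public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;
    }
}

You would install it on both sides, e.g. writer.setSimilarity(new
ConstantNormSimilarity()) before indexing and searcher.setSimilarity(...)
before searching, so that the stored norms and the query-time scoring agree.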

-g-


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




RE: Lucene Benchmarks and Information

2002-12-24 Thread Leo Galambos
On Mon, 23 Dec 2002, Armbrust, Daniel C. wrote:

> >IMHO it is a bug and the
> >point why Lucene does not scale well on huge collections of documents. I
> >am talking about my previous tests when I used live index and concurrent
> >query+insert+delete (I wanted to simulate real application).
> 
> [snip]
> 
> What is your definition of huge?  I have yet to have a problem, and I am

TREC-3 and above, or >20GB of real (i.e. HTML) docs.

> B.  I know the impact on search times of adding more documents

you know it for the optimal case, because your inverted lists may have
identical lengths. That implies linearity between space and query time.

BTW: my note was not against you or your tests. It was a plea for better
Java engine(s). :)

-g-






--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Lucene Benchmarks and Information

2002-12-21 Thread Leo Galambos
On Fri, 20 Dec 2002, Doug Cutting wrote:

> The max a reader will keep open is:
> 
>mergeFactor * log_base_mergeFactor(N) * files_per_segment
> 
> A writer will open:
> 
>(1 + mergeFactor) * files_per_segment

I am not sure you must open all the files (i.e. the writer would need just
2*files_per_segment if you kept A-Z order in DocUIDs??). IMHO this is a bug
and the reason why Lucene does not scale well on huge collections of
documents. I am talking about my previous tests, where I used a live index
and concurrent query+insert+delete (I wanted to simulate a real application).
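
For concreteness, the quoted formulas give numbers like these (a throwaway
sketch; files_per_segment depends on the Lucene version, so 7 below is only
an assumed value):

public class OpenFiles {
    public static void main(String[] args) {
        int mergeFactor = 10;      // Lucene's default
        int filesPerSegment = 7;   // assumed; version-dependent
        long n = 1000000L;         // documents in the index

        // reader: mergeFactor * log_base_mergeFactor(N) * files_per_segment
        double reader = mergeFactor
                * (Math.log((double) n) / Math.log((double) mergeFactor))
                * filesPerSegment;
        // writer: (1 + mergeFactor) * files_per_segment
        int writer = (1 + mergeFactor) * filesPerSegment;

        System.out.println("reader <= " + Math.round(reader)); // 420
        System.out.println("writer <= " + writer);             // 77
    }
}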

BTW, your mail also answers the previous topic, "how often should one
call optimize()". The method should be called before the index goes into
production. It also means that tests are irrelevant unless they are made
with a lower mergeFactor.

...but it is possible that I missed something (I do not know Lucene as
well as you).

-g-


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




HTML saga continues...

2002-12-12 Thread Leo Galambos
So, I have tried this with Lucene:
1) original JavaCC LL(k) HTML parser
2) SWING's HTML parser

With (1) I could process about 300K HTML documents; with (2), more
than 400K.

But I cannot process the complete collection (5M) and finish my hard stress
tests of Lucene.

Does anyone have an HTML parser that really works with Lucene? :) If
you think you have one, please let me know. I wanted to try Neko, but
it looks complicated and I do not want to skew the results with a ``robust''
parser.
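
For reference, (2) is driven roughly as below (a minimal sketch; the class
name and the space-joining of text nodes are mine):

import java.io.IOException;
import java.io.Reader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingTextExtractor {
    // collects the text nodes of an HTML stream into one string
    public static String extract(Reader in) throws IOException {
        final StringBuffer text = new StringBuffer();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };
        new ParserDelegator().parse(in, callback, true); // true = ignore charset spec
        return text.toString();
    }
}

The returned string can then go into the Lucene Document, e.g. via Field.Text().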

THX

-g-


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: SV: Indexing HTML

2002-12-07 Thread Leo Galambos
> I'm not sure this is a solution to your problem. However, it seems that the
> HTMLParser used by the IndexHTML class has problems parsing the document
> (there is a test class included in the jar):
> 
> 
> >java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
> org.apache.lucene.demo.html.Test f01529.txt
> Title: Webcz.cz - Power of search
> Parse Aborted: Encountered "\'" at line 106, column 27.
> Was expecting one of:
>  ...
>  ...
> /Ronnie

Hi Ronnie!

I know about it, and the exception is handled well (see the log below). I
have found a better example than 1529; try this:
http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot get through
the Lucene HTML parser (I have tried 1.2 and IBM JDK 1.3.1r3). The file is
peculiar, i.e. it has two titles, two base tags, etc.

I have no debugger here, so I cannot find the line where the bug is. If
you try your magic, please let me know about the patch. :) THX

-g-



adding save/d00320/f01516.html
Parse Aborted: Lexical error at line 68, column 11.  Encountered: "\u0178" 
(376), after : ""
:
adding save/d00320/f01527.html
Parse Aborted: Encountered "=" at line 83, column 48.
Was expecting one of:
 ...
 ...

adding save/d00320/f01528.html



--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: Lucene Speed under diff JVMs

2002-12-05 Thread Leo Galambos
On Thu, 5 Dec 2002, Armbrust, Daniel C. wrote:

> I'm using the class that Otis wrote (see message from about 3 weeks ago)
> for testing the scalability of lucene (more results on that later) and I

May I ask where one can get the source code? I cannot find it in the
archive. Thank you

-g-



--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Indexing HTML

2002-12-03 Thread Leo Galambos
I tried to use IndexHTML (demo) and Lucene 1.2 for indexing *.CZ, but
Lucene often falls into a never-ending loop. I've analyzed my data, so I know
which file(s) sent Lucene down. I don't see anything special in the
file(s), so I think they can get through the parser to the main Lucene
routines (and then the problem could be in the Merger).

Could you help me, please?

One of the problematic files:
http://com-os2.ms.mff.cuni.cz/bugs/f01529.txt
My program (based on Lucene demo): 
http://com-os2.ms.mff.cuni.cz/bugs/IndexHTML.java

Thank you very much.

-g-


--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Performance (figures)

2002-11-30 Thread Leo Galambos
The first round of tests is presented here (more will come later):

1) http://com-os2.ms.mff.cuni.cz/proof.png

Price per insert (time, space).
Doc base: 5M HTML *.CZ
Collection size: 300K docs were processed; then Lucene crashed (it may be
my fault, but I haven't had time to debug it yet)
Optimize() called after every 2000 docs (IMHO this simulates a dynamic IR
environment, i.e. indexing e-mails, newsgroups, etc.).

For instance (see Fig. 1):
collection size/time per insert()
2000/25ms
160000/33ms
300000/48ms

It means that for a collection of 160000 docs you need 160000*33ms = 5280s.

2) http://com-os2.ms.mff.cuni.cz/draw.png

Absolute values



If someone can say how often I should call optimize(), I can
recalculate the results. The 2nd round of tests is now running (without
optimize()).

-g-

BTW: All figures, (C) 2002 Leo Galambos. Do not copy until I am sure that
the tests & values are correct.


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: optimize()

2002-11-27 Thread Leo Galambos
> Unoptimized index is not a problem for document additions, they take
> constant time, regardless of the size of the index and regardless of
> whether the index is optimized or not.

IMHO it is not true. It would mean that O(log(n/M)) = O(1) (n = number of
documents in the index, M = max number of segments per level). If that were
true, we would be able to sort an array in O(n) instead of O(n log n);
the segment merging is essentially the merge step of merge sort.

> Searches of unoptimized index take longer than searches of an optimized
> index.

Is there any limitation in the Lucene architecture that prevents a
multithreaded algorithm for computing hit lists? I think it would boost
performance. Otis, thank you for your proof that Lucene does not have it now
(you got me :-)). But what about the next releases?

> Then do a search against one, and against the other index, and time it.
> Then let us know which one is faster and by how much.

OK, I will.

I would like to compare Lucene to another engine. The test should be
precise, because I want to use it in an academic paper.

The aim of my question was how to configure Lucene for maximum
performance in the test. It looks pretty hard, because:

- if I do not call optimize(), I can build the index at maximum speed, but
searches are slow, so it is not a configuration for a dynamic environment

- if I call optimize() regularly (as a real application would do), indexing
gets slower and slower as I add more and more documents to the collection

IMHO the second option describes the "real environment", so we get:

loop:
  K-times indexDoc()
  optimize()
end-of-loop

What *K* should I use? 1000, 10000, 100000, or 1000000? Folks, what *K* do
you use in your applications? Thank you.
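
In Lucene code the loop is roughly this (a sketch only; the index path and
the analyzer are placeholders, and *K* is the parameter in question):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class PeriodicOptimize {
    public static void index(Document[] docs, int k) throws IOException {
        IndexWriter writer = new IndexWriter("/tmp/index",
                                             new StandardAnalyzer(), true);
        for (int i = 0; i < docs.length; i++) {
            writer.addDocument(docs[i]);
            if ((i + 1) % k == 0) {
                writer.optimize(); // merges all segments into one
            }
        }
        writer.close();
    }
}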

-g-



--
To unsubscribe, e-mail:   
For additional commands, e-mail: 




Re: optimize()

2002-11-26 Thread Leo Galambos
Hmmm. The question is: what should I measure?

Otis, do you know what implementation is used in Lucene (I am lost in the
hierarchy of readers/writers):

a) single thread for solving query
b) more than one thread for a query

(a) would mean that Lucene could solve queries more than 50% slower
than in case (b). It would also mean that Lucene's index is in its optimal
state when just one segment exists. And it also means that if you remove
half of the documents from a collection, you have to rebuild one big segment
into a smaller one, and so on... It would cost a lot of CPU/HDD time.

So it looks like I should measure the effect of random insert/remove
operations. The problem is how often to call optimize() in the test.

Any thoughts?

-g-

On Tue, 26 Nov 2002, Otis Gospodnetic wrote:

> No tests, just intuition that it's faster to find something in 1 file
> than in 100 of them.  If you do some tests, I'd love to hear the real
> numbers :)
> 
> Otis
> 
> --- Leo Galambos <[EMAIL PROTECTED]> wrote:
> > Did you try any tests in this area? (figures, charts...)
> > 
> > AFAIK reader reads identical number of (giga)bytes. BTW, it could
> > read
> > segments in many threads. I do not see why it would be slower (until
> > you
> > do many delete()-s). If reader opens 1 or 50 files, it is still
> > nothing.
> > 
> > -g-
> > 
> > On Tue, 26 Nov 2002, Otis Gospodnetic wrote:
> > 
> > > This was just mentioned a few days ago. Check the archives.
> > > Not needed for indexing, good to do after you are done indexing, as
> > the
> > > index reader needs to open and search through less files.
> > > 
> > > Otis
> > > 
> > > --- Leo Galambos <[EMAIL PROTECTED]> wrote:
> > > > How does it affect overall performance, when I do not call
> > > > optimize()?
> > > > 
> > > > THX
> > > > 
> > > > -g-
> > > > 
> > > > 
> > > > 
> > > > --
> > > > To unsubscribe, e-mail:  
> > > > <mailto:[EMAIL PROTECTED]>
> > > > For additional commands, e-mail:
> > > > <mailto:[EMAIL PROTECTED]>
> > > > 
> > > 
> > > 
> > > __
> > > Do you Yahoo!?
> > > Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
> > > http://mailplus.yahoo.com
> > > 
> > > --
> > > To unsubscribe, e-mail:  
> > <mailto:[EMAIL PROTECTED]>
> > > For additional commands, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> > > 
> > 
> > 
> > --
> > To unsubscribe, e-mail:  
> > <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> > 
> 
> 
> __
> Do you Yahoo!?
> Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
> http://mailplus.yahoo.com
> 
> --
> To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
> 


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




Re: optimize()

2002-11-26 Thread Leo Galambos
Did you try any tests in this area (figures, charts...)?

AFAIK the reader reads an identical number of (giga)bytes either way. BTW,
it could read the segments in many threads. I do not see why it would be
slower (unless you do many delete()s). Whether the reader opens 1 or 50
files, it is still nothing.

-g-

On Tue, 26 Nov 2002, Otis Gospodnetic wrote:

> This was just mentioned a few days ago. Check the archives.
> Not needed for indexing, good to do after you are done indexing, as the
> index reader needs to open and search through less files.
> 
> Otis
> 
> --- Leo Galambos <[EMAIL PROTECTED]> wrote:
> > How does it affect overall performance, when I do not call
> > optimize()?
> > 
> > THX
> > 
> > -g-
> > 
> > 
> > 
> > --
> > To unsubscribe, e-mail:  
> > <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> > 
> 
> 
> __
> Do you Yahoo!?
> Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
> http://mailplus.yahoo.com
> 
> --
> To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
> 


--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>




optimize()

2002-11-26 Thread Leo Galambos
How does it affect overall performance when I do not call optimize()?

THX

-g-



--
To unsubscribe, e-mail:   
For additional commands, e-mail: