Re: 1.4.x TermInfosWriter.indexInterval not public static ?

2005-02-28 Thread Kevin A. Burton
Doug Cutting wrote:
The default value is probably good for all but folks with very large 
indexes, who may wish to increase the default somewhat.  Also folks 
with smaller indexes and very high query volumes may wish to decrease 
the default.  It's a classic time/memory tradeoff.  Higher values use 
less memory and make searches a bit slower, smaller values use more 
memory and make searches a bit faster.
BTW... can you define "a bit"?
Is "a bit" 5%?  10%?  Benchmarks would be nice, but I'm not that picky.
I just want to see what performance hits/benefits I could see by
tweaking the values.
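
(Rough arithmetic for scale, not a benchmark: the reader holds every
indexInterval-th term in RAM, so an index with, say, 128M unique terms
keeps roughly 128M / 128 = 1M Term objects in memory at the default
interval, and 128M / 256 = 500K at 256. That halves this part of the
footprint, while each lookup scans up to twice as many entries after the
in-memory binary search. The 128M term count is purely illustrative.)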

Kevin


1.4.x TermInfosWriter.indexInterval not public static ?

2005-02-24 Thread Kevin A. Burton
What's the desired pattern of use for TermInfosWriter.indexInterval?
Do I have to compile my own version of Lucene to change this?  The previous
API was public static final, but this is neither public nor static.

I'm wondering if we should just make this a value that can be set at
runtime.  Considering the memory savings for larger installs, this
can/will be important.

Kevin


Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Doug Cutting wrote:
Not without hacking things.  If your 1.3 indexes were generated with 
256 then you can modify your version of Lucene 1.4+ to use 256 instead 
of 128 when reading a Lucene 1.3 format index (SegmentTermEnum.java:54 
today).

Prior to 1.4 this was a constant, hardwired into the index format.  In 
1.4 and later each index segment stores this value as a parameter.  So 
once 1.4 has re-written your index you'll no longer need a modified 
version.
Thanks for the feedback, Doug.

This makes more sense now. I didn't understand why the website
documented the fact that the .tii file was storing the index interval.

I think I'm going to investigate just moving to 1.4 ...  I need to do it 
anyway.  Might as well bite the bullet now.

Kevin


Re: ngramj

2005-02-24 Thread Kevin A. Burton
petite_abeille wrote:
On Feb 24, 2005, at 14:50, Gusenbauer Stefan wrote:
Does anyone know a good tutorial or the javadoc for ngramj? I need it
for guessing the language of the documents which should be
indexed.

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/ 
languageidentifier/
Wow.. interesting! Where'd this come from?
I actually wrote an implementation of NGram language categorization a 
while back. I'll have to check this out. I'm willing to bet mine's 
better though ;)

I was going to put it in Jakarta Commons...
Kevin


Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
I finally had some time to take Doug's advice and reburn our indexes 
with a larger TermInfosWriter.INDEX_INTERVAL value.

It looks like you're using a pre-1.4 version of Lucene.  Since 1.4 
this is no longer called TermInfosWriter.INDEX_INTERVAL, but rather 
TermInfosWriter.indexInterval.
Yes... we're trying to be conservative and haven't migrated yet.  Though 
doing so might be required for this move I think...

Is this setting incompatible with older indexes burned with the lower 
value?

Prior to 1.4, yes.  After 1.4, no.
What happens after 1.4?  Can I take indexes burned with 256 (a greater 
value) in 1.3 and open them up correctly with 1.4?

Kevin
PS.  Once I get this working I'm going to create a wiki page documenting 
this process.

Kevin


Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Kevin A. Burton wrote:
Kevin A. Burton wrote:
I finally had some time to take Doug's advice and reburn our indexes 
with a larger TermInfosWriter.INDEX_INTERVAL value.

You know... it looks like the problem is that TermInfosReader uses 
INDEX_INTERVAL during seeks and is probably just jumping RIGHT past 
the offsets that I need.
I guess I'm thinking out loud here...
Looks like the only thing written to the .tii index for metainfo is the
"size" of the index.  It's an int and is the first int of the stream
(which is reserved).

Now I'm curious if there's any other way I can infer this value...
Kevin


Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Kevin A. Burton wrote:
I finally had some time to take Doug's advice and reburn our indexes 
with a larger TermInfosWriter.INDEX_INTERVAL value.
You know... it looks like the problem is that TermInfosReader uses 
INDEX_INTERVAL during seeks and is probably just jumping RIGHT past the 
offsets that I need.

If this is going to be a practical way of reducing Lucene's memory
footprint for HUGE indexes then it's going to need a way to change this
value based on the current index that's being opened.

Is there any way to determine the INDEX_INTERVAL from the file?  According to:

http://jakarta.apache.org/lucene/docs/fileformats.html

the .tis file (and the docs say the .tii file "is very similar to the
.tis file") should have this data:

TermInfoFile (.tis) --> TIVersion, TermCount, IndexInterval,
SkipInterval, TermInfos

The only problem is that the .tii and .tis files I have on disk don't
have a consistent preamble, and it doesn't look like there's an index
interval here...

Kevin


Re: Lucene vs. in-DB-full-text-searching

2005-02-24 Thread Kevin A. Burton
David Sitsky wrote:
On Sat, 19 Feb 2005 09:31, Otis Gospodnetic wrote:
 

You are right.
Since there are C++ and now C ports of Lucene, it would be interesting
to integrate them directly with DBs, so that the RDBMS full-text search
under the hood is actually powered by one of the Lucene ports.
   

Or to see Lucene + Derby (a 100% Java embedded database donated by IBM,
currently in Apache incubation) integrated together... that would be
really nice and powerful.

Does anyone know if there are any integration plans?
 

Don't forget Berkeley DB Java Edition... that would be interesting too...
Kevin



Re: Lucene vs. in-DB-full-text-searching

2005-02-24 Thread Kevin A. Burton
Otis Gospodnetic wrote:
The most obvious answer is that the full-text indexing features of
RDBMS's are not as good (as fast) as Lucene.  MySQL, PostgreSQL,
Oracle, MS SQL Server etc. all have full-text indexing/searching
features, but I always hear people complaining about the speed.  A
person from a well-known online bookseller told me recently that Lucene
was about 10x faster than MySQL for full-text searching, and I am
currently helping someone get away from MySQL and into Lucene for
performance reasons.
 

Also... MySQL full-text search isn't perfect. If you're not a Java
programmer it would be difficult to hack on. Another downside is that FT
in MySQL only works with MyISAM tables, which aren't transaction aware
and use global table locks (not fun).

I'm sure though that MySQL would do a better job at online index 
maintenance than Lucene. It falls down a bit in this area...

Kevin


Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
I finally had some time to take Doug's advice and reburn our indexes 
with a larger TermInfosWriter.INDEX_INTERVAL value.

The default is 128, but I increased it to 256, burned our indexes again,
and was lucky enough to notice that our memory usage dropped by half.

This introduced a bug, however: when we try to load our pages before and
after, we're missing 99% of the documents from our index.  What happens
is that we have a term -> key mapping so that we can pull out documents
based on essentially a primary key.  The key is just the URL of the
document.  With the default value it works fine, but when I change it to
256 it can't find the majority of the documents.  In fact it's only able
to find one.
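
For reference, that term -> key lookup is essentially a single-term
probe, something like this sketch (the "url" field name and paths are
assumptions; Lucene 1.4 API):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Hypothetical shape of the primary-key lookup described above.
    IndexReader reader = IndexReader.open( "/index/path" );
    TermDocs td = reader.termDocs( new Term( "url", "http://example.com/some-doc" ) );
    if ( td.next() ) {
        Document doc = reader.document( td.doc() ); // the document for this key
    }
    td.close();
    reader.close();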

Is this setting incompatible with older indexes burned with the lower value?
Kevin


Re: Opening up one large index takes 940M of memory?

2005-02-15 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
Is there any way to reduce this footprint?  The index is fully 
optimized... I'm willing to take a performance hit if necessary.  Is 
this documented anywhere?

You can increase TermInfosWriter.indexInterval.  You'll need to 
re-write the .tii file for this to take effect.  The simplest way to 
do this is to use IndexWriter.addIndexes(), adding your index to a 
new, empty, directory.  This will of course take a while for a 60GB 
index...
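
A minimal sketch of the rewrite Doug describes (paths are placeholders;
this assumes the Lucene 1.4 API and is untested):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Add the old index to a new, empty directory so the .tii file is
    // re-written with the new indexInterval.
    Directory src = FSDirectory.getDirectory( "/index/old", false );
    IndexWriter writer = new IndexWriter( FSDirectory.getDirectory( "/index/new", true ),
                                          new StandardAnalyzer(), true );
    writer.addIndexes( new Directory[] { src } );
    writer.close();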

(Note... when this works I'll note my findings in a wiki page for future 
developers)

Two more questions:
1.  Do I have to do this with a NEW directory?  Our nightly index merger
uses an existing "target" index, which I assume will re-use the same
settings as before.  I did this last night and it still seems to use the
same amount of memory.  Above you assert that I should use a new empty
directory, so I'll try that tonight.

2. This isn't destructive, is it?  I mean, I'll be able to move BACK to a
TermInfosWriter.indexInterval of 128, right?

Thanks!
Kevin


DbDirectory and Berkeley DB Java Edition...

2005-02-06 Thread Kevin A. Burton
I'm reading the Lucene in Action book right now, and on page 309 they talk
about using DbDirectory, which uses Berkeley DB for maintaining your index.

Anyone ever consider a port to Berkeley DB Java Edition?
The only downside would be the license (I think it's GPL), but it could
really free up the time it takes to optimize(), I think.  You could just
rehash the hashtable and then insert rows into the new table.

Would be interesting to benchmark I think though.
Thoughts?
http://www.sleepycat.com/products/je.shtml


Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Kevin A. Burton
Otis Gospodnetic wrote:
It would be interesting to know _what_exactly_ uses your memory. 
Running under an optimizer should tell you that.

The only thing that comes to mind is... can't remember the details now,
but when the index is opened, I believe every 128th term is read into
memory.  This, I believe, helps with index seeks at search time.  I
wonder if this is what's using your memory.  The number '128' can't be
modified just like that, but somebody (Julien?) has modified the code
in the past to make this variable.  That's the only thing I can think
of right now and it may or may not be an idea in the right direction.
 

I loaded it into a profiler a long time ago. Most of the memory was due to
Term objects being loaded into memory.

I might try to get some time to load it into a profiler on monday...
Kevin


Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Kevin A. Burton
Chris Hostetter wrote:
: We have one large index right now... its about 60G ... When I open it
: the Java VM used 940M of memory.  The VM does nothing else besides open
Just out of curiosity, have you tried turning on the verbose gc log, and
putting in some thread sleeps after you open the reader, to see if the
memory footprint "settles down" after a little while?  You're currently
checking the memory usage immediately after opening the index, and some
of that memory may be used holding transient data that will get freed up
after some GC iterations.
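
Concretely, a sketch of that suggestion (the flag and sleep length are
arbitrary):

    // Run with: java -verbose:gc OpenIndexTest
    IndexReader ir = IndexReader.open( dir );
    Thread.sleep( 60 * 1000 ); // let GC reclaim transient data before measuring
    System.out.println( "totalMemory: " + Runtime.getRuntime().totalMemory() );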
 

Actually I haven't, but to be honest the numbers seem dead on.  The VM
heap wouldn't reallocate if it didn't need that much memory, and this is
almost exactly the behavior I'm seeing in production.

Though I guess it wouldn't hurt ;)
Kevin


Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Kevin A. Burton
Paul Elschot wrote:
This would be similar to the way the MySQL index cache works...
   

It would be possible to add another level of indexing to the terms.
No one has done this yet, so I guess it's preferred to buy RAM instead...
 

The problem, I think, for everyone right now is that 32 bits just doesn't
cut it in production systems... at 2G of memory per process you
really start to feel it.

Kevin



Re: Opening up one large index takes 940M of memory?

2005-01-21 Thread Kevin A. Burton
Kevin A. Burton wrote:
We have one large index right now... its about 60G ... When I open it 
the Java VM used 940M of memory.  The VM does nothing else besides 
open this index.
After thinking about it, I guess 1.5% of memory per index really isn't
THAT bad.  What would be nice is if there were a way to do this from disk
and then use a buffer (either via the filesystem or in-VM memory) to
access these variables.

This would be similar to the way the MySQL index cache works...
Kevin


Opening up one large index takes 940M of memory?

2005-01-21 Thread Kevin A. Burton
We have one large index right now... it's about 60G ... When I open it
the Java VM used 940M of memory.  The VM does nothing else besides open
this index.

Here's the code:
   System.out.println( "opening..." );
   long before = System.currentTimeMillis();
   Directory dir = FSDirectory.getDirectory( 
"/var/ksa/index-1078106952160/", false );
   IndexReader ir = IndexReader.open( dir );
   System.out.println( ir.getClass() );
   long after = System.currentTimeMillis();
   System.out.println( "opening...done - duration: " + 
(after-before) );

   System.out.println( "totalMemory: " + 
Runtime.getRuntime().totalMemory() );
   System.out.println( "freeMemory: " + 
Runtime.getRuntime().freeMemory() );

Is there any way to reduce this footprint?  The index is fully 
optimized... I'm willing to take a performance hit if necessary.  Is 
this documented anywhere?

Kevin


Re: Unable to read TLD "META-INF/c.tld" from JAR file ... standard.jar

2004-12-23 Thread Kevin A. Burton
Otis Gospodnetic wrote:
Most definitely Jetty.  I can't believe you're using Tomcat for Rojo!
;)
 

I never said we were using Tomcat for Rojo ;)
Sorry about that btw... wrong list!


Unable to read TLD "META-INF/c.tld" from JAR file ... standard.jar

2004-12-23 Thread Kevin A. Burton
What in the world is up with this exception?
We've migrated to using pre-compiled JSPs in Tomcat 5.5 for performance
reasons, but if I try to start with a FRESH webapp or update any of the
JSPs in place and recompile, I'll get this error:

Any idea?
I thought maybe the .jar files were corrupt but if I md5sum them they are identical to 
production and the Tomcat standard dist.

Thoughts?
org.apache.jasper.JasperException: /subscriptions/index.jsp(1,1) /init.jsp(2,0) Unable to read TLD 
"META-INF/c.tld" from JAR file 
"file:/usr/local/jakarta-tomcat-5.5.4/webapps/rojo/ROOT/WEB-INF/lib/standard.jar": 
org.apache.jasper.JasperException: Failed to load or instantiate TagLibraryValidator class: 
org.apache.taglibs.standard.tlv.JstlCoreTLV

org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:39)

org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:405)

org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:86)

org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:339)
org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:372)
org.apache.jasper.compiler.Parser.parseDirective(Parser.java:475)
org.apache.jasper.compiler.Parser.parseElements(Parser.java:1539)
org.apache.jasper.compiler.Parser.parse(Parser.java:126)

org.apache.jasper.compiler.ParserController.doParse(ParserController.java:211)

org.apache.jasper.compiler.ParserController.parse(ParserController.java:100)
org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:146)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:286)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:267)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:255)

org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:556)

org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:296)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)


Reloading LARGE index causes OutOfMemory... intern Terms?

2004-12-17 Thread Kevin A. Burton
We nightly optimize one of our main indexes, which takes up about 70% of
system memory when loaded due to Term objects being stored in memory.

We perform the optimization out of process, then tell Tomcat to reload
its index.  This causes us to open the index again, which would need
140% of system memory, and so causes an OutOfMemory exception.

Whats the best way to handle this?  Do I open the index again or is 
there a better way to tell Lucene that I'm reloading an existing index 
so it uses less memory?

Is it possible to intern the Term objects so that I only have one term 
per virtual machine instead of one per index?

Kevin


Re: How to index Windows' Compiled HTML Help (CHM) Format

2004-12-11 Thread Kevin A. Burton
Tom wrote:
Hi,
Does anybody know how to index chm-files? 
A possible solution I know is to convert chm-files to pdf-files (there are
converters available for this job) and then use the known tools (e.g.
PDFBox) to index the content of the pdf files (which contain the content of
the chm-files). Are there any tools which can directly grab the textual
content out of the (binary) chm-files?

I think chm-file indexing-support is really a big missing piece in the
currently supported indexable filetype-collection (XML, HTML, PDF,
MSWord-DOC, RTF, Plaintext). 
 

I believe it's just a Microsoft .cab file with an index.html inside it...
am I right?

Just uncompress it.
The problem is that the HTML within them isn't anywhere NEAR standard,
and you can't really give them to the user in the UI...

Kevin


Re: lucene in action ebook

2004-12-09 Thread Kevin A. Burton
Erik Hatcher wrote:
I have the e-book PDF in my possession. I have been prodding Manning 
on a daily basis to update the LIA website and get the e-book 
available. It is ready, and I'm sure that it's just a matter of them
pushing it out. There may be some administrative loose ends they are
tying up before releasing it to the world. It should be available any 
minute now, really. :)
Send off a link to the list when it's out...
We're all holding our breath ;)
(seriously)
Kevin


Re: JDBCDirectory to prevent optimize()?

2004-11-23 Thread Kevin A. Burton
Erik Hatcher wrote:
Also, there is a DBDirectory in the sandbox to store a Lucene index 
inside Berkeley DB.
I assume this would prevent prefix queries from working...
Kevin


JDBCDirectory to prevent optimize()?

2004-11-22 Thread Kevin A. Burton
It seems that, compared to other datastores, Lucene starts to
fall down.  For example, Lucene doesn't perform online index
optimizations, so if you add 10 documents you have to run optimize()
again, and this isn't exactly a fast operation.

I'm wondering about the potential for a generic JDBCDirectory for
keeping the Lucene index within a database.

It sounds somewhat unconventional but would allow you to perform live
addDirectory updates without performing an optimize() again.

Has anyone looked at this?  How practical would it be?
Kevin


Re: Index in RAM - is it realy worthy?

2004-11-22 Thread Kevin A. Burton
Otis Gospodnetic wrote:
For the Lucene book I wrote some test cases that compare FSDirectory
and RAMDirectory.  What I found was that with certain settings
FSDirectory was almost as fast as RAMDirectory.  Personally, I would
push FSDirectory and hope that the OS and the Filesystem do their share
of work and caching for me before looking for ways to optimize my code.
 

Also, another note: doing an index merge in memory is probably
faster if you just use a RAMDirectory and perform addIndexes on it.

This would almost certainly be faster than optimizing on disk but I 
haven't benchmarked it.
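
An untested sketch of that idea (assuming the Lucene 1.4 API; paths and
analyzer are placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    // Merge into a RAMDirectory so the heavy I/O happens in memory.
    RAMDirectory ram = new RAMDirectory();
    IndexWriter writer = new IndexWriter( ram, new StandardAnalyzer(), true );
    writer.addIndexes( new Directory[] {
        FSDirectory.getDirectory( "/index/a", false ),
        FSDirectory.getDirectory( "/index/b", false )
    } );
    writer.close();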

Kevin


Re: Index in RAM - is it realy worthy?

2004-11-22 Thread Kevin A. Burton
Otis Gospodnetic wrote:
For the Lucene book I wrote some test cases that compare FSDirectory
and RAMDirectory.  What I found was that with certain settings
FSDirectory was almost as fast as RAMDirectory.  Personally, I would
push FSDirectory and hope that the OS and the Filesystem do their share
of work and caching for me before looking for ways to optimize my code.
 

Yes... I performed the same benchmark and in my situation RAMDirectory 
for searches was about 2% slower.

I'm willing to bet that it has to do with the fact that it's a Hashtable
and not a HashMap (which isn't synchronized).

Also adding a constructor for the term size could make loading a 
RAMDirectory faster since you could prevent rehash.

If you're on a modern machine your filesystem cache will end up
buffering your disk anyway, which I'm sure was happening in my situation.

Kevin


Mozilla Desktop Search

2004-11-13 Thread Kevin A. Burton
  
http://www.peerfear.org/rss/permalink/2004/11/13/MozillaDesktopSearch/

The Mozilla foundation may be considering a desktop search 
implementation 
<http://computerworld.com/developmenttopics/websitemgmt/story/0,10801,97396,00.html?f=x10> 
:

Having launched the much-awaited Version 1.0 of the Firefox
browser yesterday (see story), The Mozilla Foundation is busy
planning enhancements to the open-source product, including the
possibility of integrating it with a variety of desktop search
tools. The Mozilla Foundation also wants to place Firefox in PCs
through reseller deals with PC hardware vendors and continue to
sharpen the product's pop-up ad-blocking technology. 

I'm not sure this is a good idea. Maybe it is though. The technology 
just isn't there for cross platform search.

I'd have to suggest using Lucene, compiled natively into XPCOM
components with GCJ, but I'm not sure if GCJ is up to the job here.
If this approach is possible then I'd be very excited.

One advantage to this approach is that an HTTP server wouldn't be
necessary since you're already within the browser.

Good for everyone involved. No bloated Tomcat causing problems, and
blazingly fast access within the browser. Also, since TCP isn't
involved, you could fail gracefully when the search service isn't
running; you could just start it.




Lucene external field storage contribution.

2004-11-07 Thread Kevin A. Burton
About 3 months ago I developed an external storage engine which ties into
Lucene.

I'd like to discuss making a contribution so that this is integrated 
into a future version of Lucene.

I'm going to paste my original PROPOSAL in this email. 

There wasn't a ton of feedback the first time around, but I figure the
squeaky wheel gets the grease...



I created this proposal because we need this fixed at work. I want to
go ahead and work on a vertical fix for our version of Lucene and then
submit this back to Jakarta.
There seems to be a lot of interest here and I wanted to get feedback 
from the list before moving forward ...

Should I put this in the wiki?!
Kevin
** OVERVIEW **
Currently Lucene supports 'stored fields', where the content of these
fields is kept within the Lucene index for use in the future.

While acceptable for small indexes, larger amounts of stored fields prevent:

- Fast index merges, since the full content has to be continually merged.
- Storing the indexes in memory (since a LOT of memory would be required
  and this is cost prohibitive).
- Fast queries, since block caching can't be used on the index data.

For example, in our current setup our index size is 20G.  Nearly 90% of
this is content.  If we could store the content outside of Lucene our
merges and searches would be MUCH faster.  If we could store the index
in MEMORY this could be orders of magnitude faster.

** PROPOSAL **

Provide an external field storage mechanism which supports legacy indexes
without modification.  Content is stored in a "content segment".  The only
changes would be a field with 3 (or 4 if checksum enabled) values:

- CS_SEGMENT
  Logical ID of the content segment.  This is an integer value.  There is
  a global Lucene property named CS_ROOT which stores all the content.
  The segments are just flat files with pointers.  Segments are broken
  into logical pieces by time and size.  Usually 100M of content would be
  in one segment.

- CS_OFFSET
  The byte offset of the field.

- CS_LENGTH
  The length of the field.

- CS_CHECKSUM
  Optional checksum to verify that the content is correct when fetched
  from the index.

The field value here would be exactly 'N:O:L' where N is the segment
number, O is the offset, and L is the length.  O and L are 64bit values.
N is a 32bit value (though 64bit wouldn't really hurt).

This mechanism allows for the external storage of any named field.

CS_OFFSET and CS_LENGTH allow use with RandomAccessFile and new NIO code
for efficient content lookup.  (Though filehandle caching should probably
be used.)

Since content is broken into logical 100M segments, the underlying
filesystem can organize each file into contiguous blocks for efficient
non-fragmented lookup.
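
For illustration, fetching a field stored this way would look roughly
like the following (the CS_ROOT layout and segment file naming are
assumptions):

    import java.io.RandomAccessFile;

    // Hypothetical reader for the proposed 'N:O:L' field value.
    public static byte[] fetch( String csRoot, String fieldValue ) throws Exception {
        String[] parts = fieldValue.split( ":" );
        int segment = Integer.parseInt( parts[0] );  // CS_SEGMENT
        long offset = Long.parseLong( parts[1] );    // CS_OFFSET
        int length  = Integer.parseInt( parts[2] );  // CS_LENGTH
        RandomAccessFile raf = new RandomAccessFile( csRoot + "/" + segment + ".cs", "r" );
        try {
            raf.seek( offset );
            byte[] buf = new byte[length];
            raf.readFully( buf );  // CS_CHECKSUM could be verified here
            return buf;
        } finally {
            raf.close();
        }
    }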

File manipulation is easy, and indexes can be merged by simply
concatenating the second file to the end of the first.  (Though the
segment, offset, and length need to be updated.)  (FIXME: I think I need
to think about this more since I will have < 100M per sync.)

Supporting full Unicode is important.  Full java.lang.String storage is
used with String.getBytes(), so we should be able to avoid Unicode
issues.  If Java has a correct java.lang.String representation, it's
possible to easily add Unicode support just by serializing the byte
representation.  (Note that the JDK says the DEFAULT system char
encoding is used, so if this is ever changed it might break the index.)

While Linux and modern versions of Windows (not sure about OSX) support
64bit filesystems, the 4G storage boundary of 32bit filesystems (ext2 is
an example) is an issue.  Using smaller indexes can prevent this, but
eventually segment lookup in the filesystem will be slow.  This will only
happen within terabyte storage systems, so hopefully the developer has
migrated to another (modern) filesystem such as XFS.

** FEATURES **

- Must be able to replicate indexes easily to other hosts.
- Adding content to the index must be CHEAP.
- Deletes need to be cheap (these are cheap for older content; just
  discard older indexes).
- Filesystem needs to be able to optimize storage.
- Must support UNICODE and binary content (images, blobs, byte arrays,
  serialized objects, etc).
- Filesystem metadata operations should be fast.  Since content is kept
  in LARGE indexes this is easy to avoid.
- Migration to the new system from legacy indexes should be fast and
  painless for future developers.
 
 



Ability to apply document age with the score?

2004-10-28 Thread Kevin A. Burton
Lets say I have an index with two documents.  They both have the same 
score but one was added 6 months ago and the other was added 2 minutes ago.

I want the score adjusted based on the age so that older documents have 
a lower score.

I don't want to sort by document age (date) because if one document is 
older but has a HIGHER score it would be better to have it rise above 
newer documents that have a lower score.

Is this possible?  The only way I could think of doing it would be to 
have a DateFilter and then apply a dampening after the query.
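
The dampening could be as simple as this sketch (the half-life constant
is an arbitrary assumption):

    // Re-rank after the query: multiply the raw Lucene score by an age decay.
    static final double HALF_LIFE_DAYS = 30.0; // assumed tuning knob

    static float dampedScore( float rawScore, double ageInDays ) {
        // The score halves every HALF_LIFE_DAYS, so older documents sink
        // unless their raw score is high enough to compensate.
        return (float) ( rawScore * Math.pow( 0.5, ageInDays / HALF_LIFE_DAYS ) );
    }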

Kevin


Lots Of Interest in Lucene Desktop

2004-10-28 Thread Kevin A. Burton
I've made a few passive mentions of my Lucene 
<http://jakarta.apache.org/lucene> Desktop prototype here on PeerFear 
in the last few days and I'm amazed how much feedback I've had. People 
really want to start work on an Open Source desktop search based on 
Lucene.


http://www.peerfear.org/rss/permalink/2004/10/28/LotsOfInterestInLuceneDesktop/


Documents with 1 word are given unfair lengthNorm()

2004-10-27 Thread Kevin A. Burton
WRT my blog post:
It seems the problem is that the distribution for lengthNorm() starts at
1 and moves down from there.  Returning 1.0f would work, but then HUGE
documents wouldn't be normalized and so would distort the results.
What would you think of using this implementation for lengthNorm:
public float lengthNorm( String fieldName, int numTokens ) {
    final int THRESHOLD = 50;

    int nt = numTokens;

    // at or below the threshold: shift so v climbs toward 1.0
    if ( numTokens <= THRESHOLD )
        ++nt;

    // above the threshold: restart the 1/sqrt curve from the threshold
    if ( numTokens > THRESHOLD )
        nt -= THRESHOLD;

    float v = (float)(1.0 / Math.sqrt(nt));

    if ( numTokens <= THRESHOLD )
        v = 1 - v;

    return v;
}
This starts the distribution low... approaches 1.0 when 50 terms are in 
the document... then asymptotically moves to zero from here on out based 
on sqrt.

For example, values from 1 -> 150 would yield (I'd graph this out
but I'm too lazy):

1 - 0.29289323
2 - 0.42264974
3 - 0.5
4 - 0.5527864
5 - 0.5917517
6 - 0.6220355
7 - 0.6464466
8 - 0.6666666
9 - 0.6837722
10 - 0.69848865
11 - 0.7113249
12 - 0.72264993
13 - 0.73273873
14 - 0.74180114
15 - 0.75
16 - 0.7574644
17 - 0.7642977
18 - 0.7705843
19 - 0.7763932
20 - 0.7817821
21 - 0.7867993
22 - 0.7914856
23 - 0.79587585
24 - 0.8
25 - 0.80388385
26 - 0.8075499
27 - 0.81101775
28 - 0.81430465
29 - 0.81742585
30 - 0.8203947
31 - 0.8232233
32 - 0.82592237
33 - 0.8285014
34 - 0.83096915
35 - 0.8333333
36 - 0.83560103
37 - 0.83777857
38 - 0.8398719
39 - 0.8418861
40 - 0.84382623
41 - 0.8456966
42 - 0.8475014
43 - 0.84924436
44 - 0.8509288
45 - 0.852558
46 - 0.85413504
47 - 0.85566247
48 - 0.85714287
49 - 0.8585786
50 - 0.859972
51 - 1.0
52 - 0.70710677
53 - 0.57735026
54 - 0.5
55 - 0.4472136
56 - 0.4082483
57 - 0.37796447
58 - 0.35355338
59 - 0.33333334
60 - 0.31622776
61 - 0.30151135
62 - 0.28867513
63 - 0.2773501
64 - 0.26726124
65 - 0.2581989
66 - 0.25
67 - 0.24253562
68 - 0.23570226
69 - 0.22941573
70 - 0.2236068
71 - 0.2182179
72 - 0.21320072
73 - 0.2085144
74 - 0.20412415
75 - 0.2
76 - 0.19611613
77 - 0.19245009
78 - 0.18898223
79 - 0.18569534
80 - 0.18257418
81 - 0.1796053
82 - 0.17677669
83 - 0.17407766
84 - 0.17149858
85 - 0.16903085
86 - 0.16666667
87 - 0.16439898
88 - 0.16222142
89 - 0.16012815
90 - 0.15811388
91 - 0.15617377
92 - 0.15430336
93 - 0.15249857
94 - 0.15075567
95 - 0.1490712
96 - 0.14744195
97 - 0.145865
98 - 0.14433756
99 - 0.14285715
100 - 0.14142136
101 - 0.14002801
102 - 0.13867505
103 - 0.13736056
104 - 0.13608277
105 - 0.13483997
106 - 0.13363062
107 - 0.13245323
108 - 0.13130644
109 - 0.13018891
110 - 0.12909944
111 - 0.12803689
112 - 0.12700012
113 - 0.12598816
114 - 0.125
115 - 0.12403473
116 - 0.12309149
117 - 0.12216944
118 - 0.12126781
119 - 0.120385855
120 - 0.11952286
121 - 0.11867817
122 - 0.11785113
123 - 0.11704115
124 - 0.11624764
125 - 0.11547005
126 - 0.114707865
127 - 0.11396058
128 - 0.1132277
129 - 0.11250879
130 - 0.1118034
131 - 0.11111111
132 - 0.11043153
133 - 0.10976426
134 - 0.10910895
135 - 0.10846523
136 - 0.107832775
137 - 0.107211255
138 - 0.10660036
139 - 0.10599979
140 - 0.10540926
141 - 0.104828484
142 - 0.1042572
143 - 0.10369517
144 - 0.10314213
145 - 0.10259783
146 - 0.10206208
147 - 0.10153462
148 - 0.101015255
149 - 0.10050378



Re: Poor Lucene Ranking for Short Text

2004-10-27 Thread Kevin A. Burton
Daniel Naber wrote:
(Kevin complains about shorter documents ranked higher)
This is something that can easily be fixed. Just use a Similarity
implementation that extends DefaultSimilarity and overrides
lengthNorm: just return 1.0f there. You need to use that Similarity for
indexing and searching, i.e. it requires reindexing.
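
A minimal sketch of that (assuming the Lucene 1.4 API):

    import org.apache.lucene.search.DefaultSimilarity;

    // Ignore field length entirely so short documents aren't favored.
    public class FlatLengthSimilarity extends DefaultSimilarity {
        public float lengthNorm( String fieldName, int numTokens ) {
            return 1.0f;
        }
    }

    // Use it for both indexing and searching (requires re-indexing):
    //   writer.setSimilarity( new FlatLengthSimilarity() );
    //   searcher.setSimilarity( new FlatLengthSimilarity() );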
 

What happens when I do this with an existing index? I don't want to have 
to rewrite this index as it will take FOREVER

If the current behavior is all that happens this is fine... this way I 
can just get this behavior for new documents that are added.

Also... why isn't this the default?
Kevin


Poor Lucene Ranking for Short Text

2004-10-27 Thread Kevin A. Burton
http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForShortText/


Google Desktop Could be Better

2004-10-15 Thread Kevin A. Burton
http://www.peerfear.org/rss/permalink/2004/10/15/GoogleDesktopCouldBeBetter/


Prevent Lucene from returning short length text...

2004-10-03 Thread Kevin A. Burton
I've noticed that Lucene does a very bad job at doing search ranking 
when text has just a few words in the body.

For example if you searched for the word "World" in the following two 
paragraphs:

"Hello World"
and
"The World is often a dangerous place"
The first paragraph would probably rank first.
Is there a way I can tweak Lucene to return richer content?
Kevin


Re: OutOfMemory example

2004-09-13 Thread Kevin A. Burton
Jiří Kuhn wrote:
Hi,
I think I can reproduce the memory-leaking problem while reopening an index.
Lucene version tested is 1.4.1, version 1.4 final works OK. My JVM is:
$ java -version
java version "1.4.2_05"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04)
Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode)
The code you can test is below, there are only 3 iterations for me if I use 
-Xmx5m, the 4th fails.
 

At least this test seems tied to the Sort API... I removed the sort 
under Lucene 1.3 and it worked fine...

Kevin


Re: OptimizeIt -- Re: force gc idiom - Re: OutOfMemory example

2004-09-13 Thread Kevin A. Burton
David Spencer wrote:
Jiří Kuhn wrote:
This doesn't work either!

You're right.
I'm running under JDK1.5 and trying larger values for -Xmx and it 
still fails.

Running under (Borland's) OptimizeIt shows the number of Terms and
TermInfos (both in org.apache.lucene.index) increase every time thru
the loop, by several hundred instances each.
Yes... I'm running into a similar situation on JDK 1.4.2 with Lucene 
1.3... I used the JMP debugger and all my memory is taken by Terms and 
TermInfo...

I can trace thru some Term instances on the reference graph of 
OptimizeIt but it's unclear to me what's right. One *guess* is that 
maybe the WeakHashMap in either SegmentReader or FieldCacheImpl is the 
problem.
Kevin


Re: IRC?!

2004-09-11 Thread Kevin A. Burton
Harald Tijink wrote:
I hope your idea isn't to replace this Users List and pull the
discussions into the IRC scene. I (and most of us) can not attend to any
IRC chat because of work and other priorities. This list gives me the
opportunity to keep informed ("involved").
 

Yup... I want to replace the mailing lists, wiki, website, CVS, and
Bugzilla with IRC. And if you can't keep up that's just your fault ;) (joke).

Its just another tool ;)
Kevin


IRC?!

2004-09-10 Thread Kevin A. Burton
There isn't a Lucene IRC room is there (at least there isn't according 
to Google)?

I just joined #lucene on irc.freenode.net if anyone is interested...
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Kevin A. Burton
Daniel Taurat wrote:
Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm jdk1.3.1 
that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to 
be 1.2 Gb)
Depends on what OS and with what patches...
Linux on i386 seems to have a practical limit of 1.7G (256M for the VM) ... 
There are some patches to apply to get 3G, but only on really modern kernels.

I just need to get Athlon systems :-/
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


TermInfo using 300M for large index?

2004-09-10 Thread Kevin A. Burton
I'm trying to do some heap debugging of my application to find a memory 
leak.

Noticed that org.apache.lucene.index.TermInfo had 1.7M instances which 
consumed 300M ... this is of course for a 40G index.

Is this normal and is there any way I can streamline this?
We are of course caching the IndexSearchers but I want to reduce the 
memory footprint...

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Anyone avail for Lucene consulting or employment in the SF area?

2004-09-05 Thread Kevin A. Burton
Hope no one considers this spam ;)
We're hiring either someone full-time who has strong experience with 
Java, Lucene, and Jakarta technologies or someone to act as a consultant 
working on Lucene for about a month optimizing our search infra.

This is for a startup located in downtown SF.
Send me an email including your resume (html or text only) and I'll 
respond with full details.

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Possible to remove duplicate documents in sort API?

2004-09-05 Thread Kevin A. Burton
Paul Elschot wrote:
Kevin,
On Sunday 05 September 2004 10:16, Kevin A. Burton wrote:
 

I want to sort a result set but perform a group by as well... IE remove
duplicate items.
   

Could you be more precise?
 

My problem is that I have two machines... one for searching, one for 
indexing.

The searcher has an existing index.
The indexer found an UPDATED document and then adds it to a new index 
and pushes that new index over to the searcher.

The searcher then reloads and when someone performs a search BOTH 
documents could show up (including the stale document).

I can't do a delete() on the searcher because the indexer doesn't have 
the entire index as the searcher.

Therefore I wanted to group by the same document ID but this doesn't 
seem possible.  This should suppress the stale document and prefer the 
newer doc.

Is this possible with the new API?  Seems like a huge drawback to lucene
right now.
   

In case you can define another field that defines what is a duplicate
by having the same value for duplicates, you can use it as one of the
SortField's for sorting.
 

I have this duplicate field...
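
For reference, a rough sketch of the client-side workaround (1.4-era 
API; the "key" and "title" field names are made up): walk the Hits and 
skip any document whose duplicate field was already seen.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

public class DedupHits {
  /** Print hits, suppressing repeats of the same "key" value. */
  public static void printUnique(Hits hits) throws IOException {
    Set seen = new HashSet();
    for (int i = 0; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      String key = doc.get("key");
      if (key != null && !seen.add(key)) {
        continue;  // key already seen; skip the later (presumably stale) copy
      }
      System.out.println(doc.get("title"));
    }
  }
}
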
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



Why doesn't Document use a HashSet instead of a LinkedList (DocumentFieldList)

2004-09-05 Thread Kevin A. Burton
It looks like Document.java uses its own implementation of a LinkedList...
Why not use a HashMap to enable O(1) lookup? Right now field lookup is 
O(N), which is certainly no fun.

Was this benchmarked?  Perhaps theres the assumption that since 
documents often have few fields the object overhead and hashcode 
overhead would have been less this way.
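
For anyone who does hit this in a profile, a rough sketch of a one-off 
map built on top of the existing API (1.4-era Document.fields(); nothing 
like this is in Lucene itself):

import java.util.Enumeration;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldMap {
  /** Copy a Document's fields into a HashMap once, for O(1) lookups. */
  public static Map toMap(Document doc) {
    Map byName = new HashMap();
    for (Enumeration e = doc.fields(); e.hasMoreElements(); ) {
      Field f = (Field) e.nextElement();
      if (!byName.containsKey(f.name())) {  // keep the first, like get()
        byName.put(f.name(), f.stringValue());
      }
    }
    return byName;
  }
}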

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Possible to remove duplicate documents in sort API?

2004-09-05 Thread Kevin A. Burton
I want to sort a result set but perform a group by as well... IE remove 
duplicate items. 

Is this possible with the new API?  Seems like a huge drawback to lucene 
right now.

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Patch for IndexWriter.close which prevents NPE...

2004-09-03 Thread Kevin A. Burton
I just attached a patch which:
1. prevents multiple close() of an IndexWriter
2. prevents an NPE if the writeLock was null.
We have been noticing this from time to time and I haven't been able to 
come up with a hard test case.  This is just a bit of defensive 
programming to prevent it from happening in the first place.  It would 
happen intermittently without any reliable cause.

Anyway...
Thanks...
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

--- IndexWriter.java.bak.close  2004-09-03 11:27:37.0 -0700
+++ IndexWriter.java            2004-09-03 11:32:02.0 -0700
@@ -107,6 +107,11 @@
    */
   private boolean useCompoundFile = false;
 
+  /**
+   * True when we have closed this IndexWriter
+   */
+  protected boolean isClosed = false;
+
   /** Setting to turn on usage of a compound file. When on, multiple files
    *  for each segment are merged into a single file once the segment creation
    *  is finished. This is done regardless of what directory is in use.
@@ -183,15 +188,27 @@
       }.run();
     }
   }
-
+
   /** Flushes all changes to an index, closes all associated files, and closes
     the directory that the index is stored in. */
   public synchronized void close() throws IOException {
+
+    if ( isClosed ) {
+      return;
+    }
+
     flushRamSegments();
     ramDirectory.close();
-    writeLock.release();  // release write lock
+
+    if ( writeLock != null ) {
+      // release write lock
+      writeLock.release();
+    }
+
     writeLock = null;
     directory.close();
+    isClosed = true;
+
   }
 
   /** Release the write lock, if needed. */

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Benchmark of filesystem cache for index vs RAMDirectory...

2004-08-08 Thread Kevin A. Burton
Daniel Naber wrote:
On Sunday 08 August 2004 03:40, Kevin A. Burton wrote:
 

Would a HashMap implementation of RAMDirectory beat out a cached
FSDirectory?
   

It's easy to test, so it's worth a try. Please try if the attached patch 
makes any difference for you compared to the current implementation of 
RAMDirectory.

 

True... I was just thinking out loud... was being lazy.  Now I actually 
have to do the work to create the benchmark again... damn you ;)

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



Benchmark of filesystem cache for index vs RAMDirectory...

2004-08-07 Thread Kevin A. Burton
I'm not sure how Solaris or Windows perform, but the Linux block cache 
will essentially use all available memory to buffer the filesystem.

If one is using an FSDirectory to perform searches, the first 
search will be slow but the remaining searches will be fast, since Linux 
will by then have buffered the index in RAM.

The RAMDirectory has the advantage of pre-loading everything and can 
keep it in memory if the box is performing other operations.

In my benchmarks though RAMDirectory is slightly slower.  I assume this 
is because it's Hashtable-based, even though it only needs to be 
synchronized in a few places; i.e. searches should never be synchronized...

Would a HashMap implementation of RAMDirectory beat out a cached 
FSDirectory?
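
For reference, a rough sketch of the benchmark shape (1.4-era API; the 
path and query term are made up): time repeated searches against the 
same index through each Directory.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class DirBench {
  public static void main(String[] args) throws Exception {
    Directory fs = FSDirectory.getDirectory("/index/test", false);
    Directory ram = new RAMDirectory(fs);  // pre-load the index into RAM
    time("FSDirectory (OS cache)", fs);
    time("RAMDirectory", ram);
  }

  static void time(String label, Directory dir) throws Exception {
    IndexSearcher searcher = new IndexSearcher(dir);
    TermQuery query = new TermQuery(new Term("body", "lucene"));
    searcher.search(query);  // warm up caches before timing
    long start = System.currentTimeMillis();
    for (int i = 0; i < 1000; i++) {
      searcher.search(query);
    }
    System.out.println(label + ": "
        + (System.currentTimeMillis() - start) + " ms");
    searcher.close();
  }
}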

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Performance when computing computing a filter using hundreds of diff terms.

2004-08-05 Thread Kevin A. Burton
I'm trying to compute a filter to match documents in our index by a set 
of terms.

For example some documents have a given field 'category' so I need to 
compute a filter with mulitple categories.

The problem is that our category list is > 200 items so it takes about 
80 seconds to compute.  We cache it of course but this seems WAY too slow.

Is there anything I could do to speed it up?  Maybe run the queries 
myself and then combine the bitsets?

We're using a BooleanQuery with nested TermQueries to build up the 
filter...
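
Combining the bitsets yourself should skip scoring entirely, which is 
probably where the time is going.  A rough sketch of a custom Filter 
that ORs one bit per matching document (1.4-era API; the "category" 
field name is an assumption):

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

public class CategoryFilter extends Filter {
  private final String[] categories;

  public CategoryFilter(String[] categories) {
    this.categories = categories;
  }

  public BitSet bits(IndexReader reader) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());
    TermDocs termDocs = reader.termDocs();
    for (int i = 0; i < categories.length; i++) {
      termDocs.seek(new Term("category", categories[i]));
      while (termDocs.next()) {
        bits.set(termDocs.doc());  // OR in every doc for this category
      }
    }
    termDocs.close();
    return bits;
  }
}

The result can then be passed to Searcher.search(query, filter) and 
cached like any other Filter.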

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Split an existing index into smaller segments without a re-index?

2004-08-04 Thread Kevin A. Burton
Is it possible to take an existing index (say 1G) and break it up into a 
number of smaller indexes (say 10 100M indexes)...

I don't think theres currently an API for this but its certainly 
possible (I think).
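
There's no API for it, but a crude sketch that works IF every field is 
stored (a big if: unstored fields would be lost and everything gets 
re-analyzed) is to read each document back and round-robin it into N 
new writers:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class IndexSplitter {
  public static void split(String src, String[] dests) throws Exception {
    IndexReader reader = IndexReader.open(src);
    IndexWriter[] writers = new IndexWriter[dests.length];
    for (int i = 0; i < writers.length; i++) {
      writers[i] = new IndexWriter(dests[i], new StandardAnalyzer(), true);
    }
    for (int doc = 0; doc < reader.maxDoc(); doc++) {
      if (reader.isDeleted(doc)) continue;
      // only valid when every field was stored in the source index
      writers[doc % writers.length].addDocument(reader.document(doc));
    }
    for (int i = 0; i < writers.length; i++) {
      writers[i].optimize();
      writers[i].close();
    }
    reader.close();
  }
}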

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Progress bar for Lucene

2004-07-29 Thread Kevin A. Burton
Hannah c wrote:
Hi,
Is there anything in lucene that would help with the implementation of 
a progress bar. Somewhere I could throw an event that says the search 
is 10%, 20% complete etc.  Or is there already an implementation of a 
progress bar available for lucene.
I would really like to see something like this for index optimizes 
actually.  If an optimized takes 45 minutes its nice to see a progress 
indicator.

Of course I've thought about just doing a disk-based estimate...
you just watch the index being created and estimate its target size...
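
For reference, a rough sketch of that estimate (plain java.io; the 
target size is a number you have to guess, e.g. roughly the current 
index size):

import java.io.File;

public class OptimizeProgress extends Thread {
  private final File indexDir;
  private final long estimatedFinalBytes;

  public OptimizeProgress(File indexDir, long estimatedFinalBytes) {
    this.indexDir = indexDir;
    this.estimatedFinalBytes = estimatedFinalBytes;
    setDaemon(true);  // dies with the VM when the optimize finishes
  }

  public void run() {
    while (true) {
      long bytes = 0;
      File[] files = indexDir.listFiles();
      for (int i = 0; files != null && i < files.length; i++) {
        bytes += files[i].length();
      }
      // crude: counts old segments too, so treat this as a rough signal
      System.out.println("optimize ~"
          + (100 * bytes / estimatedFinalBytes) + "%");
      try { Thread.sleep(10000); } catch (InterruptedException e) { return; }
    }
  }
}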

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


GROUP BY functionality.

2004-07-27 Thread Kevin A. Burton
In 1.4 we now have arbitrary sort support...
Is it possible to use GROUP BY without having to do this on the client 
(which would be inefficient)...

I have a field I want to make sure is unique in my search results.
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-14 Thread Kevin A. Burton
Doug Cutting wrote:
Aviran wrote:
I changed the Lucene 1.4 final source code and yes this is the source
version I changed.

Note that this patch won't produce a speedup on earlier releases, 
since there was another multi-thread bottleneck higher up the stack 
that was only recently removed, revealing this lower-level bottleneck.

The other patch was:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg07873.html
Both are required to see the speedup.
Thanks...
Also, is there any reason folks cannot use 1.4 final now?
No... just that I'm trying to be conservative... I'm probably going to 
look at just migrating to 1.4 ASAP but we're close to a milestone...

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Kevin A. Burton
Aviran wrote:
Bug 30058 posted
 

Which of course is here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=30058
Is this the source of the revision you modified?
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06116.html
Also what version of Lucene?
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why is Field.java final?

2004-07-12 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
I was going to create a new IDField class which just calls super( 
name, value, false, true, false) but noticed I was prevented because 
Field.java is final?

You don't need to subclass to do this, just a static method somewhere.
Why is this? I can't see any harm in making it non-final...

Field and Document are not designed to be extensible. They are 
persisted in such a way that added methods are not available when the 
field is restored. In other words, when a field is read, it always 
constructs an instance of Field, not a subclass.
That's fine... I think that's acceptable behavior. I don't think anyone 
would assume that inner vars are restored or that the field is serialized.

Not a big deal but it would be nice...
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Field.java -> STORED, NOT_STORED, etc...

2004-07-12 Thread Kevin A. Burton
Doug Cutting wrote:
It would be best to get the compiler to check the order.
If we change this, why not use type-safe enumerations:
http://www.javapractices.com/Topic1.cjp
The calls would look like:
new Field("name", "value", Stored.YES, Indexed.NO, Tokenized.YES);
Stored could be implemented as the nested class:
public final class Stored {
  private Stored() {}
  public static final Stored YES = new Stored();
  public static final Stored NO = new Stored();
}
+1... I'm not in love with this pattern, but since Java < 1.5 doesn't 
support enums it's better than nothing.

I also didn't want to submit a recommendation that would break APIs. I 
assume the old API would be deprecated?

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene Search has poor cpu utilization on a 4-CPU machine

2004-07-12 Thread Kevin A. Burton
Doug Cutting wrote:
I noticed that the class org.apache.lucene.index.FieldInfos uses private
class members Vector byNumber and Hashtable byName, both of which are
synchronized objects. By changing the Vector byNumber to ArrayList 
byNumber
I was able to get 110% improvement in performance (number of searches 
per
second).

That's impressive! Good job finding a bottleneck!
Wow... thats awesome.
We have all dual Xeons with Hyperthreading and kernel 2.6, so I imagine 
in this situation we'd see an improvement too.

I wonder if we could break this out into a patch for legacy Lucene 
users. I'd like to see the stacktrace too.

We're using a lot of synchronized code (Hashtable, Vector, etc) so I'm 
willing to bet this is happening in other places.

My question is: do the fields byNumber and byName have to be 
synchronized
and what can happen if I'll change them to be ArrayList and HashMap 
which
are not synchronized ? Can this corrupt the index or the integrity of 
the
results?

I think that is a safe change. FieldInfos is only modifed by 
DocumentWriter and SegmentMerger, and there is no possibility of other 
threads accessing those instances. Please submit a patch to the 
developer mailing list.

That would be great!
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Why is Field.java final?

2004-07-11 Thread Kevin A. Burton
John Wang wrote:
I was running into the similar problems with Lucene classes being
final. In my case the Token class. I sent out an email but no one
responeded :(
 

final is often abused... as is private.
anyway... maybe we can submit a patch :)
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Field.java -> STORED, NOT_STORED, etc...

2004-07-11 Thread Kevin A. Burton
I've been working with the Field class doing index conversions between 
an old index format to my new external content store proposal (thus the 
email about the 14M convert).

Anyway... I find the whole Field.Keyword, Field.Text thing confusing.  
The main problem is that the constructor to Field just takes booleans, 
and if you forget the ordering of the booleans it's very confusing.

new Field( "name", "value", true, false, true );
So looking at that you have NO idea what it's doing without fetching the javadoc.
So I added a few constants to my class:
new Field( "name", "value", NOT_STORED, INDEXED, NOT_TOKENIZED );
which IMO is a lot easier to maintain.
Why not add these constants to Field.java:
   public static final boolean STORED = true;
   public static final boolean NOT_STORED = false;
   public static final boolean INDEXED = true;
   public static final boolean NOT_INDEXED = false;
   public static final boolean TOKENIZED = true;
   public static final boolean NOT_TOKENIZED = false;
Of course you still have to remember the order but this becomes a lot 
easier to maintain.

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Why is Field.java final?

2004-07-10 Thread Kevin A. Burton
I was going to create a new IDField class which just calls super( name, 
value, false, true, false) but noticed I was prevented because 
Field.java is final?

Why is this?  I can't see any harm in making it non-final...
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Increasing Linux kernel open file limits.

2004-07-08 Thread Kevin A. Burton
Don't know if anyone knew this:
http://www.hp-eloquence.com/sdb/html/linux_limits.html
The kernel allocates filehandles dynamically up to a limit specified 
by file-max.

The value in file-max denotes the maximum number of file- handles that 
the Linux kernel will allocate. When you get lots of error messages 
about running out of file handles, you might want to increase this limit.

The three values in file-nr denote the number of allocated file 
handles, the number of used file handles and the maximum number of 
file handles. When the allocated filehandles come close to the 
maximum, but the number of actually used ones is far behind, you've 
encountered a peak in your filehandle usage and you don't need to 
increase the maximum.

So as root you can allocate as many file handles as you want without 
any limits enforced by glibc, but you still have to fight against the kernel.

Just doing an echo 100 > /proc/sys/fs/file-max works fine.
Then I can keep track of my file limit by doing a
cat /proc/sys/fs/file-nr
At least this works on 2.6.x...
Think this is going to save me a lot of headache!
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Something sounds very wrong for there to be that many files.
The maximum number of files should be around:
(7 + numIndexedFields) * (mergeFactor-1) * 
(log_base_mergeFactor(numDocs/minMergeDocs))

With 14M documents, log_10(14M/1000) is 4, which gives, for you:
(7 + numIndexedFields) * 36 = 230k
7*36 + numIndexedFields*36 = 230k
numIndexedFields = (230k - 7*36) / 36 =~ 6k
So you'd have to have around 6k unique field names to get 230k files. 
Or something else must be wrong. Are you running on win32, where file 
deletion can be difficult?

With the typical handful of fields, one should never see more than 
hundreds of files.

We only have 13 fields... Though to be honest I'm worried that even if I 
COULD do the optimize it would run out of file handles.

This is very strange...
I'm going to increase minMergeDocs to 1 and then run the full 
converstion on one box and then try to do an optimize (of the corrupt) 
another box. See which one finishes first.

I assume the speed of optimize() can be increased the same way that 
indexing is increased...

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene shouldn't use java.io.tmpdir

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
This is why I think it makes more sense to use our own java.io.tmpdir 
to be on the safe side.

I think the bug is that Tomcat changes java.io.tmpdir. I thought that 
the point of the system property java.io.tmpdir was to have a portable 
name for /tmp on unix, c:\windows\tmp on Windows, etc. Tomcat breaks 
that. So must Lucene have its own way of finding the platform-specific 
temporary directory that everyone can write to? Perhaps, but it seems 
a shame, since Java already has a standard mechanism for this, which 
Tomcat abuses...
I've seen this done in other places as well. I think WebLogic did/does 
it. I'm wondering what some of these big EJB containers use, which is 
why I brought this up. I'm not sure the problem is just with Tomcat.

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
No... I changed the mergeFactor back to 10 as you suggested.

Then I am confused about why it should take so long.
Did you by chance set the IndexWriter.infoStream to something, so that 
it logs merges? If so, it would be interesting to see that output, 
especially the last entry.

No I didn't actually... If I run it again I'll be sure to do this.
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Lucene shouldn't use java.io.tmpdir

2004-07-08 Thread Kevin A. Burton
Otis Gospodnetic wrote:
Hey Kevin,
Not sure if you're aware of it, but you can specify the lock dir, so in
your example, both JVMs could use the exact same lock dir, as long as
you invoke the VMs with the same params.  

Most people won't do this or won't even understand WHY they need to do 
this :-/.

You shouldn't be writing the
same index with more than 1 IndexWriter though (not sure if this was
just a bad example or a real scenario).
 

Yes... I realize that you shouldn't use more than one IndexWriter. That 
was the point. The locks are to prevent this from happening. If one were 
to accidentally do this the locks would be in different directories and 
our IndexWriter would corrupt the index.

This is why I think it makes more sense to use our own java.io.tmpdir to 
be on the safe side.
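
For reference, a sketch of pinning the lock directory explicitly so that 
every VM agrees (this assumes the 1.4-era system property that 
FSDirectory reads; it must be set before any index is opened, since the 
value is captured in a static):

public class LockDirSetup {
  public static void main(String[] args) {
    // set before FSDirectory is first loaded
    System.setProperty("org.apache.lucene.lockDir", "/var/lucene/locks");
    // ... open IndexWriters / IndexSearchers as usual from here on ...
  }
}

Or equivalently pass -Dorg.apache.lucene.lockDir=/var/lucene/locks on 
the command line of every VM that touches the index.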

--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Understanding TooManyClauses-Exception and Query-RAM-size

2004-07-08 Thread Kevin A. Burton
[EMAIL PROTECTED] wrote:
Hi,
a couple of weeks ago we migrated from Lucene 1.2 to 1.4rc3. Everything went
smoothly, but we are experiencing some problems with that new constant limit
maxClauseCount=1024
which leeds to Exceptions of type 

	org.apache.lucene.search.BooleanQuery$TooManyClauses 

when certain RangeQueries are executed (in fact, we get this Excpetion when
we execute certain Wildcard queries, too). Although we are working with a
fairly small index with about 35.000 documents, we encounter this Exception
when we search for the property "modificationDate". For example
	modificationDate:[00 TO 0dwc970kw] 

 

We talked about this the other day.
http://wiki.apache.org/jakarta-lucene/IndexingDateFields
Find out what type of precision you need and use that.  If you only need 
days or hours or minutes then use that.   Millis is just too small. 

We're only using days and have queries for just the last 7 days as max 
so this really works out well...
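
For reference, a sketch of that day-precision scheme (the field name and 
format are our conventions, not anything Lucene mandates):

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DayPrecisionDate {
  // yyyyMMdd sorts lexicographically == chronologically
  // (SimpleDateFormat isn't thread-safe; fine for a sketch)
  static final SimpleDateFormat DAY = new SimpleDateFormat("yyyyMMdd");

  public static void addDate(Document doc, Date modified) {
    doc.add(Field.Keyword("modificationDate", DAY.format(modified)));
  }
}

A last-7-days query then becomes something like 
modificationDate:[20040701 TO 20040708], which expands to a handful of 
terms instead of blowing past maxClauseCount.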

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
So is it possible to fix this index now? Can I just delete the most 
recent segment that was created? I can find this by ls -alt

Sorry, I forgot to answer your question: this should work fine. I 
don't think you should even have to delete that segment.
I'm worried about duplicate or missing content from the original index. 
I'd rather rebuild the index and waste another 6 hours (I've probably 
blown 100 hours of CPU time on this already) and have a correct index :)

During an optimize I assume Lucene starts writing to a new segment and 
leaves all others in place until everything is done and THEN deletes them?

Also, to elaborate on my previous comment, a mergeFactor of 5000 not 
only delays the work until the end, but it also makes the disk 
workload more seek-dominated, which is not optimal. 
The only settings I use are:
targetIndex.mergeFactor=10;
targetIndex.minMergeDocs=1000;
The resulting index has 230k files in it :-/
I assume this is contributing to all the disk seeks.
So I suspect a smaller merge factor, together with a larger 
minMergeDocs, will be much faster overall, including the final 
optimize(). Please tell us how it goes.

This is what I did for this last round but then I ended up with the 
highly fragmented index.

hm...
Thanks for all the help btw!
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Doug Cutting wrote:
Kevin A. Burton wrote:
Also... what can I do to speed up this optimize? Ideally it wouldn't 
take 6 hours.

Was this the index with the mergeFactor of 5000? If so, that's why 
it's so slow: you've delayed all of the work until the end. Indexing 
on a ramfs will make things faster in general, however, if you have 
enough RAM...
No... I changed the mergeFactor back to 10 as you suggested.
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
Peter M Cipollone wrote:
You might try merging the existing index into a new index located on a ram
disk.  Once it is done, you can move the directory from ram disk back to
your hard disk.  I think this will work as long as the old index did not
finish merging.  You might do a "strings" command on the segments file to
make sure the new (merged) segment is not in there, and if there's a
"deletable" file, make sure there are no segments from the old index listed
therein.
 

It's a HUGE index.  It won't fit in memory ;)  Right now it's at 8G...
Thanks though! :)
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Way to repair an index broking during 1/2 optimize?

2004-07-08 Thread Kevin A. Burton
So.. the other day I sent an email about building an index with 14M 
documents.

That went well but the optimize() was taking FOREVER.  It took 7 hours 
to generate the whole index, and as of 10AM it was still 
optimizing (6 hours later) when I needed the box back.

So is it possible to fix this index now?  Can I just delete the most 
recent segment that was created?  I can find this by ls -alt

Also... what can I do to speed up this optimize?  Ideally it wouldn't 
take 6 hours.

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Lucene shouldn't use java.io.tmpdir

2004-07-07 Thread Kevin A. Burton
As of 1.3 (or was it 1.4?) Lucene migrated to using java.io.tmpdir to 
store the locks for the index.

While under most situations this is safe, a lot of application servers 
change java.io.tmpdir at runtime.

Tomcat is a good example.  Within Tomcat this property is set to 
TOMCAT_HOME/temp..

Under this situation if I were to create two IndexWriters within two VMs 
and try to write to the same index  the index would get corrupted if one 
Lucene instance was within Tomcat and the other was within a standard VM.

I think we should consider either:
1. Using our own tmpdir property based on the given OS.
2. Go back to the old mechanism of storing the locks within the index 
basedir (if it's not readonly).

Thoughts?
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Most efficient way to index 14M documents (out of memory/file handles)

2004-07-07 Thread Kevin A. Burton
Doug Cutting wrote:
Julien,
Thanks for the excellent explanation.
I think this thread points to a documentation problem. We should 
improve the javadoc for these parameters to make it easier for folks to 
use them correctly.

In particular, the javadoc for mergeFactor should mention that very 
large values (>100) are not recommended, since they can run into file 
handle limitations with FSDirectory. The maximum number of open files 
while merging is around mergeFactor * (5 + number of indexed fields). 
Perhaps mergeFactor should be tagged an "Expert" parameter to 
discourage folks playing with it, as it is such a common source of 
problems.

The javadoc should instead encourage using minMergeDocs to increase 
indexing speed by using more memory. This parameter is unfortunately 
poorly named. It should really be called something like maxBufferedDocs.
I'd like to see something like this done...
BTW.. I'm willing to add it to the wiki in the interim.
This conversation has happened a few times now...
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Most efficient way to index 14M documents (out of memory/file handles)

2004-07-06 Thread Kevin A. Burton
I'm trying to burn an index of 14M documents.
I have two problems.
1.  I have to run optimize() every 50k documents or I run out of file 
handles.  This takes TIME and of course is linear in the size of the 
index, so it just gets slower the further I get.  It starts to crawl 
at about 3M documents.

2.  I eventually will run out of memory in this configuration.
I KNOW this has been covered before but for the life of me I can't find 
it in the archives, the FAQ or the wiki. 

I'm using an IndexWriter with a mergeFactor of 5k and then optimizing 
every 50k documents.

Does it make sense to just create a new IndexWriter for every 50k docs 
and then do one big optimize() at the end?
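
For reference, the shape this thread seems to converge on (a sketch with 
made-up paths and a hypothetical makeDocument(); mergeFactor stays 
modest and minMergeDocs carries the memory/speed tradeoff):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    IndexWriter writer =
        new IndexWriter("/index/big", new StandardAnalyzer(), true);
    writer.mergeFactor = 10;     // keeps file-handle usage bounded
    writer.minMergeDocs = 1000;  // buffer more docs in RAM between merges
    for (int i = 0; i < 14000000; i++) {
      Document doc = makeDocument(i);  // hypothetical document source
      writer.addDocument(doc);
    }
    writer.optimize();  // one big optimize at the end, not every 50k docs
    writer.close();
  }

  static Document makeDocument(int i) {
    Document doc = new Document();
    // ... populate fields here ...
    return doc;
  }
}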

Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Preventing duplicate document insertion during optimize

2004-04-30 Thread Kevin A. Burton
Let's say you have two indexes each with the same document literal.  All 
the fields hash the same and the document is a binary duplicate of a 
different document in the second index.

What happens when you do a merge to create a 3rd index from the first 
two?  I assume you now have two documents that are identical in one 
index.  Is there any way to prevent this?

It would be nice to figure out if there's a way to flag a field as a 
primary key, so that a document whose key has already been added is just skipped.
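
For reference, the usual workaround at indexing time (a sketch; the "pk" 
field name is hypothetical) is to delete by a key term before re-adding, 
via the 1.4-era IndexReader.delete(Term).  It doesn't help when merging 
two already-built indexes, but it keeps the duplicates from arising:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpsertByKey {
  /** Delete any existing doc with this key, then add the new version. */
  public static void upsert(String indexPath, Document doc, String key)
      throws Exception {
    IndexReader reader = IndexReader.open(indexPath);
    reader.delete(new Term("pk", key));  // no-op if the key isn't there
    reader.close();  // deletes are flushed on close

    IndexWriter writer =
        new IndexWriter(indexPath, new StandardAnalyzer(), false);
    writer.addDocument(doc);
    writer.close();
  }
}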

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Created LockObtainTimedOut wiki page

2004-04-28 Thread Kevin A. Burton
I just created a LockObtainTimedOut wiki entry... feel free to add.  I 
just entered the Tomcat issue with java.io.tmpdir as well.

http://wiki.apache.org/jakarta-lucene/LockObtainTimedOut  

Peace!

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
Gus Kormeier wrote:

Not sure if our installation is the same or not, but we are also using
Tomcat.
I had a similiar problem last week, it occurred after Tomcat went through a
hard restart and some software errors had the website hammered.
I found the lock file in /usr/local/tomcat/temp/ using locate.
According to the README.txt this is a directory created for the JVM within
Tomcat.  So it is a system temp directory, just inside Tomcat.
 

Man... you ROCK!  I didn't even THINK of that... Hm... I wonder if we 
should include the name of the lock file in the Exception within 
Tomcat.  That would probably have saved me a lot of time :)

Either that or we can put this in the wiki

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
James Dunn wrote:

Which version of lucene are you using?  In 1.2, I
believe the lock file was located in the index
directory itself.  In 1.3, it's in your system's tmp
folder.  
 

Yes... 1.3 and I have a script that removes the locks from both dirs... 
This is only one process so it's just fine to remove them.

Perhaps it's a permission problem on either one of
those folders.  Maybe your process doesn't have write
access to the correct folder and is thus unable to
create the lock file?  
 

I thought about that too... I have plenty of disk space so that's not an 
issue.  Also did a chmod -R so that should work too.

You can also pass lucene a system property to increase
the lock timeout interval, like so:
-Dorg.apache.lucene.commitLockTimeout=6

or 

-Dorg.apache.lucene.writeLockTimeout=6
 

I'll give that a try... good idea.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
Kevin A. Burton wrote:

Actually this is exactly the problem... I ran some single index tests 
and a single process seems to read from it.

The problem is that we were running under Tomcat with diff webapps for 
testing and didn't run into this problem before.  We had an 11G index 
that just took a while to open and during this open Lucene was 
creating a lock.
I wasn't sure that Tomcat was multithreading this so maybe it is and 
it's just taking longer to open the lock in some situations.

This is strange... after removing all the webapps (besides 1) Tomcat 
still refuses to allow Lucene to open this index with Lock obtain timed out.

If I open it up from the console it works just fine.  I'm only doing it 
with one index and a ulimit -n so it's not a files issue.  Memory is 1G 
for Tomcat.

If I figure this out I will be sure to send a message to the list.  This 
is a strange one...

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
[EMAIL PROTECTED] wrote:

It is possible that a previous operation on the index left the lock open.
Leaving the IndexWriter or Reader open without closing them ( in a finally
block ) could cause this.
 

Actually this is exactly the problem... I ran some single index tests 
and a single process seems to read from it.

The problem is that we were running under Tomcat with diff webapps for 
testing and didn't run into this problem before.  We had an 11G index 
that just took a while to open and during this open Lucene was creating 
a lock. 

I wasn't sure that Tomcat was multithreading this so maybe it is and 
it's just taking longer to open the lock in some situations.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
I've noticed this really strange problem on one of our boxes.  It's 
happened twice already.

We have indexes where, when Lucene starts, it says 'Lock obtain timed out' 
... however NO locks exist for the directory. 

There are no other processes present and no locks in the index dir or /tmp.

Is there anyway to figure out what's going on here?

Looking at the index it seems just fine... But this is only a brief 
glance.  I was hoping that if it was corrupt (which I don't think it is) 
Lucene would give me a better error than "Lock obtain timed out"

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: Does a RAMDirectory ever need to merge segments... (performanceissue)

2004-04-21 Thread Kevin A. Burton
Gerard Sychay wrote:

I've always wondered about this too.  To put it another way, how does
mergeFactor affect an IndexWriter backed by a RAMDirectory?  Can I set
mergeFactor to the highest possible value (given the machine's RAM) in
order to avoid merging segments?
 

Yes... actually I was thinking of increasing these vars on the 
RAMDirectory in the hope of avoiding this CPU overhead...

Also I think the var you want is minMergeDocs, not mergeFactor.  The only 
problem is that the source to maybeMergeSegments says:

  private final void maybeMergeSegments() throws IOException {
    long targetMergeDocs = minMergeDocs;
    while (targetMergeDocs <= maxMergeDocs) {
So I guess to prevent this we would have to set minMergeDocs to 
maxMergeDocs+1 ... which makes no sense.  Also, by default maxMergeDocs 
is Integer.MAX_VALUE, so that would have to be changed.

Anyway... I'm still playing with this myself. It might be easier to just 
use an ArrayList of N documents if you know for sure how big your RAM 
dir will grow to.
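
For reference, a sketch of the ArrayList idea (the flush threshold is 
arbitrary): buffer Documents in a plain list, then feed them to the real 
writer in one pass, so no merging happens until you decide to flush.

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class BufferedAdder {
  private static final int FLUSH_AT = 1000;  // arbitrary threshold

  private final IndexWriter writer;          // the on-disk writer
  private final List pending = new ArrayList();

  public BufferedAdder(IndexWriter writer) {
    this.writer = writer;
  }

  public void add(Document doc) throws Exception {
    pending.add(doc);  // no analysis, no merging: just a list append
    if (pending.size() >= FLUSH_AT) flush();
  }

  public void flush() throws Exception {
    for (int i = 0; i < pending.size(); i++) {
      writer.addDocument((Document) pending.get(i));
    }
    pending.clear();
  }
}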

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Does a RAMDirectory ever need to merge segments... (performance issue)

2004-04-20 Thread Kevin A. Burton
I've been benchmarking our indexer to find out if I can squeeze any more 
performance out of it.

I noticed one problem with RAMDirectory... I'm storing documents in 
memory and then writing them to disk every once in a while. ...

IndexWriter.maybeMergeSegments is taking up 5% of total runtime. 
DocumentWriter.addDocument is taking up another 17% of total runtime.

Notice that this doesn't == 100% because there are other tasks taking up 
CPU before and after Lucene is called.

Anyway... I don't see why RAMDirectory is trying to merge segments.  Is 
there anyway to prevent this?  I could just store them in a big 
ArrayList until I'm ready to write them to a disk index but I'm not sure 
how efficient this will be.

Anyone run into this before?

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)

2004-04-14 Thread Kevin A. Burton
petite_abeille wrote:

On Apr 13, 2004, at 02:45, Kevin A. Burton wrote:

He mentioned that I might be able to squeeze 5-10% out of index 
merges this way.


Talking of which... what strategy(ies) do people use to minimize 
downtime when updating an index?

This should probably be a wiki page.

Anyway... two thoughts I had on the subject a while back:

You maintain two disks (not RAID... you get reliability through software).

Searches are load balanced between disks for performance reasons.  If 
one fails you just stop using it.

When you want to do an index merge you read from disk0 and write to 
disk1.  Then you take disk0 out of search rotation, add disk1, and 
copy the contents of disk1 back to disk0.  Users shouldn't notice much of 
a performance issue during the merge because it will be VERY fast and 
it's just reads from disk0.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: verifying index integrity

2004-04-12 Thread Kevin A. Burton
Doug Cutting wrote:

If you use this method, it is possible to corrupt things.  In 
particular, if you unlock an index that another process is modifying, 
then modify it, then these two processes might step on one another.  
So this method should only be called when you are certain that no one 
else is modifying the index.

We're handling this by using .pid files.  We use a standard initializer 
and our own lock files with process IDs.  If you're on UNIX I can 
give you the source to the JNI getpid that I created.  I've been meaning 
to open source this anyway... putting it into commons probably.

This way you can prevent multiple initialization if a java process is 
currently running that might be working with your index.  Otherwise 
there's no real way to be sure the lock isn't stale (unless time is a 
factor but that slows things down)

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI

2004-04-12 Thread Kevin A. Burton
Not sure if this is a bug or expected behavior. 

I took Doug's suggestion and migrated to a large BUFFER_SIZE of 1024^2 
.  He mentioned that I might be able to squeeze 5-10% out of index 
merges this way.

I'm not sure if this is expected behavior but this requires a LOT of 
memory.  Without this setting the VM only grows to about 200M ... As 
soon as I enable this my VM will go up to 1.5G and run out of memory 
(which is the max heap).

Our indexes aren't THAT big so I'm not sure if something's wrong here or 
if this is expected behavior.

If this is expected I'm not sure this is valuable.  There are other uses 
for that memory... perhaps just doing the whole merge in memory...

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Numeric field data

2004-04-04 Thread Kevin A. Burton
Stephane James Vaucher wrote:

Hi Tate,

There is a solution by Erik that pads numbers in the index.  That would 
allow you to search correctly.  I'm not sure about decimals, but you could 
always apply a multiplier.

Wonder if that should go in the FAQ... wiki...
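
The padding trick looks roughly like this; the width and multiplier are 
assumptions, not Erik's exact scheme:

import java.text.DecimalFormat;

public class NumberPad {
  private static final DecimalFormat FMT = new DecimalFormat("0000000000");

  // 42 -> "0000000042"; 3.14 with a x100 multiplier -> "0000000314".
  public static String pad(double value, int multiplier) {
    return FMT.format(Math.round(value * multiplier));
  }

  public static void main(String[] args) {
    System.out.println(pad(42, 1));     // 0000000042
    System.out.println(pad(3.14, 100)); // 0000000314
  }
}

Index the padded string as a keyword field and lexicographic term order 
then matches numeric order, so ordinary range queries behave correctly.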

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: Performance of hit highlighting and finding term positions for

2004-04-01 Thread Kevin A. Burton
[EMAIL PROTECTED] wrote:

730 msecs is the correct number for 10 * 16k docs with StandardTokenizer!
The 11ms-per-doc figure in my post was for highlighting using a
lower-case-filter-only analyzer; 5ms of that figure was the cost of the
analyzer itself.

73 msecs is the cost of JUST StandardTokenizer (no highlighting).
StandardAnalyzer uses StandardTokenizer so it is probably used in a lot of
apps.  It tries to keep certain text, e.g. email addresses, as one term.  I
can live without that and I suspect most apps can too.  I haven't looked
into why it's slow, but I notice it makes use of Vectors.  I think a lot of
people's highlighter performance issues may stem from this.

Looking at StandardTokenizer I can't see anything that would slow it 
down much... can we get the source to your lower-case filter?!
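
For reference, a lower-case-filter-only analyzer is presumably something 
along these lines, built from Lucene's stock LowerCaseTokenizer (a guess 
at the setup, not the poster's actual source; it's essentially what 
SimpleAnalyzer does):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class LowerCaseOnlyAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Splits on non-letters and lower-cases in one pass; none of
    // StandardTokenizer's email/host handling or Vector use.
    return new LowerCaseTokenizer(reader);
  }
}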

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: [patch] MultiSearcher should support getSearchables()

2004-03-31 Thread Kevin A. Burton
Erik Hatcher wrote:

No question that it'd be unwise to do.  We could make the same argument 
for giving everything public access and say it'd be stupid to override 
this method, but we made it public anyway.  I'd rather err on the side 
of safety.

Besides, you haven't provided a use case for why you need to get the 
searchers back from a MultiSearcher :)

Just ease of use really... I have our MultiSearcher reload transparently, 
and in this case I can verify that I'm using the right array of searchers, 
not one that's already been reloaded behind me.

I can add some code to preserve the original searcher array but it's a pain.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: RE : Performance of hit highlighting and finding term positions for a specific document

2004-03-31 Thread Kevin A. Burton




Rasik Pandey wrote:

Hello,

   I've been meaning to look into good ways to store token offset
   information to allow for very efficient highlighting, and I believe
   Mark may also be looking into improving the highlighter via other
   means such as temporary RAM indexes.  Search the archives to get a
   background on some of the ideas we've tossed around ('Dmitry's Term
   Vector stuff, plus some' and 'Demoting results' come to mind as
   threads that touch this topic).

It would be nice if CachingRewrittenQueryWrapper.java, which I sent to
lucene-dev (see below) last week, became part of these highlighting
efforts, if appropriate.  We use it to collect terms for a query that
searches multiple indices.

Actually I had to write one for my tests with the highlighter.  I'm 
using a MultiSearcher and a WildcardQuery, which the highlighter didn't 
support.

My impl was fairly basic so I wouldn't suggest a contribution... I'm 
sure yours is better.  The suggested changes to the highlighter for 
providing tokens would make this work well together.

Kevin

-- 

Please reply using PGP.

http://peerfear.org/pubkey.asc    

    NewsMonster - http://www.newsmonster.org/

Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
   AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





signature.asc
Description: OpenPGP digital signature


Re: Performance of hit highlighting and finding term positions for

2004-03-31 Thread Kevin A. Burton
Doug Cutting wrote:

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1413989 

According to these, if your documents average 16k, then a 10-hit 
result page would require just 66ms to generate highlights using 
SimpleAnalyzer.
The whole search takes only 300ms... this means that if I highlight 5 
docs I've doubled my search time.

Note that Google has a whole subsection of their cluster dedicated to 
keyword in context extraction.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: RE : Performance of hit highlighting and finding term positions for a specific document

2004-03-31 Thread Kevin A. Burton




Rasik Pandey wrote:

Kevin,

   http://home.clara.net/markharwood/lucene/highlight.htm

   Trying to do hit highlighting.  This implementation uses another
   Analyzer to find the positions for the result terms.

   This seems very inefficient since Lucene already knows the frequency
   and position of given terms in the index.

Can you explain in more detail what you mean here?

It uses the StandardAnalyzer again to re-analyze the document to find 
tokens... when it finds a token that matched the search request it 
highlights it.

It works... it's just not very efficient.

Kevin
-- 

Please reply using PGP.

http://peerfear.org/pubkey.asc

NewsMonster - http://www.newsmonster.org/

Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
   AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





signature.asc
Description: OpenPGP digital signature


Re: [patch] MultiSearcher should support getSearchables()

2004-03-30 Thread Kevin A. Burton
Erik Hatcher wrote:

On Mar 30, 2004, at 5:59 PM, Kevin A. Burton wrote:

Seems to only make sense to allow a caller to find the searchables a 
MultiSearcher was created with:


Could you elaborate on why it makes sense?  What if the caller changed 
a Searchable in the array?  Would anything bad happen?  (I don't know, 
haven't looked at the code).
Yes... something bad could happen... but doing that would be amazingly 
stupid... we should probably recommend that the array be treated as 
read-only.
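
One way to make the read-only recommendation concrete is to hand back a 
defensive copy; a sketch only, not the patch as submitted:

import org.apache.lucene.search.Searchable;

public class MultiSearcherSketch {
  private Searchable[] searchables;

  // Callers can inspect the searchers but not swap out the live array.
  public Searchable[] getSearchables() {
    Searchable[] copy = new Searchable[searchables.length];
    System.arraycopy(searchables, 0, copy, 0, searchables.length);
    return copy;
  }
}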

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: Performance of hit highlighting and finding term positions for a specific document

2004-03-30 Thread Kevin A. Burton
Erik Hatcher wrote:

On Mar 30, 2004, at 7:56 PM, Kevin A. Burton wrote:

Trying to do hit highlighting.  This implementation uses another 
Analyzer to find the positions for the result terms.
This seems very inefficient since Lucene already knows the frequency 
and position of given terms in the index.


What if the original analyzer removed stopped words, stemmed, and 
injected synonyms?
Just use the same analyzer :)... I agree it's not the best approach, for 
this reason and for the CPU cost.

Also it seems that after all this time Lucene should have efficient hit 
highlighting as a standard package.  Is there any interest in seeing a 
contribution in the sandbox for this if it uses the index positions?


Big +1, regardless of the implementation details.  Hit highlighting is 
so commonly requested that having it available at least in the sandbox, 
or perhaps even in the core, makes a lot of sense.

Well if we could make it efficient by using the frequency and positions 
of terms we're all set :)... I just need to figure out how to do this 
efficiently per document.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Performance of hit highlighting and finding term positions for a specific document

2004-03-30 Thread Kevin A. Burton
I'm playing with this package:

http://home.clara.net/markharwood/lucene/highlight.htm

Trying to do hit highlighting.  This implementation uses another 
Analyzer to find the positions for the result terms.

This seems very inefficient since Lucene already knows the frequency 
and position of given terms in the index.

My question is whether it's hard to find a TermPosition for a given term 
in a given document rather than for the whole index.

IndexReader.termPositions(Term term) is term-specific, not 
term-and-document specific.

Also it seems that after all this time Lucene should have efficient 
hit highlighting as a standard package.  Is there any interest in seeing 
a contribution in the sandbox for this if it uses the index positions?
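
For what it's worth, a sketch of pulling positions for one term in one 
document straight from the index, assuming TermDocs.skipTo() is available 
(the path, field, term, and doc id are made up):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class PositionsForDoc {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("/path/to/index");
    TermPositions tp = reader.termPositions(new Term("body", "lucene"));
    int docId = 42; // the hit you want to highlight
    if (tp.skipTo(docId) && tp.doc() == docId) {
      for (int i = 0; i < tp.freq(); i++) {
        System.out.println("position: " + tp.nextPosition());
      }
    }
    tp.close();
    reader.close();
  }
}

Note these are token positions, not character offsets, which is part of 
why re-analysis gets used for highlighting despite the cost.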

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


[patch] MultiSearcher should support getSearchables()

2004-03-30 Thread Kevin A. Burton
Seems to only make sense to allow a caller to find the searchables a 
MultiSearcher was created with:

$ diff -uN MultiSearcher.java.bak MultiSearcher.java
--- MultiSearcher.java.bak  2004-03-30 14:57:41.660109642 -0800
+++ MultiSearcher.java  2004-03-30 14:57:46.530330183 -0800
@@ -208,4 +208,8 @@
     return searchables[i].explain(query, doc - starts[i]); // dispatch to searcher
   }
 
+  public Searchable[] getSearchables() {
+    return searchables;
+  }
+
 }
--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster	



signature.asc
Description: OpenPGP digital signature


Re: BooleanQuery$TooManyClauses

2004-03-29 Thread Kevin A. Burton
hui wrote:

Hi,
I have a range query on dates like [20011201 TO 20040201], and it works fine
with Lucene API 1.3 RC1.  When I upgraded to 1.3 final I started getting a
"BooleanQuery$TooManyClauses" exception sometimes, no matter whether the
index was created by 1.3 RC1 or 1.3 final.  Checking the email archive, it
seems related to maxClauseCount.  Is increasing maxClauseCount the only way
to avoid this issue in 1.3 final?  The dev mailing list has some discussion
of future plans for this.

I've noticed the same problem...  The strange thing is that it only 
happens on some queries.  For example the query "blog" results in this 
exception but the query "linux" works just fine against my index.

This is the stacktrace if anyone's interested:

org.apache.lucene.search.BooleanQuery$TooManyClauses
   at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:109)
   at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:101)
   at 
org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:99)
   at 
org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
   at 
org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
   at 
org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:188)
   at org.apache.lucene.search.Query.weight(Query.java:120)
   at 
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:128)
   at 
org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:150)
   at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
   at org.apache.lucene.search.Hits.<init>(Hits.java:80)
   at org.apache.lucene.search.Searcher.search(Searcher.java:71)

For the record I'm also using a DateRange, but I disabled it and still noticed the same behavior.
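
Two sketches of the usual workarounds (the field name and dates are 
illustrative; in 1.3 the cap is the public static 
BooleanQuery.maxClauseCount field, and later versions add a setter):

import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DateFilter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class RangeWorkarounds {
  static Hits search(IndexSearcher searcher, Query query) throws Exception {
    // Option 1: raise the cap (the default is 1024).
    BooleanQuery.maxClauseCount = 4096;

    // Option 2: keep the date constraint out of the query entirely, so
    // it never rewrites into one BooleanQuery clause per matching term.
    DateFilter filter = new DateFilter("date",
        1007164800000L, 1075593600000L); // from/to in millis
    return searcher.search(query, filter);
  }
}

Raising the cap just defers the problem; the filter avoids the rewrite 
altogether, at the price of an un-scored date constraint.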

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Kevin A. Burton
Doug Cutting wrote:

How long is it taking to merge your 5GB index?  Do you have any stats 
about disk utilization during merge (seeks/second, bytes 
transferred/second)?  Did you try buffer sizes even larger than 1MB? 
Are you writing to a different disk, as suggested?
I'll do some more testing tonight and get back to you

Note that right now this var is final and not public... so that will 
probably need to change.


Perhaps.  I'm reticent to make it too easy to change this.  People 
tend to randomly tweak every available knob and then report bugs, or, 
if it doesn't crash, start recommending that everyone else tweak the 
knob as they do.  There are lots of tradeoffs with buffer size, cases 
that folks might not think of (like that a wildcard query creates a 
buffer for every term that matches), etc.
Or you can do what I do and recompile ;) 

Does it make sense to also increase the OutputStream.BUFFER_SIZE?  
This would seem to make sense since an optimize is a large number of 
reads and writes.


It might help a little if you're merging to the same disk as you're 
reading from, but probably not a lot.  If you're merging to a 
different disk then it shouldn't make much difference at all.

Right now we are merging to the same disk...  I'll perform some real 
benchmarks with this var too.  Long term we're going to migrate to using 
two SCSI disks per machine and then doing parallel queries across them 
with optimized indexes.

Also with modern disk controllers and filesystems I'm not sure how much 
difference this should make.  Both Reiser and XFS do a lot of internal 
buffering as does our disk controller.  I guess I'll find out...

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc    
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: Is RangeQuery more efficient than DateFilter?

2004-03-29 Thread Kevin A. Burton
Erik Hatcher wrote:

One more point... caching is done by the IndexReader used for the 
search, so you will need to keep that instance (i.e. the 
IndexSearcher) around to benefit from the caching.

Great... Damn... looked at the source of CachingWrapperFilter and it 
makes sense.  Thanks for the pointer.  The results were pretty amazing.  
Here are the numbers before and after; times are in millis:

Before caching the field (searching for "Jakarta"):
2238
1910
1899
1901
1904
1906

After caching the field:
2253
10
6
8
6
6
That's a HUGE difference :)

I'm very happy :)
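
Presumably the numbers came from something like the sketch below: build 
the filter once, wrap it so its BitSet is cached per IndexReader, and 
keep reusing the same IndexSearcher (the path, field, and date range are 
illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.DateFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

public class CachedFilterSearch {
  public static void main(String[] args) throws Exception {
    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    Filter filter = new CachingWrapperFilter(
        new DateFilter("date", 0L, System.currentTimeMillis()));
    for (int i = 0; i < 6; i++) { // pass 1 builds the BitSet; rest reuse it
      long start = System.currentTimeMillis();
      Hits hits = searcher.search(
          QueryParser.parse("jakarta", "body", new StandardAnalyzer()),
          filter);
      System.out.println(hits.length() + " hits in "
          + (System.currentTimeMillis() - start) + "ms");
    }
    searcher.close();
  }
}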

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster




signature.asc
Description: OpenPGP digital signature


Re: Tracking/Monitoring Search Terms in Lucene

2004-03-29 Thread Kevin A. Burton
Katie Lord wrote:

I am trying to figure out how to track the search terms that visitors are
using on our site on a monthly basis. Do you all have any suggestions?
 

Don't use Lucene for this... just have your search form log the terms as 
they're submitted.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature


Re: Lucene optimization with one large index and numerous small indexes.

2004-03-29 Thread Kevin A. Burton
Doug Cutting wrote:

One way to force larger read-aheads might be to pump up Lucene's input 
buffer size.  As an experiment, try increasing InputStream.BUFFER_SIZE 
to 1024*1024 or larger.  You'll want to do this just for the merge 
process and not for searching and indexing.  That should help you 
spend more time doing transfers with less wasted on seeks.  If that 
helps, then perhaps we ought to make this settable via system property 
or somesuch.

Good suggestion... seems about 10%-15% faster in a few strawman 
benchmarks I ran.

Note that right now this var is final and not public... so that will 
probably need to change.  Does it make sense to also increase the 
OutputStream.BUFFER_SIZE?  This would seem to make sense since an 
optimize is a large number of reads and writes.  

I'm obviously willing to throw memory at the problem

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



signature.asc
Description: OpenPGP digital signature

