Re: 1.4.x TermInfosWriter.indexInterval not public static ?
Doug Cutting wrote: The default value is probably good for all but folks with very large indexes, who may wish to increase the default somewhat. Also, folks with smaller indexes and very high query volumes may wish to decrease it. It's a classic time/memory tradeoff: higher values use less memory and make searches a bit slower; smaller values use more memory and make searches a bit faster.

BTW, can you define "a bit"? Is "a bit" 5%? 10%? Benchmarks would be nice, but I'm not that picky. I just want to see what performance hits/benefits I could see by tweaking the values.

Kevin
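For what it's worth, here is the kind of micro-benchmark I had in mind: a minimal sketch, assuming two copies of the same index built with interval 128 and 256 (the paths and the "contents" field are placeholders), timing repeated searches through the stock IndexSearcher API.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class IntervalBench {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths: same documents, indexed with interval 128 vs. 256.
        String[] indexes = { "/index-interval-128", "/index-interval-256" };
        for (int i = 0; i < indexes.length; i++) {
            IndexSearcher searcher = new IndexSearcher(indexes[i]);
            Query query = new TermQuery(new Term("contents", "lucene"));
            long start = System.currentTimeMillis();
            for (int j = 0; j < 1000; j++) {   // repeat to average out noise
                Hits hits = searcher.search(query);
                hits.length();                 // touch the result so the search runs
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(indexes[i] + ": " + elapsed + "ms for 1000 searches");
            searcher.close();
        }
    }
}

Comparing the two timings (and the heap after warm-up) would put a percentage on "a bit" for a given index.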
1.4.x TermInfosWriter.indexInterval not public static ?
What's the desired pattern of use for TermInfosWriter.indexInterval? Do I have to compile my own version of Lucene to change this? In the previous API this was public static final, but now it's neither public nor static. I'm wondering if we should just make this a value that can be set at runtime. Considering the memory savings for larger installs this can/will be important.

Kevin
Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??
Doug Cutting wrote: Not without hacking things. If your 1.3 indexes were generated with 256 then you can modify your version of Lucene 1.4+ to use 256 instead of 128 when reading a Lucene 1.3 format index (SegmentTermEnum.java:54 today). Prior to 1.4 this was a constant, hardwired into the index format. In 1.4 and later each index segment stores this value as a parameter. So once 1.4 has re-written your index you'll no longer need a modified version.

Thanks for the feedback, Doug. This makes more sense now. I didn't understand why the website documented the fact that the .tii file was storing the index interval. I think I'm going to investigate just moving to 1.4... I need to do it anyway. Might as well bite the bullet now.

Kevin
Re: ngramj
petite_abeille wrote: On Feb 24, 2005, at 14:50, Gusenbauer Stefan wrote: Does anyone know a good tutorial or the javadoc for ngramj, because I need it for guessing the language of the documents which should be indexed? http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/

Wow... interesting! Where'd this come from? I actually wrote an implementation of NGram language categorization a while back. I'll have to check this out. I'm willing to bet mine's better though ;) I was going to put it in Jakarta Commons...

Kevin
Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??
Doug Cutting wrote: Kevin A. Burton wrote: I finally had some time to take Doug's advice and reburn our indexes with a larger TermInfosWriter.INDEX_INTERVAL value. It looks like you're using a pre-1.4 version of Lucene. Since 1.4 this is no longer called TermInfosWriter.INDEX_INTERVAL, but rather TermInfosWriter.indexInterval.

Yes... we're trying to be conservative and haven't migrated yet, though doing so might be required for this move I think...

Is this setting incompatible with older indexes burned with the lower value? Prior to 1.4, yes. After 1.4, no.

What happens after 1.4? Can I take indexes burned with 256 (a greater value) in 1.3 and open them up correctly with 1.4?

Kevin

PS. Once I get this working I'm going to create a wiki page documenting this process.
Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??
Kevin A. Burton wrote: Kevin A. Burton wrote: I finally had some time to take Doug's advice and reburn our indexes with a larger TermInfosWriter.INDEX_INTERVAL value. You know... it looks like the problem is that TermInfosReader uses INDEX_INTERVAL during seeks and is probably just jumping RIGHT past the offsets that I need.

I guess I'm thinking out loud here... It looks like the only thing written to the .tii index for meta info is the "size" of the index. It's an int and it's the first int of the stream (which is reserved). Now I'm curious if there's any other way I can infer this value.

Kevin
Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??
Kevin A. Burton wrote: I finally had some time to take Doug's advice and reburn our indexes with a larger TermInfosWriter.INDEX_INTERVAL value. You know... it looks like the problem is that TermInfosReader uses INDEX_INTERVAL during seeks and is probably just jumping RIGHT past the offsets that I need.

If this is going to be a practical way of reducing Lucene's memory footprint for HUGE indexes then it's going to need a way to change this value based on the current index that's being opened. Is there any way to determine the INDEX_INTERVAL from the file? According to http://jakarta.apache.org/lucene/docs/fileformats.html the .tis file (which the docs say the .tii file "is very similar to") should have this data:

TermInfoFile (.tis) --> TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos

The only problem is that the .tii and .tis files I have on disk don't have a constant preamble, and it doesn't look like there's an index interval here...

Kevin
Re: Lucene vs. in-DB-full-text-searching
David Sitsky wrote: On Sat, 19 Feb 2005 09:31, Otis Gospodnetic wrote: You are right. Since there are C++ and now C ports of Lucene, it would be interesting to integrate them directly with DBs, so that the RDBMS full-text search under the hood is actually powered by one of the Lucene ports. Or to see Lucene + Derby (the 100% Java embedded database donated by IBM, currently in Apache incubation) integrated together... that would be really nice and powerful. Does anyone know if there are any integration plans?

Don't forget Berkeley DB Java Edition... that would be interesting too...

Kevin
Re: Lucene vs. in-DB-full-text-searching
Otis Gospodnetic wrote: The most obvious answer is that the full-text indexing features of RDBMSs are not as good (as fast) as Lucene. MySQL, PostgreSQL, Oracle, MS SQL Server etc. all have full-text indexing/searching features, but I always hear people complaining about the speed. A person from a well-known online bookseller told me recently that Lucene was about 10x faster than MySQL for full-text searching, and I am currently helping someone get away from MySQL and into Lucene for performance reasons.

Also... MySQL full-text search isn't perfect. If you're not a Java programmer it would be difficult to hack on. Another downside is that full-text search in MySQL only works with MyISAM tables, which aren't transaction aware and use global table locks (not fun). I'm sure though that MySQL would do a better job at online index maintenance than Lucene; Lucene falls down a bit in this area...

Kevin
Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??
I finally had some time to take Doug's advice and reburn our indexes with a larger TermInfosWriter.INDEX_INTERVAL value. The default is 128 but I increased it to 256, burned our indexes again, and was lucky enough to notice that our memory usage dropped in half.

This introduced a bug, however: when we try to load our pages before and after, we're missing 99% of the documents from our index. What happens is that we have a term -> key mapping so that we can pull out documents based on essentially a primary key. The key is just the URL of the document. With the default value it works fine, but when I change it to 256 it can't find the majority of the documents. In fact it's only able to find one.

Is this setting incompatible with older indexes burned with the lower value?

Kevin
Re: Opening up one large index takes 940M of memory?
Doug Cutting wrote: Kevin A. Burton wrote: Is there any way to reduce this footprint? The index is fully optimized... I'm willing to take a performance hit if necessary. Is this documented anywhere? You can increase TermInfosWriter.indexInterval. You'll need to re-write the .tii file for this to take effect. The simplest way to do this is to use IndexWriter.addIndexes(), adding your index to a new, empty directory. This will of course take a while for a 60GB index...

(Note: when this works I'll note my findings in a wiki page for future developers.)

Two more questions:

1. Do I have to do this with a NEW directory? Our nightly index merger uses an existing "target" index, which I assume will re-use the same settings as before. I did this last night and it still seems to use the same amount of memory. Above you assert that I should use a new, empty directory, so I'll try that tonight.

2. This isn't destructive, is it? I mean, I'll be able to move BACK to a TermInfosWriter.indexInterval of 128, right?

Thanks!

Kevin
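For the archives, here is a minimal sketch of the rewrite Doug describes, assuming a build where TermInfosWriter.indexInterval has already been changed to the desired value (the paths and analyzer are placeholders, not our real setup):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RewriteWithNewInterval {
    public static void main(String[] args) throws Exception {
        // The existing index and a NEW, EMPTY target directory.
        Directory oldDir = FSDirectory.getDirectory("/var/ksa/index-old", false);
        Directory newDir = FSDirectory.getDirectory("/var/ksa/index-new", true);

        // Writing into the empty directory regenerates the .tii/.tis files
        // with whatever indexInterval this build of Lucene uses.
        IndexWriter writer = new IndexWriter(newDir, new StandardAnalyzer(), true);
        writer.addIndexes(new Directory[] { oldDir });
        writer.optimize();
        writer.close();
    }
}

The old index isn't touched, so moving back to 128 should just be a matter of repeating the rewrite with the original setting.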
DbDirectory and Berkeley DB Java Edition...
I'm reading the Lucene in Action book right now, and on page 309 they talk about using DbDirectory, which uses Berkeley DB for maintaining your index. Anyone ever consider a port to Berkeley DB Java Edition? The only downside would be the license (I think it's GPL), but it could really free up the time it takes to optimize() I think. You could just rehash the hashtable and then insert rows into the new table. Would be interesting to benchmark, I think. Thoughts?

http://www.sleepycat.com/products/je.shtml
Re: Opening up one large index takes 940M of memory?
Otis Gospodnetic wrote: It would be interesting to know _what_exactly_ uses your memory. Running under a profiler should tell you that. The only thing that comes to mind is... can't remember the details now, but when the index is opened, I believe every 128th term is read into memory. This, I believe, helps with index seeks at search time. I wonder if this is what's using your memory. The number '128' can't be modified just like that, but somebody (Julien?) has modified the code in the past to make this variable. That's the only thing I can think of right now and it may or may not be an idea in the right direction.

I loaded it into a profiler a long time ago. Most of the memory was due to Term classes being loaded into memory. I might try to get some time to load it into a profiler on Monday...

Kevin
Re: Opening up one large index takes 940M of memory?
Chris Hostetter wrote: : We have one large index right now... its about 60G ... When I open it : the Java VM used 940M of memory. The VM does nothing else besides open Just out of curiosity, have you tried turning on the verbose gc log, and putting in some thread sleeps after you open the reader, to see if the memory footprint "settles down" after a little while? You're currently checking the memory usage immediately after opening the index, and some of that memory may be used holding transient data that will get freed up after some GC iterations.

Actually I haven't, but to be honest the numbers seem dead on. The VM heap wouldn't reallocate if it didn't need that much memory, and this is almost exactly the behavior I'm seeing in production. Though I guess it wouldn't hurt ;)

Kevin
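For reference, a small sketch of the measurement Chris suggests (run with -verbose:gc); the explicit System.gc()/sleep loop is just a blunt way to let transient allocations settle before reading the heap numbers, and the path is the one from my original mail:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SettledMemory {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/var/ksa/index-1078106952160/", false);
        IndexReader ir = IndexReader.open(dir);

        for (int i = 0; i < 5; i++) {     // give the collector a few passes
            System.gc();
            Thread.sleep(10 * 1000);
            long used = Runtime.getRuntime().totalMemory()
                      - Runtime.getRuntime().freeMemory();
            System.out.println("used after pass " + (i + 1) + ": " + used + " bytes");
        }
        ir.close();
    }
}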
Re: Opening up one large index takes 940M of memory?
Paul Elschot wrote: This would be similar to the way the MySQL index cache works... It would be possible to add another level of indexing to the terms. No one has done this yet, so I guess it's preferred to buy RAM instead...

The problem I think for everyone right now is that 32 bits just doesn't cut it in production systems... 2G of memory per process and you really start to feel it.

Kevin
Re: Opening up one large index takes 940M of memory?
Kevin A. Burton wrote: We have one large index right now... it's about 60G... When I open it the Java VM used 940M of memory. The VM does nothing else besides open this index.

After thinking about it I guess 1.5% of memory per index really isn't THAT bad. What would be nice is if there were a way to do this from disk and then use a buffer (either via the filesystem or in-VM memory) to access these variables. This would be similar to the way the MySQL index cache works...

Kevin
Opening up one large index takes 940M of memory?
We have one large index right now... it's about 60G... When I open it the Java VM used 940M of memory. The VM does nothing else besides open this index. Here's the code:

System.out.println( "opening..." );

long before = System.currentTimeMillis();

Directory dir = FSDirectory.getDirectory( "/var/ksa/index-1078106952160/", false );
IndexReader ir = IndexReader.open( dir );

System.out.println( ir.getClass() );

long after = System.currentTimeMillis();

System.out.println( "opening...done - duration: " + (after-before) );

System.out.println( "totalMemory: " + Runtime.getRuntime().totalMemory() );
System.out.println( "freeMemory: " + Runtime.getRuntime().freeMemory() );

Is there any way to reduce this footprint? The index is fully optimized... I'm willing to take a performance hit if necessary. Is this documented anywhere?

Kevin
Re: Unable to read TLD "META-INF/c.tld" from JAR file ... standard.jar
Otis Gospodnetic wrote: Most definitely Jetty. I can't believe you're using Tomcat for Rojo! ;)

I never said we were using Tomcat for Rojo ;)

Sorry about that btw... wrong list!
Unable to read TLD "META-INF/c.tld" from JAR file ... standard.jar
What in the world is up with this exception? We've migrated to using pre-compiled JSPs in Tomcat 5.5 for performance reasons, but if I try to start with a FRESH webapp, or try to update any of the JSPs in place and recompile, I'll get this error. Any ideas? I thought maybe the .jar files were corrupt, but if I md5sum them they are identical to production and the Tomcat standard dist. Thoughts?

org.apache.jasper.JasperException: /subscriptions/index.jsp(1,1) /init.jsp(2,0) Unable to read TLD "META-INF/c.tld" from JAR file "file:/usr/local/jakarta-tomcat-5.5.4/webapps/rojo/ROOT/WEB-INF/lib/standard.jar": org.apache.jasper.JasperException: Failed to load or instantiate TagLibraryValidator class: org.apache.taglibs.standard.tlv.JstlCoreTLV
    org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:39)
    org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:405)
    org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:86)
    org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:339)
    org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:372)
    org.apache.jasper.compiler.Parser.parseDirective(Parser.java:475)
    org.apache.jasper.compiler.Parser.parseElements(Parser.java:1539)
    org.apache.jasper.compiler.Parser.parse(Parser.java:126)
    org.apache.jasper.compiler.ParserController.doParse(ParserController.java:211)
    org.apache.jasper.compiler.ParserController.parse(ParserController.java:100)
    org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:146)
    org.apache.jasper.compiler.Compiler.compile(Compiler.java:286)
    org.apache.jasper.compiler.Compiler.compile(Compiler.java:267)
    org.apache.jasper.compiler.Compiler.compile(Compiler.java:255)
    org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:556)
    org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:296)
    org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
    org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
Reloading LARGE index causes OutOfMemory... intern Terms?
We optimize one of our main indexes nightly; it takes up about 70% of system memory when loaded, due to Term objects being stored in memory. We perform the optimization out of process and then tell Tomcat to reload its index. This causes us to open the index again, which would need 140% of system memory and so causes an OutOfMemory exception.

What's the best way to handle this? Do I open the index again, or is there a better way to tell Lucene that I'm reloading an existing index so it uses less memory? Is it possible to intern the Term objects so that I only have one term per virtual machine instead of one per index?

Kevin
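For context, here is a sketch of the reload pattern I'm describing, with names made up for illustration; as written it closes the old reader before opening the new one, which keeps only one set of Term objects live at a time, but it assumes in-flight searches against the old searcher have drained first:

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
    private IndexSearcher current;

    public synchronized IndexSearcher get() {
        return current;
    }

    // Called after the out-of-process optimize finishes and the new index
    // has been copied into place.
    public synchronized void reload(String indexPath) throws IOException {
        if (current != null) {
            current.close();   // release the old reader's Terms first
            current = null;    // let the old reader become collectable
        }
        System.gc();           // optional: encourage the old heap to be reclaimed
        current = new IndexSearcher(indexPath);
    }
}

The obvious downside is a brief window with no open searcher, which is exactly the tradeoff I'd like to avoid.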
Re: How to index Windows' Compiled HTML Help (CHM) Format
Tom wrote: Hi, does anybody know how to index chm-files? A possible solution I know is to convert chm-files to pdf-files (there are converters available for this job) and then use the known tools (e.g. PDFBox) to index the content of the pdf files (which contain the content of the chm-files). Are there any tools which can directly grab the textual content out of the (binary) chm-files? I think chm-file indexing support is really a big missing piece in the currently supported indexable filetype collection (XML, HTML, PDF, MS Word DOC, RTF, plaintext).

I believe it's just a Microsoft .cab file with an index.html inside it... am I right? Just uncompress it. The problem is that the HTML within them isn't anywhere NEAR standard and you can't really give it to the user in the UI...

Kevin
Re: lucene in action ebook
Erik Hatcher wrote: I have the e-book PDF in my possession. I have been prodding Manning on a daily basis to update the LIA website and get the e-book available. It is ready, and I'm sure that it's just a matter of them pushing it out. There may be some administrative loose ends they are tying up before releasing it to the world. It should be available any minute now, really. :)

Send a link to the list when it's out... We're all holding our breath ;) (seriously)

Kevin
Re: JDBCDirectory to prevent optimize()?
Erik Hatcher wrote: Also, there is a DBDirectory in the sandbox to store a Lucene index inside Berkeley DB.

I assume this would prevent prefix queries from working...

Kevin
JDBCDirectory to prevent optimize()?
It seems that, when compared to other datastores, Lucene starts to fall down. For example, Lucene doesn't perform online index optimizations, so if you add 10 documents you have to run optimize() again, and this isn't exactly a fast operation.

I'm wondering about the potential for a generic JDBCDirectory for keeping the Lucene index within a database. It sounds somewhat unconventional, but it would allow you to perform live addDirectory updates without running optimize() again. Has anyone looked at this? How practical would it be?

Kevin
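To make the question a bit more concrete, here is a rough sketch of the mapping I have in mind, with a made-up table name; each Lucene "file" becomes a row, and a Directory implementation would translate Lucene's file operations (list, openFile, createFile, deleteFile, renameFile, fileLength, fileModified, makeLock, ...) onto SQL against it:

public class JdbcDirectorySketch {
    // One row per Lucene "file" (e.g. "_a.tis", "segments").
    public static final String SCHEMA =
        "CREATE TABLE lucene_files (" +
        "  name     VARCHAR(255) PRIMARY KEY, " +   // the Lucene file name
        "  modified BIGINT NOT NULL, " +            // backs fileModified()/touchFile()
        "  length   BIGINT NOT NULL, " +            // backs fileLength()
        "  content  BLOB " +                        // the raw bytes of the file
        ")";
}

Whether reads through BLOBs could ever compete with the filesystem is exactly the part that would need benchmarking.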
Re: Index in RAM - is it really worthy?
Otis Gospodnetic wrote: For the Lucene book I wrote some test cases that compare FSDirectory and RAMDirectory. What I found was that with certain settings FSDirectory was almost as fast as RAMDirectory. Personally, I would push FSDirectory and hope that the OS and the filesystem do their share of work and caching for me before looking for ways to optimize my code.

Another note: doing an index merge in memory is probably faster if you just use a RAMDirectory and perform addIndexes() into it. This would almost certainly be faster than optimizing on disk, but I haven't benchmarked it.

Kevin
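The kind of thing I mean, as an untested sketch (paths and analyzer are placeholders): merge the on-disk indexes into a RAMDirectory, then write the merged result back out in one pass.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamMerge {
    public static void main(String[] args) throws Exception {
        Directory[] sources = {
            FSDirectory.getDirectory("/index/part-1", false),
            FSDirectory.getDirectory("/index/part-2", false)
        };

        // Merge in memory; the whole merged index has to fit in the heap.
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
        ramWriter.addIndexes(sources);
        ramWriter.close();

        // Copy the merged result back to disk.
        Directory target = FSDirectory.getDirectory("/index/merged", true);
        IndexWriter out = new IndexWriter(target, new StandardAnalyzer(), true);
        out.addIndexes(new Directory[] { ramDir });
        out.close();
    }
}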
Re: Index in RAM - is it really worthy?
Otis Gospodnetic wrote: For the Lucene book I wrote some test cases that compare FSDirectory and RAMDirectory. What I found was that with certain settings FSDirectory was almost as fast as RAMDirectory. Personally, I would push FSDirectory and hope that the OS and the filesystem do their share of work and caching for me before looking for ways to optimize my code.

Yes... I performed the same benchmark, and in my situation RAMDirectory for searches was about 2% slower. I'm willing to bet that it has to do with the fact that it's a Hashtable and not a HashMap (which isn't synchronized). Also, adding a constructor that takes the term count could make loading a RAMDirectory faster, since you could prevent rehashing.

If you're on a modern machine your filesystem cache will end up buffering your disk anyway, which I'm sure was happening in my situation.

Kevin
Mozilla Desktop Search
http://www.peerfear.org/rss/permalink/2004/11/13/MozillaDesktopSearch/

The Mozilla Foundation may be considering a desktop search implementation <http://computerworld.com/developmenttopics/websitemgmt/story/0,10801,97396,00.html?f=x10>: Having launched the much-awaited Version 1.0 of the Firefox browser yesterday (see story), The Mozilla Foundation is busy planning enhancements to the open-source product, including the possibility of integrating it with a variety of desktop search tools. The Mozilla Foundation also wants to place Firefox in PCs through reseller deals with PC hardware vendors and continue to sharpen the product's pop-up ad-blocking technology.

I'm not sure this is a good idea. Maybe it is, though. The technology just isn't there for cross-platform search. I'd have to suggest using Lucene, but using GCJ for a native compile into XPCOM components, and I'm not sure if GCJ is up to the job here. If this approach is possible then I'd be very excited.

One advantage to this approach is that an HTTP server wouldn't be necessary since you're already within the browser. Good for everyone involved: no bloated Tomcat causing problems, and blazingly fast access within the browser. Also, since TCP isn't involved you could gracefully fail when the search service isn't running; you could just start it.
Lucene external field storage contribution.
About 3 months ago I developed an external storage engine which ties into Lucene. I'd like to discuss making a contribution so that this is integrated into a future version of Lucene. I'm going to paste my original PROPOSAL in this email. There wasn't a ton of feedback the first time around, but I figure the squeaky wheel gets the grease... I created this proposal because we need this fixed at work. I want to go ahead and work on a vertical fix for our version of Lucene and then submit this back to Jakarta. There seems to be a lot of interest here and I wanted to get feedback from the list before moving forward... Should I put this in the wiki?!

Kevin

** OVERVIEW **

Currently Lucene supports 'stored fields' where the content of these fields is kept within the Lucene index for use in the future. While acceptable for small indexes, larger amounts of stored fields prevent:

- Fast index merges, since the full content has to be continually merged.
- Storing the indexes in memory (since a LOT of memory would be required and this is cost prohibitive).
- Fast queries, since block caching can't be used on the index data.

For example, in our current setup our index size is 20G. Nearly 90% of this is content. If we could store the content outside of Lucene our merges and searches would be MUCH faster. If we could store the index in MEMORY this could be orders of magnitude faster.

** PROPOSAL **

Provide an external field storage mechanism which supports legacy indexes without modification. Content is stored in a "content segment". The only changes would be a field with 3 (or 4 if checksums are enabled) values:

- CS_SEGMENT: Logical ID of the content segment. This is an integer value. There is a global Lucene property named CS_ROOT which stores all the content. The segments are just flat files with pointers. Segments are broken into logical pieces by time and size; usually 100M of content would be in one segment.
- CS_OFFSET: The byte offset of the field.
- CS_LENGTH: The length of the field.
- CS_CHECKSUM: Optional checksum to verify that the content is correct when fetched from the index.

The field value here would be exactly 'N:O:L' where N is the segment number, O is the offset, and L is the length. O and L are 64-bit values. N is a 32-bit value (though 64-bit wouldn't really hurt).

This mechanism allows for the external storage of any named field. CS_OFFSET and CS_LENGTH allow use with RandomAccessFile and the new NIO code for efficient content lookup (though filehandle caching should probably be used). Since content is broken into logical 100M segments, the underlying filesystem can organize the file into contiguous blocks for efficient, non-fragmented lookup. File manipulation is easy, and indexes can be merged by simply concatenating the second file to the end of the first (though the segment, offset, and length need to be updated). (FIXME: I think I need to think about this more since I will have < 100M per syncs)

Supporting full Unicode is important. Full java.lang.String storage is used with String.getBytes(), so we should be able to avoid Unicode issues. If Java has a correct java.lang.String representation, it's possible to easily add Unicode support just by serializing the byte representation. (Note that the JDK says that the DEFAULT system char encoding is used, so if this is ever changed it might break the index.)

While Linux and modern versions of Windows (not sure about OS X) support 64-bit filesystems, the 4G storage boundary of 32-bit filesystems (ext2 is an example) is an issue. Using smaller indexes can prevent this, but eventually segment lookup in the filesystem will be slow. This will only happen within terabyte storage systems, so hopefully the developer has migrated to another (modern) filesystem such as XFS.

** FEATURES **

- Must be able to replicate indexes easily to other hosts.
- Adding content to the index must be CHEAP.
- Deletes need to be cheap (these are cheap for older content; just discard older indexes).
- Filesystem needs to be able to optimize storage.
- Must support UNICODE and binary content (images, blobs, byte arrays, serialized objects, etc).
- Filesystem metadata operations should be fast. Since content is kept in LARGE indexes this is easy to avoid.
- Migration to the new system from legacy indexes should be fast and painless for future developers.
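To make the pointer handling concrete, here is a small illustrative sketch (the field encoding matches the 'N:O:L' proposal above; the segment file naming is made up): the stored Lucene field carries only the pointer, and the content itself is read back with RandomAccessFile.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ContentSegmentPointer {
    // Encode the proposed 'N:O:L' value that would be stored in the Lucene field.
    public static String encode(int segment, long offset, long length) {
        return segment + ":" + offset + ":" + length;
    }

    // Fetch the raw bytes the pointer refers to from a content segment under CS_ROOT.
    public static byte[] fetch(File csRoot, String pointer) throws IOException {
        String[] parts = pointer.split(":");
        int segment = Integer.parseInt(parts[0]);
        long offset = Long.parseLong(parts[1]);
        int length = (int) Long.parseLong(parts[2]);

        File segmentFile = new File(csRoot, "segment-" + segment + ".cs");  // hypothetical naming
        RandomAccessFile raf = new RandomAccessFile(segmentFile, "r");
        try {
            byte[] buf = new byte[length];
            raf.seek(offset);
            raf.readFully(buf);
            return buf;
        } finally {
            raf.close();
        }
    }
}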
Ability to apply document age with the score?
Let's say I have an index with two documents. They both have the same score, but one was added 6 months ago and the other was added 2 minutes ago. I want the score adjusted based on age so that older documents have a lower score. I don't want to sort by document age (date) because if one document is older but has a HIGHER score it would be better to have it rise above newer documents that have a lower score.

Is this possible? The only way I could think of doing it would be to have a DateFilter and then apply a dampening after the query.

Kevin
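Roughly what I mean by "apply a dampening after the query", as an illustrative sketch only (the "date" field name, the DateField encoding, and the 30-day decay constant are all assumptions): compute a combined value per hit, then re-sort on that yourself.

import org.apache.lucene.document.DateField;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

public class AgeDampening {
    // Lucene's score scaled down as the document ages.
    public static float dampenedScore(Hits hits, int i) throws java.io.IOException {
        Document doc = hits.doc(i);
        // Assumes each document stores its add time in a DateField-encoded "date" field.
        long added = DateField.stringToTime(doc.get("date"));
        long ageDays = (System.currentTimeMillis() - added) / (24L * 60 * 60 * 1000);
        float dampening = 1.0f / (1.0f + ageDays / 30.0f);   // arbitrary decay constant
        return hits.score(i) * dampening;
    }
}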
Lots Of Interest in Lucene Desktop
I've made a few passive mentions of my Lucene <http://jakarta.apache.org/lucene> Desktop prototype here on PeerFear in the last few days and I'm amazed how much feedback I've had. People really want to start work on an Open Source desktop search based on Lucene.

http://www.peerfear.org/rss/permalink/2004/10/28/LotsOfInterestInLuceneDesktop/
Documents with 1 word are given unfair lengthNorm()
WRT my blog post: it seems the problem is that the distribution for lengthNorm() starts at 1 and moves down from there. Returning 1.0f would work, but then HUGE documents wouldn't be normalized and so would distort the results. What would you think of using this implementation for lengthNorm?

public float lengthNorm( String fieldName, int numTokens ) {

    int THRESHOLD = 50;

    int nt = numTokens;

    if ( numTokens <= THRESHOLD )
        ++nt;

    if ( numTokens > THRESHOLD )
        nt -= THRESHOLD;

    float v = (float)(1.0 / Math.sqrt( nt ));

    if ( numTokens <= THRESHOLD )
        v = 1 - v;

    return v;

}

This starts the distribution low, approaches 1.0 when 50 terms are in the document, then asymptotically moves toward zero from there on out based on sqrt. For example, values from 1 -> 150 yield (I'd graph this out but I'm too lazy):

1 - 0.29289323 2 - 0.42264974 3 - 0.5 4 - 0.5527864 5 - 0.5917517 6 - 0.6220355 7 - 0.6464466 8 - 0.666 9 - 0.6837722 10 - 0.69848865 11 - 0.7113249 12 - 0.72264993 13 - 0.73273873 14 - 0.74180114 15 - 0.75 16 - 0.7574644 17 - 0.7642977 18 - 0.7705843 19 - 0.7763932 20 - 0.7817821 21 - 0.7867993 22 - 0.7914856 23 - 0.79587585 24 - 0.8 25 - 0.80388385 26 - 0.8075499 27 - 0.81101775 28 - 0.81430465 29 - 0.81742585 30 - 0.8203947 31 - 0.8232233 32 - 0.82592237 33 - 0.8285014 34 - 0.83096915 35 - 0.833 36 - 0.83560103 37 - 0.83777857 38 - 0.8398719 39 - 0.8418861 40 - 0.84382623 41 - 0.8456966 42 - 0.8475014 43 - 0.84924436 44 - 0.8509288 45 - 0.852558 46 - 0.85413504 47 - 0.85566247 48 - 0.85714287 49 - 0.8585786 50 - 0.859972 51 - 1.0 52 - 0.70710677 53 - 0.57735026 54 - 0.5 55 - 0.4472136 56 - 0.4082483 57 - 0.37796447 58 - 0.35355338 59 - 0.3334 60 - 0.31622776 61 - 0.30151135 62 - 0.28867513 63 - 0.2773501 64 - 0.26726124 65 - 0.2581989 66 - 0.25 67 - 0.24253562 68 - 0.23570226 69 - 0.22941573 70 - 0.2236068 71 - 0.2182179 72 - 0.21320072 73 - 0.2085144 74 - 0.20412415 75 - 0.2 76 - 0.19611613 77 - 0.19245009 78 - 0.18898223 79 - 0.18569534 80 - 0.18257418 81 - 0.1796053 82 - 0.17677669 83 - 0.17407766 84 - 0.17149858 85 - 0.16903085 86 - 0.1667 87 - 0.16439898 88 - 0.16222142 89 - 0.16012815 90 - 0.15811388 91 - 0.15617377 92 - 0.15430336 93 - 0.15249857 94 - 0.15075567 95 - 0.1490712 96 - 0.14744195 97 - 0.145865 98 - 0.14433756 99 - 0.14285715 100 - 0.14142136 101 - 0.14002801 102 - 0.13867505 103 - 0.13736056 104 - 0.13608277 105 - 0.13483997 106 - 0.13363062 107 - 0.13245323 108 - 0.13130644 109 - 0.13018891 110 - 0.12909944 111 - 0.12803689 112 - 0.12700012 113 - 0.12598816 114 - 0.125 115 - 0.12403473 116 - 0.12309149 117 - 0.12216944 118 - 0.12126781 119 - 0.120385855 120 - 0.11952286 121 - 0.11867817 122 - 0.11785113 123 - 0.11704115 124 - 0.11624764 125 - 0.11547005 126 - 0.114707865 127 - 0.11396058 128 - 0.1132277 129 - 0.11250879 130 - 0.1118034 131 - 0. 132 - 0.11043153 133 - 0.10976426 134 - 0.10910895 135 - 0.10846523 136 - 0.107832775 137 - 0.107211255 138 - 0.10660036 139 - 0.10599979 140 - 0.10540926 141 - 0.104828484 142 - 0.1042572 143 - 0.10369517 144 - 0.10314213 145 - 0.10259783 146 - 0.10206208 147 - 0.10153462 148 - 0.101015255 149 - 0.10050378

Kevin
Re: Poor Lucene Ranking for Short Text
Daniel Naber wrote: (Kevin complains about shorter documents being ranked higher) This is something that can easily be fixed. Just use a Similarity implementation that extends DefaultSimilarity and that overwrites lengthNorm: just return 1.0f there. You need to use that Similarity for indexing and searching, i.e. it requires reindexing.

What happens when I do this with an existing index? I don't want to have to rewrite this index as it will take FOREVER. If the current behavior is all that happens, this is fine... this way I can just get the new behavior for documents that are added from now on. Also... why isn't this the default?

Kevin
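For reference, a minimal sketch of the Similarity Daniel describes (the class name is mine); it has to be set on both the writer and the searcher, and per his note it only fully applies to documents indexed with it:

import org.apache.lucene.search.DefaultSimilarity;

// Disables length normalization so short fields aren't favored.
public class NoLengthNormSimilarity extends DefaultSimilarity {
    public float lengthNorm(String fieldName, int numTokens) {
        return 1.0f;
    }
}

// Usage (sketch):
//   writer.setSimilarity(new NoLengthNormSimilarity());
//   searcher.setSimilarity(new NoLengthNormSimilarity());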
Poor Lucene Ranking for Short Text
http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForShortText/
Google Desktop Could be Better
http://www.peerfear.org/rss/permalink/2004/10/15/GoogleDesktopCouldBeBetter/
Prevent Lucene from returning short length text...
I've noticed that Lucene does a very bad job at search ranking when the text has just a few words in the body. For example, if you searched for the word "World" in the following two paragraphs:

"Hello World"

and

"The World is often a dangerous place"

the first paragraph would probably rank first. Is there a way I can tweak Lucene to return richer content?

Kevin
Re: OutOfMemory example
Jiří Kuhn wrote: Hi, I think I can reproduce the memory leaking problem while reopening an index. The Lucene version tested is 1.4.1; version 1.4 final works OK. My JVM is: $ java -version java version "1.4.2_05" Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04) Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode) The code you can test is below; there are only 3 iterations for me if I use -Xmx5m, the 4th fails.

At least this test seems tied to the Sort API... I removed the sort under Lucene 1.3 and it worked fine...

Kevin
Re: OptimizeIt -- Re: force gc idiom - Re: OutOfMemory example
David Spencer wrote: Jiří Kuhn wrote: This doesn't work either! You're right. I'm running under JDK 1.5 and trying larger values for -Xmx and it still fails. Running under (Borland's) OptimizeIt shows the number of Terms and TermInfos (both in org.apache.lucene.index) increase every time thru the loop, by several hundred instances each.

Yes... I'm running into a similar situation on JDK 1.4.2 with Lucene 1.3... I used the JMP debugger and all my memory is taken by Terms and TermInfo... I can trace thru some Term instances on the reference graph of OptimizeIt but it's unclear to me what's right. One *guess* is that maybe the WeakHashMap in either SegmentReader or FieldCacheImpl is the problem.

Kevin
Re: IRC?!
Harald Tijink wrote: I hope your idea isn't to replace this users list and pull the discussions into the IRC scene. I (and most of us) can not attend any IRC chat because of work and other priorities. This list gives me the opportunity to keep informed ("involved").

Yup... I want to replace the mailing lists, wiki, website, CVS, and Bugzilla with IRC. And if you can't keep up that's just your fault ;) (joke). It's just another tool ;)

Kevin
IRC?!
There isn't a Lucene IRC room, is there (at least there isn't according to Google)? I just joined #lucene on irc.freenode.net if anyone is interested...

Kevin
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Daniel Taurat wrote: Hi Pete, good hint, but we actually do have 4Gb of physical memory on the system. But then: we have also experienced that the GC of the IBM JDK 1.3.1 that we use sometimes behaves strangely with too large a heap anyway (the limit seems to be 1.2Gb).

Depends on what OS and with what patches... Linux on i386 seems to have a physical limit of 1.7G (256M for the VM)... There are some patches to apply to get 3G, but only on really modern kernels. I just need to get Athlon systems :-/

Kevin
TermInfo using 300M for large index?
I'm trying to do some heap debugging of my application to find a memory leak. I noticed that org.apache.lucene.index.TermInfo had 1.7M instances which consumed 300M... this is of course for a 40G index. Is this normal, and is there any way I can streamline this? We are of course caching the IndexSearchers, but I want to reduce the memory footprint...

Kevin
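As a rough back-of-envelope (my own arithmetic, only as good as the numbers above): the reader keeps roughly one cached Term/TermInfo pair per indexInterval terms, so raising the interval should cut these instance counts proportionally.

public class TermCacheEstimate {
    public static void main(String[] args) {
        long cachedTerms = 1700000L;          // observed TermInfo instances
        long heapBytes = 300L * 1024 * 1024;  // observed heap attributed to them
        System.out.println("~" + (heapBytes / cachedTerms) + " bytes per cached term");

        // Doubling the index interval (e.g. 128 -> 256) roughly halves the
        // number of cached terms, and with it this portion of the heap.
        System.out.println("at double the interval: ~" + (cachedTerms / 2) + " cached terms");
    }
}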
Anyone avail for Lucene consulting or employment in the SF area?
Hope no one considers this spam ;) We're hiring either someone full-time who has strong experience with Java, Lucene, and Jakarta technologies or someone to act as a consultant working on Lucene for about a month optimizing our search infra. This is for a startup located in downtown SF. Send me an email including your resume (html or text only) and I'll respond with full details. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Possible to remove duplicate documents in sort API?
Paul Elschot wrote: Kevin, On Sunday 05 September 2004 10:16, Kevin A. Burton wrote: I want to sort a result set but perform a group by as well... IE remove duplicate items. Could you be more precise? My problem is that I have two machines... one for searching, one for indexing. The searcher has an existing index. The indexer found an UPDATED document and then adds it to a new index and pushes that new index over to the searcher. The searcher then reloads and when someone performs a search BOTH documents could show up (including the stale document). I can't do a delete() on the searcher because the indexer doesn't have the entire index that the searcher has. Therefore I wanted to group by the same document ID but this doesn't seem possible. This should suppress the stale document and prefer the newer doc. Is this possible with the new API? Seems like a huge drawback to lucene right now. In case you can define another field that defines what is a duplicate by having the same value for duplicates, you can use it as one of the SortFields for sorting. I have this duplicate field... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
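(A hedged sketch of the suggestion above: sort on the duplicate-key field so identical documents land next to each other, then keep only the first hit per key while walking the results. The field names "docId" and "version" are placeholders, not the poster's actual schema; "version" descending is only there so the newest copy comes first within a group.)

    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    public class DedupedSearch {
      /** Sorts on the duplicate-key field, then keeps only the first hit per key. */
      public static void search(Searcher searcher, Query query) throws Exception {
        Sort sort = new Sort(new SortField[] {
            new SortField("docId"),                               // groups duplicates together
            new SortField("version", SortField.STRING, true) });  // newest first within a group
        Hits hits = searcher.search(query, sort);
        String lastKey = null;
        for (int i = 0; i < hits.length(); i++) {
          Document doc = hits.doc(i);
          String key = doc.get("docId");
          if (key != null && key.equals(lastKey)) {
            continue;                                             // stale duplicate, skip it
          }
          lastKey = key;
          // ... render doc ...
        }
      }
    }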
Why doesn't Document use a HashSet instead of a LinkedList (DocumentFieldList)
It looks like Document.java uses its own implementation of a LinkedList.. Why not use a HashMap to enable O(1) lookup... right now field lookup is O(N) which is certainly no fun. Was this benchmarked? Perhaps there's the assumption that since documents often have few fields, the linked list's object and hashcode overhead would be lower than a map's. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
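(A hedged workaround sketch rather than a change to Document itself: if you need many field lookups on the same document, copy the fields into a HashMap once and look them up there. Note that repeated field names collapse to a single entry here, which the real Document does not do.)

    import java.util.Enumeration;
    import java.util.HashMap;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class FieldMap {
      /** Copies a Document's fields into a HashMap for O(1) lookup by name. */
      public static HashMap byName(Document doc) {
        HashMap map = new HashMap();
        for (Enumeration e = doc.fields(); e.hasMoreElements();) {
          Field f = (Field) e.nextElement();
          map.put(f.name(), f);               // last value wins for repeated names
        }
        return map;
      }
    }

Usage would then be something like: Field title = (Field) FieldMap.byName(doc).get("title");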
Possible to remove duplicate documents in sort API?
I want to sort a result set but perform a group by as well... IE remove duplicate items. Is this possible with the new API? Seems like a huge drawback to lucene right now. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Patch for IndexWriter.close which prevents NPE...
I just attached a patch which: 1. prevents multiple close() of an IndexWriter 2. prevents an NPE if the writeLock was null. We have been noticing this from time to time and I haven't been able to come up with a hard test case. This is just a bit of defensive programming to prevent it from happening in the first place. It would happen from time to time without any reliable cause. Anyway... Thanks... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

--- IndexWriter.java.bak.close 2004-09-03 11:27:37.0 -0700
+++ IndexWriter.java 2004-09-03 11:32:02.0 -0700
@@ -107,6 +107,11 @@
    */
   private boolean useCompoundFile = false;

+  /**
+   * True when we have closed this IndexWriter
+   */
+  protected boolean isClosed = false;
+
   /** Setting to turn on usage of a compound file. When on, multiple files
    *  for each segment are merged into a single file once the segment creation
    *  is finished. This is done regardless of what directory is in use.
@@ -183,15 +188,27 @@
       }.run();
     }
   }
-
+
   /** Flushes all changes to an index, closes all associated files, and closes
     the directory that the index is stored in. */
   public synchronized void close() throws IOException {
+
+    if ( isClosed ) {
+      return;
+    }
+
     flushRamSegments();
     ramDirectory.close();
-    writeLock.release();    // release write lock
+
+    if ( writeLock != null ) {
+      // release write lock
+      writeLock.release();
+    }
+
     writeLock = null;
     directory.close();
+    isClosed = true;
+
   }

   /** Release the write lock, if needed. */

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Benchmark of filesystem cache for index vs RAMDirectory...
Daniel Naber wrote: On Sunday 08 August 2004 03:40, Kevin A. Burton wrote: Would a HashMap implementation of RAMDirectory beat out a cached FSDirectory? It's easy to test, so it's worth a try. Please try if the attached patch makes any difference for you compared to the current implementation of RAMDirectory. True... I was just thinking out loud... was being lazy. Now I actually have to do the work to create the benchmark again... damn you ;) Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
Benchmark of filesystem cache for index vs RAMDirectory...
I'm not sure how Solaris or Windows perform but the Linux block cache will essentially use all available memory to buffer the filesystem. If one is using an FSDirectory to perform searches, the first search would be slow but remaining searches would be fast since Linux will now buffer the index in RAM. The RAMDirectory has the advantage of pre-loading everything and can keep it in memory if the box is performing other operations. In my benchmarks though RAMDirectory is slightly slower. I assume this is because it's Hashtable-based even though it only needs to be synchronized in a few places. IE searches should never be synchronized... Would a HashMap implementation of RAMDirectory beat out a cached FSDirectory? Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
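(A hedged benchmark sketch of the comparison described above: the same term query against an FSDirectory-backed searcher, with a warm-up pass so the OS block cache is populated, and against an in-memory copy of the same index. The index path, field and term are placeholders, and the RAMDirectory(String) constructor assumed here is the 1.4 one that loads a disk index into RAM.)

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;

    public class FsVsRam {
      static long time(IndexSearcher searcher, Query query, int runs) throws Exception {
        long start = System.currentTimeMillis();
        for (int i = 0; i < runs; i++) {
          searcher.search(query).length();        // force the search to complete
        }
        return System.currentTimeMillis() - start;
      }

      public static void main(String[] args) throws Exception {
        Query query = new TermQuery(new Term("contents", "lucene"));
        IndexSearcher fs = new IndexSearcher("/index/full");
        IndexSearcher ram = new IndexSearcher(new RAMDirectory("/index/full"));
        time(fs, query, 100);                     // warm-up so the block cache is hot
        System.out.println("FSDirectory:  " + time(fs, query, 1000) + " ms");
        System.out.println("RAMDirectory: " + time(ram, query, 1000) + " ms");
      }
    }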
Performance when computing computing a filter using hundreds of diff terms.
I'm trying to compute a filter to match documents in our index by a set of terms. For example some documents have a given field 'category' so I need to compute a filter with multiple categories. The problem is that our category list is > 200 items so it takes about 80 seconds to compute. We cache it of course but this seems WAY too slow. Is there anything I could do to speed it up? Maybe run the queries myself and then combine the bitsets? We're using a BooleanQuery with nested TermQueries to build up the filter... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
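(A hedged sketch of the "run the queries myself and combine the bitsets" idea: a Filter that skips the BooleanQuery and scoring machinery entirely and sets bits straight from TermDocs, one seek per category term. The field name "category" comes from the post; the class name is made up. Caching the resulting BitSet per IndexReader still applies.)

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;

    public class CategoryFilter extends Filter {
      private final String[] categories;

      public CategoryFilter(String[] categories) {
        this.categories = categories;
      }

      public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs();
        try {
          for (int i = 0; i < categories.length; i++) {
            termDocs.seek(new Term("category", categories[i]));
            while (termDocs.next()) {
              bits.set(termDocs.doc());       // no scoring, just set membership
            }
          }
        } finally {
          termDocs.close();
        }
        return bits;
      }
    }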
Split an existing index into smaller segments without a re-index?
Is it possible to take an existing index (say 1G) and break it up into a number of smaller indexes (say 10 100M indexes)... I don't think there's currently an API for this but it's certainly possible (I think). Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
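(There is no split API, but a hedged sketch of one workaround: re-add the stored documents round-robin into N new indexes. This only carries over what was stored, and it re-analyzes fields on the way back in, so it is not a byte-for-byte split; the paths, analyzer and N are assumptions.)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    public class SplitIndex {
      public static void main(String[] args) throws Exception {
        int n = 10;
        IndexReader reader = IndexReader.open("/index/big");
        IndexWriter[] writers = new IndexWriter[n];
        for (int i = 0; i < n; i++) {
          writers[i] = new IndexWriter("/index/part" + i, new StandardAnalyzer(), true);
        }
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
          if (reader.isDeleted(doc)) continue;
          writers[doc % n].addDocument(reader.document(doc));   // stored fields only
        }
        for (int i = 0; i < n; i++) {
          writers[i].optimize();
          writers[i].close();
        }
        reader.close();
      }
    }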
Re: Progress bar for Lucene
Hannah c wrote: Hi, Is there anything in lucene that would help with the implementation of a progress bar. Somewhere I could throw an event that says the search is 10%, 20% complete etc. Or is there already an implementation of a progress bar available for lucene. I would really like to see something like this for index optimizes actually. If an optimize takes 45 minutes it's nice to see a progress indicator. Of course I've thought about just doing a disk-based IO estimate ... you just watch the index being created and estimate its target size... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
GROUP BY functionality.
In 1.4 we now have arbitrary sort support... Is it possible to use GROUP BY without having to do this on the client (which would be inefficient)... I have a field I want to make sure is unique in my search results. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Doug Cutting wrote: Aviran wrote: I changed the Lucene 1.4 final source code and yes this is the source version I changed. Note that this patch won't produce a speedup on earlier releases, since there was another multi-thread bottleneck higher up the stack that was only recently removed, revealing this lower-level bottleneck. The other patch was: http://www.mail-archive.com/[EMAIL PROTECTED]/msg07873.html Both are required to see the speedup. Thanks... Also, is there any reason folks cannot use 1.4 final now? No... just that I'm trying to be conservative... I'm probably going to look at just migrating to 1.4 ASAP but we're close to a milestone... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Aviran wrote: Bug 30058 posted Which of course is here: http://issues.apache.org/bugzilla/show_bug.cgi?id=30058 Is this the source of the revision you modified? http://www.mail-archive.com/[EMAIL PROTECTED]/msg06116.html Also what version of Lucene? Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Why is Field.java final?
Doug Cutting wrote: Kevin A. Burton wrote: I was going to create a new IDField class which just calls super( name, value, false, true, false) but noticed I was prevented because Field.java is final? You don't need to subclass to do this, just a static method somewhere. Why is this? I can't see any harm in making it non-final... Field and Document are not designed to be extensible. They are persisted in such a way that added methods are not available when the field is restored. In other words, when a field is read, it always constructs an instance of Field, not a subclass. Thats fine... I think thats acceptable behavior. I don't think anyone would assume that inner vars are restored or that the field is serialized. Not a big deal but it would be nice... -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
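(A minimal sketch of the "just a static method" suggestion, using the same (store=false, index=true, token=false) arguments as the IDField described in the post. The helper class name is made up.)

    import org.apache.lucene.document.Field;

    public class Fields {
      /** An unstored, indexed, untokenized field -- the IDField from the post. */
      public static Field idField(String name, String value) {
        return new Field(name, value, false, true, false);
      }
    }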
Re: Field.java -> STORED, NOT_STORED, etc...
Doug Cutting wrote: It would be best to get the compiler to check the order. If we change this, why not use type-safe enumerations: http://www.javapractices.com/Topic1.cjp The calls would look like: new Field("name", "value", Stored.YES, Indexed.NO, Tokenized.YES); Stored could be implemented as the nested class: public final class Stored { private Stored() {} public static final Stored YES = new Stored(); public static final Stored NO = new Stored(); } +1... I'm not in love with this pattern but since Java < 1.5 doesn't support enums it's better than nothing. I also didn't want to submit a recommendation that would break APIs. I assume the old API would be deprecated? Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Doug Cutting wrote: I noticed that the class org.apache.lucene.index.FieldInfos uses private class members Vector byNumber and Hashtable byName, both of which are synchronized objects. By changing the Vector byNumber to ArrayList byNumber I was able to get 110% improvement in performance (number of searches per second). That's impressive! Good job finding a bottleneck! Wow... thats awesome. We have all dual XEONs with Hyperthreading and kernel 2.6 so I imagine in this situation we'd see an improvement too. I wonder if we could break this out into a patch for legacy Lucene users. I'd like to see the stacktrace too. We're using a lot of synchronized code (Hashtable, Vector, etc) so I'm willing to bet this is happening in other places. My question is: do the fields byNumber and byName have to be synchronized and what can happen if I'll change them to be ArrayList and HashMap which are not synchronized ? Can this corrupt the index or the integrity of the results? I think that is a safe change. FieldInfos is only modifed by DocumentWriter and SegmentMerger, and there is no possibility of other threads accessing those instances. Please submit a patch to the developer mailing list. That would be great! Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
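(A hedged micro-benchmark sketch of the swap being discussed -- synchronized Vector.get() versus unsynchronized ArrayList.get(). It says nothing about Lucene itself, and single-threaded numbers understate the win on a multi-CPU box where lock contention is the real cost; it only illustrates that the synchronization is not free.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Vector;

    public class VectorVsArrayList {
      static long time(List list, int iterations) {
        int size = list.size();
        Object sink = null;
        long start = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
          sink = list.get(i % size);
        }
        if (sink == null) System.out.print("");   // keep the loop from being optimized away
        return System.currentTimeMillis() - start;
      }

      public static void main(String[] args) {
        int size = 1000;
        int iterations = 50000000;
        List vector = new Vector();
        List arrayList = new ArrayList();
        for (int i = 0; i < size; i++) {
          vector.add(new Integer(i));
          arrayList.add(new Integer(i));
        }
        System.out.println("Vector:    " + time(vector, iterations) + " ms");
        System.out.println("ArrayList: " + time(arrayList, iterations) + " ms");
      }
    }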
Re: Why is Field.java final?
John Wang wrote: I was running into similar problems with Lucene classes being final. In my case the Token class. I sent out an email but no one responded :( final is often abused... as is private. anyway... maybe we can submit a patch :) Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Field.java -> STORED, NOT_STORED, etc...
I've been working with the Field class doing index conversions between an old index format to my new external content store proposal (thus the email about the 14M convert). Anyway... I find the whole Field.Keyword, Field.Text thing confusing. The main problem is that the constructor to Field just takes booleans and if you forget the ordering of the booleans its very confusing. new Field( "name", "value", true, false, true ); So looking at that you have NO idea what its doing without fetching javadoc. So I added a few constants to my class: new Field( "name", "value", NOT_STORED, INDEXED, NOT_TOKENIZED ); which IMO is a lot easier to maintain. Why not add these constants to Field.java: public static final boolean STORED = true; public static final boolean NOT_STORED = false; public static final boolean INDEXED = true; public static final boolean NOT_INDEXED = false; public static final boolean TOKENIZED = true; public static final boolean NOT_TOKENIZED = false; Of course you still have to remember the order but this becomes a lot easier to maintain. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Why is Field.java final?
I was going to create a new IDField class which just calls super( name, value, false, true, false) but noticed I was prevented because Field.java is final? Why is this? I can't see any harm in making it non-final... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Increasing Linux kernel open file limits.
Don't know if anyone knew this: http://www.hp-eloquence.com/sdb/html/linux_limits.html The kernel allocates filehandles dynamically up to a limit specified by file-max. The value in file-max denotes the maximum number of filehandles that the Linux kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit. The three values in file-nr denote the number of allocated file handles, the number of used file handles and the maximum number of file handles. When the allocated filehandles come close to the maximum, but the number of actually used ones is far behind, you've encountered a peak in your filehandle usage and you don't need to increase the maximum. So while as root you can allocate file handles without any limits enforced by glibc, you still have to fight against the kernel. Just doing an echo 100 > /proc/sys/fs/file-max works fine. Then I can keep track of my file limit by doing a cat /proc/sys/fs/file-nr At least this works on 2.6.x... Think this is going to save me a lot of headache! Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Way to repair an index broking during 1/2 optimize?
Doug Cutting wrote: Something sounds very wrong for there to be that many files. The maximum number of files should be around: (7 + numIndexedFields) * (mergeFactor-1) * (log_base_mergeFactor(numDocs/minMergeDocs)) With 14M documents, log_10(14M/1000) is 4, which gives, for you: (7 + numIndexedFields) * 36 = 230k, i.e. 7*36 + numIndexedFields*36 = 230k, so numIndexedFields = (230k - 7*36) / 36 =~ 6k. So you'd have to have around 6k unique field names to get 230k files. Or something else must be wrong. Are you running on win32, where file deletion can be difficult? With the typical handful of fields, one should never see more than hundreds of files. We only have 13 fields... Though to be honest I'm worried that even if I COULD do the optimize it would run out of file handles. This is very strange... I'm going to increase minMergeDocs to 1 and then run the full conversion on one box and then try to do an optimize (of the corrupt index) on another box. See which one finishes first. I assume the speed of optimize() can be increased the same way that indexing is increased... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene shouldn't use java.io.tmpdir
Doug Cutting wrote: Kevin A. Burton wrote: This is why I think it makes more sense to use our own java.io.tmpdir to be on the safe side. I think the bug is that Tomcat changes java.io.tmpdir. I thought that the point of the system property java.io.tmpdir was to have a portable name for /tmp on unix, c:\windows\tmp on Windows, etc. Tomcat breaks that. So must Lucene have its own way of finding the platform-specific temporary directory that everyone can write to? Perhaps, but it seems a shame, since Java already has a standard mechanism for this, which Tomcat abuses... I've seen this done in other places as well. I think Weblogic did/does it. I'm wondering what some of these big EJB containers use which is why I brought this up. I'm not sure the problem is just with Tomcat. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Way to repair an index broking during 1/2 optimize?
Doug Cutting wrote: Kevin A. Burton wrote: No... I changed the mergeFactor back to 10 as you suggested. Then I am confused about why it should take so long. Did you by chance set the IndexWriter.infoStream to something, so that it logs merges? If so, it would be interesting to see that output, especially the last entry. No I didn't actually... If I run it again I'll be sure to do this. -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene shouldn't use java.io.tmpdir
Otis Gospodnetic wrote: Hey Kevin, Not sure if you're aware of it, but you can specify the lock dir, so in your example, both JVMs could use the exact same lock dir, as long as you invoke the VMs with the same params. Most people won't do this or won't even understand WHY they need to do this :-/. You shouldn't be writing the same index with more than 1 IndexWriter though (not sure if this was just a bad example or a real scenario). Yes... I realize that you shouldn't use more than one IndexWriter. That was the point. The locks are to prevent this from happening. If one were to accidentally do this the locks would be in different directories and our IndexWriter would corrupt the index. This is why I think it makes more sense to use our own java.io.tmpdir to be on the safe side. -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Understanding TooManyClauses-Exception and Query-RAM-size
[EMAIL PROTECTED] wrote: Hi, a couple of weeks ago we migrated from Lucene 1.2 to 1.4rc3. Everything went smoothly, but we are experiencing some problems with that new constant limit maxClauseCount=1024 which leads to Exceptions of type org.apache.lucene.search.BooleanQuery$TooManyClauses when certain RangeQueries are executed (in fact, we get this Exception when we execute certain Wildcard queries, too). Although we are working with a fairly small index with about 35.000 documents, we encounter this Exception when we search for the property "modificationDate". For example modificationDate:[00 TO 0dwc970kw] We talked about this the other day. http://wiki.apache.org/jakarta-lucene/IndexingDateFields Find out what type of precision you need and use that. If you only need days or hours or minutes then use that. Millis is just too fine-grained. We're only using days and have queries for just the last 7 days as max so this really works out well... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
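(A hedged sketch of the day-precision advice, assuming the dates are (re)indexed as plain yyyyMMdd strings rather than with DateField: a "last 7 days" range then expands to at most 8 terms instead of thousands. The field name comes from the post; the format is an assumption.)

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.RangeQuery;

    public class DayPrecisionDates {
      private static final SimpleDateFormat DAY = new SimpleDateFormat("yyyyMMdd");

      /** Index the date field with this value instead of a millisecond timestamp. */
      public static String toDay(Date d) {
        return DAY.format(d);
      }

      /** RangeQuery covering the last seven days at day precision. */
      public static RangeQuery lastSevenDays() {
        long now = System.currentTimeMillis();
        Term lower = new Term("modificationDate", toDay(new Date(now - 7L * 24 * 60 * 60 * 1000)));
        Term upper = new Term("modificationDate", toDay(new Date(now)));
        return new RangeQuery(lower, upper, true);
      }
    }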
Re: Way to repair an index broking during 1/2 optimize?
Doug Cutting wrote: Kevin A. Burton wrote: So is it possible to fix this index now? Can I just delete the most recent segment that was created? I can find this by ls -alt Sorry, I forgot to answer your question: this should work fine. I don't think you should even have to delete that segment. I'm worried about duplicate or missing content from the original index. I'd rather rebuild the index and waste another 6 hours (I've probably blown 100 hours of CPU time on this already) and have a correct index :) During an optimize I assume Lucene starts writing to a new segment and leaves all others in place until everything is done and THEN deletes them? Also, to elaborate on my previous comment, a mergeFactor of 5000 not only delays the work until the end, but it also makes the disk workload more seek-dominated, which is not optimal. The only settings I use are: targetIndex.mergeFactor=10; targetIndex.minMergeDocs=1000; the resulting index has 230k files in it :-/ I assume this is contributing to all the disk seeks. So I suspect a smaller merge factor, together with a larger minMergeDocs, will be much faster overall, including the final optimize(). Please tell us how it goes. This is what I did for this last round but then I ended up with the highly fragmented index. hm... Thanks for all the help btw! Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Way to repair an index broking during 1/2 optimize?
Doug Cutting wrote: Kevin A. Burton wrote: Also... what can I do to speed up this optimize? Ideally it wouldn't take 6 hours. Was this the index with the mergeFactor of 5000? If so, that's why it's so slow: you've delayed all of the work until the end. Indexing on a ramfs will make things faster in general, however, if you have enough RAM... No... I changed the mergeFactor back to 10 as you suggested. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Way to repair an index broking during 1/2 optimize?
Peter M Cipollone wrote: You might try merging the existing index into a new index located on a ram disk. Once it is done, you can move the directory from ram disk back to your hard disk. I think this will work as long as the old index did not finish merging. You might do a "strings" command on the segments file to make sure the new (merged) segment is not in there, and if there's a "deletable" file, make sure there are no segments from the old index listed therein. Its a HUGE index. It won't fit in memory ;) Right now its at 8G... Thanks though! :) Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Way to repair an index broking during 1/2 optimize?
So.. the other day I sent an email about building an index with 14M documents. That went well but the optimize() was taking FOREVER. It took 7 hours to generate the whole index and when complete as of 10AM it was still optimizing (6 hours later) and I needed the box back. So is it possible to fix this index now? Can I just delete the most recent segment that was created? I can find this by ls -alt Also... what can I do to speed up this optimize? Ideally it wouldn't take 6 hours. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Lucene shouldn't use java.io.tmpdir
As of 1.3 (or was it 1.4?) Lucene migrated to using java.io.tmpdir to store the locks for the index. While under most situations this is safe, a lot of application servers change java.io.tmpdir at runtime. Tomcat is a good example. Within Tomcat this property is set to TOMCAT_HOME/temp.. Under this situation if I were to create two IndexWriters within two VMs and try to write to the same index the index would get corrupted if one Lucene instance was within Tomcat and the other was within a standard VM. I think we should consider either: 1. Using our own tmpdir property based on the given OS. 2. Go back to the old mechanism of storing the locks within the index basedir (if it's not readonly). Thoughts? -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Most efficient way to index 14M documents (out of memory/file handles)
Doug Cutting wrote: Julien, Thanks for the excellent explanation. I think this thread points to a documentation problem. We should improve the javadoc for these parameters to make it easier for folks to understand. In particular, the javadoc for mergeFactor should mention that very large values (>100) are not recommended, since they can run into file handle limitations with FSDirectory. The maximum number of open files while merging is around mergeFactor * (5 + number of indexed fields). Perhaps mergeFactor should be tagged an "Expert" parameter to discourage folks from playing with it, as it is such a common source of problems. The javadoc should instead encourage using minMergeDocs to increase indexing speed by using more memory. This parameter is unfortunately poorly named. It should really be called something like maxBufferedDocs. I'd like to see something like this done... BTW.. I'm willing to add it to the wiki in the interim. This conversation has happened a few times now... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Most efficient way to index 14M documents (out of memory/file handles)
I'm trying to burn an index of 14M documents. I have two problems. 1. I have to run optimize() every 50k documents or I run out of file handles. this takes TIME and of course is linear to the size of the index so it just gets slower by the time I complete. It starts to crawl at about 3M documents. 2. I eventually will run out of memory in this configuration. I KNOW this has been covered before but for the life of me I can't find it in the archives, the FAQ or the wiki. I'm using an IndexWriter with a mergeFactor of 5k and then optimizing every 50k documents. Does it make sense to just create a new IndexWriter for every 50k docs and then do one big optimize() at the end? Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
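(A hedged sketch of the configuration the replies in this thread point toward: leave mergeFactor modest, raise minMergeDocs to buffer more documents in RAM, and optimize once at the very end instead of every 50k documents. The path, analyzer and exact values are assumptions, not a verbatim recommendation from the thread.)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class BulkIndex {
      public static void build(String path, Document[] docs) throws Exception {
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        writer.mergeFactor = 10;       // modest: bounds the number of open segment files
        writer.minMergeDocs = 1000;    // buffer this many docs in RAM before a segment is written
        for (int i = 0; i < docs.length; i++) {
          writer.addDocument(docs[i]);
        }
        writer.optimize();             // one merge at the end instead of one every 50k docs
        writer.close();
      }
    }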
Preventing duplicate document insertion during optimize
Let's say you have two indexes each with the same document literal. All the fields hash the same and the document is a binary duplicate of a different document in the second index. What happens when you do a merge to create a 3rd index from the first two? I assume you now have two documents that are identical in one index. Is there any way to prevent this? It would be nice to figure out if there's a way to flag a field as a primary key so that if a document with that key has already been added the duplicate is just skipped. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
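(Lucene has no primary-key flag, but a hedged sketch of the usual workaround: before adding or re-adding a document, delete any existing document carrying the same key term, then add. The field name "key" and the path are assumptions.)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class UniqueKeyAdd {
      /** Deletes any document with the same key, then adds the new one. */
      public static void addOrReplace(String path, String key, Document doc) throws Exception {
        IndexReader reader = IndexReader.open(path);
        reader.delete(new Term("key", key));     // no-op if the key is not present
        reader.close();

        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), false);
        writer.addDocument(doc);
        writer.close();
      }
    }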
Created LockObtainTimedOut wiki page
I just created a LockObtainTimedOut wiki entry... feel free to add. I just entered the Tomcat issue with java.io.tmpdir as well. http://wiki.apache.org/jakarta-lucene/LockObtainTimedOut Peace! -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: 'Lock obtain timed out' even though NO locks exist...
Gus Kormeier wrote: Not sure if our installation is the same or not, but we are also using Tomcat. I had a similiar problem last week, it occurred after Tomcat went through a hard restart and some software errors had the website hammered. I found the lock file in /usr/local/tomcat/temp/ using locate. According to the README.txt this is a directory created for the JVM within Tomcat. So it is a system temp directory, just inside Tomcat. Man... you ROCK! I didn't even THINK of that... Hm... I wonder if we should include the name of the lock file in the Exception within Tomcat. That would probably have saved me a lot of time :) Either that or we can put this in the wiki Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: 'Lock obtain timed out' even though NO locks exist...
James Dunn wrote: Which version of lucene are you using? In 1.2, I believe the lock file was located in the index directory itself. In 1.3, it's in your system's tmp folder. Yes... 1.3 and I have a script that removes the locks from both dirs... This is only one process so it's just fine to remove them. Perhaps it's a permission problem on either one of those folders. Maybe your process doesn't have write access to the correct folder and is thus unable to create the lock file? I thought about that too... I have plenty of disk space so that's not an issue. Also did a chmod -R so that should work too. You can also pass lucene a system property to increase the lock timeout interval, like so: -Dorg.apache.lucene.commitLockTimeout=6 or -Dorg.apache.lucene.writeLockTimeout=6 I'll give that a try... good idea. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: 'Lock obtain timed out' even though NO locks exist...
Kevin A. Burton wrote: Actually this is exactly the problem... I ran some single index tests and a single process seems to read from it. The problem is that we were running under Tomcat with diff webapps for testing and didn't run into this problem before. We had an 11G index that just took a while to open and during this open Lucene was creating a lock. I wasn't sure that Tomcat was multithreading this so maybe it is and it's just taking longer to open the lock in some situations. This is strange... after removing all the webapps (besides 1) Tomcat still refuses to allow Lucene to open this index with Lock obtain timed out. If I open it up from the console it works just fine. I'm only doing it with one index and a ulimit -n so it's not a files issue. Memory is 1G for Tomcat. If I figure this out will be sure to send a message to the list. This is a strange one Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: 'Lock obtain timed out' even though NO locks exist...
[EMAIL PROTECTED] wrote: It is possible that a previous operation on the index left the lock open. Leaving the IndexWriter or Reader open without closing them ( in a finally block ) could cause this. Actually this is exactly the problem... I ran some single index tests and a single process seems to read from it. The problem is that we were running under Tomcat with diff webapps for testing and didn't run into this problem before. We had an 11G index that just took a while to open and during this open Lucene was creating a lock. I wasn't sure that Tomcat was multithreading this so maybe it is and it's just taking longer to open the lock in some situations. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
'Lock obtain timed out' even though NO locks exist...
I've noticed this really strange problem on one of our boxes. It's happened twice already. We have indexes where when Lucene starts it says 'Lock obtain timed out' ... however NO locks exist for the directory. There are no other processes present and no locks in the index dir or /tmp. Is there any way to figure out what's going on here? Looking at the index it seems just fine... But this is only a brief glance. I was hoping that if it was corrupt (which I don't think it is) Lucene would give me a better error than "Lock obtain timed out" Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: Does a RAMDirectory ever need to merge segments... (performanceissue)
Gerard Sychay wrote: I've always wondered about this too. To put it another way, how does mergeFactor affect an IndexWriter backed by a RAMDirectory? Can I set mergeFactor to the highest possible value (given the machine's RAM) in order to avoid merging segments? Yes... actually I was thinking of increasing these vars on the RAMDirectory in the hope of avoiding this CPU overhead.. Also I think the var you want is minMergeDocs not mergeFactor. The only problem is that the source to maybeMergeSegments says: private final void maybeMergeSegments() throws IOException { long targetMergeDocs = minMergeDocs; while (targetMergeDocs <= maxMergeDocs) { So I guess to prevent this we would have to set minMergeDocs to maxMergeDocs+1 ... which makes no sense. Also by default maxMergeDocs is Integer.MAX_VALUE so that will have to be changed. Anyway... I'm still playing with this myself. It might be easier to just use an ArrayList of N documents if you know for sure how big your RAM dir will grow to. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Does a RAMDirectory ever need to merge segments... (performance issue)
I've been benchmarking our indexer to find out if I can squeeze any more performance out of it. I noticed one problem with RAMDirectory... I'm storing documents in memory and then writing them to disk every once in a while. ... IndexWriter.maybeMergeSegments is taking up 5% of total runtime. DocumentWriter.addDocument is taking up another 17% of total runtime. Notice that this doesn't == 100% because there are other tasks taking up CPU before and after Lucene is called. Anyway... I don't see why RAMDirectory is trying to merge segments. Is there any way to prevent this? I could just store them in a big ArrayList until I'm ready to write them to a disk index but I'm not sure how efficient this will be. Anyone run into this before? -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
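(A hedged sketch of the batch-and-flush pattern being discussed: build a small index in a RAMDirectory, then fold it into the on-disk index with addIndexes(), so segment merging happens once per batch rather than continuously. The batch size, directories and analyzer are assumptions.)

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class BatchFlush {
      /** Indexes one batch in RAM and merges it into the disk index in one shot. */
      public static void flushBatch(Directory diskDir, Analyzer analyzer, Document[] batch)
          throws Exception {
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
        for (int i = 0; i < batch.length; i++) {
          ramWriter.addDocument(batch[i]);
        }
        ramWriter.close();

        IndexWriter diskWriter = new IndexWriter(diskDir, analyzer, false);
        diskWriter.addIndexes(new Directory[] { ramDir });   // one sequential merge per batch
        diskWriter.close();
      }
    }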
Re: index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)
petite_abeille wrote: On Apr 13, 2004, at 02:45, Kevin A. Burton wrote: He mentioned that I might be able to squeeze 5-10% out of index merges this way. Talking of which... what strategy(ies) do people use to minimize downtime when updating an index? This should probably be a wiki page. Anyway... two thoughts I had on the subject a while back: You maintain two disks (not RAID ... you get reliability through software). Searches are load balanced between disks for performance reasons. If one fails you just stop using it. When you want to do an index merge you read from disk0 and write to disk1. Then you take disk0 out of search rotation, add disk1, and copy the contents of disk1 back to disk0. Users shouldn't notice much of a performance issue during the merge because it will be VERY fast and it's just reads from disk0. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: verifying index integrity
Doug Cutting wrote: If you use this method, it is possible to corrupt things. In particular, if you unlock an index that another process is modifying, then modify it, then these two processes might step on one another. So this method should only be called when you are certain that no one else is modifying the index. We're handling this by using .pid files. We use a standard initializer and use our own lock files with process IDs. If you're on UNIX I can give you the source to the JNI getpid that I created. I've been meaning to Open Source this anyway... putting it into commons probably. This way you can prevent multiple initialization if a java process is currently running that might be working with your index. Otherwise there's no real way to be sure the lock isn't stale (unless time is a factor but that slows things down) Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI
Not sure if this is a bug or expected behavior. I took Doug's suggestion and migrated to a large BUFFER_SIZE of 1024^2 . He mentioned that I might be able to squeeze 5-10% out of index merges this way. I'm not sure if this is expected behavior but this requires a LOT of memory. Without this setting the VM only grows to about 200M ... As soon as I enable this my VM will go up to 1.5G and run out of memory (which is the max heap). Our indexes aren't THAT big so I'm not sure if something's wrong here or if this is expected behavior. If this is expected I'm not sure this is valuable. There are other uses for that memory... perhaps just doing the whole merge in memory... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Numeric field data
Stephane James Vaucher wrote: Hi Tate, There is a solution by Erik that pads numbers in the index. That would allow you to search correctly. I'm not sure about decimal, but you could always add a multiplier. Wonder if that should go in the FAQ... wiki... -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
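(A hedged sketch of the padding trick being referred to: left-pad numbers to a fixed width so that lexicographic term order matches numeric order, and scale decimals by a fixed multiplier first. The 20-digit width and the multiplier are arbitrary choices, and this only handles non-negative values.)

    public class NumberPadding {
      private static final String ZEROS = "00000000000000000000";   // 20 digits

      /** Left-pads a non-negative long so string order equals numeric order. */
      public static String pad(long value) {
        String s = Long.toString(value);
        return ZEROS.substring(s.length()) + s;
      }

      /** For decimals: scale to an integer first, e.g. keep two decimal places. */
      public static String padDecimal(double value) {
        return pad(Math.round(value * 100));
      }
    }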
Re: Performance of hit highlighting and finding term positions for
[EMAIL PROTECTED] wrote: 730 msecs is the correct number for 10 * 16k docs with StandardTokenizer! The 11ms per doc figure in my post was for highlighting using a lower-case-filter-only analyzer. 5ms of this figure was the cost of the lower-case-filter-only analyzer. 73 msecs is the cost of JUST StandardTokenizer (no highlighting) StandardAnalyzer uses StandardTokenizer so is probably used in a lot of apps. It tries to keep certain text eg email addresses as one term. I can live without it and I suspect most apps can too. I haven't looked into why it's slow but I notice it does make use of Vectors. I think a lot of people's highlighter performance issues may extend from this. Looking at StandardTokenizer I can't see anything that would slow it down much... can we get the source to your lower case filter?! Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
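(The poster's analyzer isn't shown, so this is only a guess at what a "lower-case-filter-only" analyzer might look like: whitespace tokenization plus LowerCaseFilter, skipping StandardTokenizer's grammar entirely. It will obviously not keep email addresses and the like together as single terms.)

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    public class LowerCaseWhitespaceAnalyzer extends Analyzer {
      public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
      }
    }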
Re: [patch] MultiSearcher should support getSearchables()
Erik Hatcher wrote: No question that it'd be unwise to do. We could say the same argument for making everything public access as well and say it'd be stupid to override this method, but we made it public anyway. I'd rather opt on the side of safety. Besides, you haven't provided a use case for why you need to get the searchers back from a MultiSearcher :) Just ease of use really... I have our MultiSearcher reload transparently and this case I can verify that I'm using the right array of searchers not one that's already been reloaded behind me. I can add some code to preserve the original searcher array but it's a pain. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
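(A hedged sketch of the accessor under discussion; the "searchables" field is assumed to be the array the MultiSearcher was constructed with. Returning a clone keeps the "read-only" recommendation honest, since callers only get a copy and cannot swap searchers behind the MultiSearcher's back.)

    // Inside MultiSearcher (sketch only; "searchables" is assumed to be the
    // constructor argument kept in a private field):
    public Searchable[] getSearchables() {
      return (Searchable[]) searchables.clone();   // defensive copy
    }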
Re: RE : Performance of hit highlighting and finding term positions for a specific document
Rasik Pandey wrote: Hello, I've been meaning to look into good ways to store token offset information to allow for very efficient highlighting and I believe Mark may also be looking into improving the highlighter via other means such as temporary ram indexes. Search the archives to get a background on some of the ideas we've tossed around ('Dmitry's Term Vector stuff, plus some' and 'Demoting results' come to mind as threads that touch this topic). It would be nice if CachingRewrittenQueryWrapper.java that I sent to lucene-dev (see below) last week became part of these highlighting efforts, if appropriate. We use it to collect terms for a query that searches over multiple indices. Actually I had to write one for my tests with the highlighter. I'm using a MultiSearcher and a WildcardQuery which the highlighter didn't have support for. My impl was fairly basic so I wouldn't suggest a contribution... I'm sure yours is better. The suggested changes to the highlighter for providing tokens would make this work well together. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: Performance of hit highlighting and finding term positions for
Doug Cutting wrote: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=1413989 According to these, if your documents average 16k, then a 10-hit result page would require just 66ms to generate highlights using SimpleAnalyzer. The whole search takes only 300ms... this means that if I highlight 5 docs I've doubled my search time. Note that Google has a whole subsection of their cluster dedicated to keyword in context extraction. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: RE : Performance of hit highlighting and finding term positions for a specific document
Rasik Pandey wrote: Kevin, http://home.clara.net/markharwood/lucene/highlight.htm Trying to do hit highlighting. This implementation uses another Analyzer to find the positions for the result terms. This seems that it's very inefficient since lucene already knows the frequency and position of given terms in the index. Can you explain in more detail what you mean here? It uses the StandardAnalyzer again to re-index to find tokens... when it finds the same token that matched a search request it highlights it. It works... just not too efficient. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: [patch] MultiSearcher should support getSearchables()
Erik Hatcher wrote: On Mar 30, 2004, at 5:59 PM, Kevin A. Burton wrote: Seems to only make sense to allow a caller to find the searchables a MultiSearcher was created with: Could you elaborate on why it makes sense? What if the caller changed a Searchable in the array? Would anything bad happen? (I don't know, I haven't looked at the code.) Yes... something bad could happen... but that would be amazingly stupid... we should probably just document that the returned array should be treated as read-only. Kevin
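If handing out the internal array is the concern, the accessor could just as easily return a copy. A one-line variation on the patch (sketch only, meant to live inside MultiSearcher.java):

// Inside MultiSearcher.java, as an alternative to returning the array itself:
public Searchable[] getSearchables() {
  return (Searchable[]) searchables.clone(); // hand back a copy, keep the real array private
}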
Re: Performance of hit highlighting and finding term positions for a specific document
Erik Hatcher wrote: On Mar 30, 2004, at 7:56 PM, Kevin A. Burton wrote: Trying to do hit highlighting. This implementation uses another Analyzer pass to find the positions of the result terms. This seems very inefficient, since Lucene already knows the frequency and position of the given terms in the index. What if the original analyzer removed stop words, stemmed, and injected synonyms? Just use the same analyzer :)... I agree it's not the best approach, both for this reason and for the CPU cost. Also, it seems that after all this time Lucene should have efficient hit highlighting as a standard package. Is there any interest in seeing a contribution in the sandbox for this if it uses the index positions? Big +1, regardless of the implementation details. Hit highlighting is so commonly requested that having it available at least in the sandbox, or perhaps even in the core, makes a lot of sense. Well, if we can make it efficient by using the frequency and positions of terms, we're all set :)... I just need to figure out how to do this efficiently per document. Kevin
Performance of hit highlighting and finding term positions for a specific document
I'm playing with this package: http://home.clara.net/markharwood/lucene/highlight.htm Trying to do hit highlighting. This implementation uses another Analyzer pass to find the positions of the result terms. This seems very inefficient, since Lucene already knows the frequency and position of the given terms in the index. My question is whether it's hard to find a TermPosition for a given term in a given document, rather than across the whole index. IndexReader.termPositions( Term term ) is term-specific, not term-and-document-specific. Also, it seems that after all this time Lucene should have efficient hit highlighting as a standard package. Is there any interest in seeing a contribution in the sandbox for this if it uses the index positions?
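For what it's worth, the per-document positions are reachable through TermPositions by walking to the document you care about. A rough sketch against the 1.3/1.4 API (note these are token positions, not character offsets, so they would still need to be mapped back onto the stored text for highlighting):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class DocTermPositions {
  // Print the positions of one term inside one document.
  public static void dump(IndexReader reader, Term term, int docId) throws IOException {
    TermPositions positions = reader.termPositions(term);
    try {
      while (positions.next()) {
        if (positions.doc() != docId) {
          continue;                      // not the document we want
        }
        int freq = positions.freq();     // occurrences of the term in this doc
        for (int i = 0; i < freq; i++) {
          System.out.println(term.text() + " @ position " + positions.nextPosition());
        }
        break;
      }
    } finally {
      positions.close();
    }
  }
}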
[patch] MultiSearcher should support getSearchables()
Seems to only make sense to allow a caller to find the searchables a MultiSearcher was created with:

> diff -uN MultiSearcher.java.bak MultiSearcher.java
--- MultiSearcher.java.bak	2004-03-30 14:57:41.660109642 -0800
+++ MultiSearcher.java	2004-03-30 14:57:46.530330183 -0800
@@ -208,4 +208,8 @@
     return searchables[i].explain(query,doc-starts[i]); // dispatch to searcher
   }
 
+  public Searchable[] getSearchables() {
+    return searchables;
+  }
+
 }
Re: BooleanQuery$TooManyClauses
hui wrote: Hi, I have a range query on the date, like [20011201 TO 20040201], and it works fine with Lucene 1.3 RC1. When I upgrade to 1.3 final, I sometimes get a "BooleanQuery$TooManyClauses" exception, no matter whether the index was created by 1.3 RC1 or 1.3 final. Checking the email archive, it seems related to maxClauseCount. Is increasing maxClauseCount the only way to avoid this issue in 1.3 final? The dev mailing list has some discussion of future plans for this. I've noticed the same problem... The strange thing is that it only happens on some queries. For example, the query "blog" results in this exception, but the query for "linux" in my index works just fine. This is the stack trace if anyone's interested:

org.apache.lucene.search.BooleanQuery$TooManyClauses
        at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:109)
        at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:101)
        at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:99)
        at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
        at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
        at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:188)
        at org.apache.lucene.search.Query.weight(Query.java:120)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:128)
        at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:150)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
        at org.apache.lucene.search.Hits.<init>(Hits.java:80)
        at org.apache.lucene.search.Searcher.search(Searcher.java:71)

For the record, I'm also using a date range, but I disabled it and still noticed the same behavior. Kevin
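The MultiTermQuery.rewrite frame in the trace is the clue: a wildcard/prefix/range part of the query expands into one BooleanQuery clause per matching term, so whether the limit trips depends on how many terms happen to match, which is why "blog" can fail while "linux" is fine. The limit itself can be raised; a minimal sketch (in 1.4 it's a static setter, while some 1.3-era builds expose the public static field BooleanQuery.maxClauseCount instead), though a filter is usually the cleaner fix for wide date ranges:

import org.apache.lucene.search.BooleanQuery;

public class ClauseLimit {
  public static void raise() {
    // 1.4 API; in some 1.3-era builds assign to the public static field
    // BooleanQuery.maxClauseCount instead of calling a setter.
    BooleanQuery.setMaxClauseCount(16 * 1024);
  }
}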
Re: Lucene optimization with one large index and numerous small indexes.
Doug Cutting wrote: How long is it taking to merge your 5GB index? Do you have any stats about disk utilization during the merge (seeks/second, bytes transferred/second)? Did you try buffer sizes even larger than 1MB? Are you writing to a different disk, as suggested? I'll do some more testing tonight and get back to you. Note that right now this var is final and not public... so that will probably need to change. Perhaps. I'm reluctant to make it too easy to change this. People tend to randomly tweak every available knob and then report bugs, or, if it doesn't crash, start recommending that everyone else tweak the knob as they do. There are lots of tradeoffs with buffer size, and cases that folks might not think of (like the fact that a wildcard query creates a buffer for every term that matches), etc. Or you can do what I do and recompile ;) Does it make sense to also increase OutputStream.BUFFER_SIZE? This would seem to make sense, since an optimize is a large number of reads and writes. It might help a little if you're merging to the same disk you're reading from, but probably not a lot. If you're merging to a different disk then it shouldn't make much difference at all. Right now we are merging to the same disk... I'll run some real benchmarks with this var too. Long term we're going to migrate to two SCSI disks per machine and then do parallel queries across them with optimized indexes. Also, with modern disk controllers and filesystems I'm not sure how much difference this should make. Both Reiser and XFS do a lot of internal buffering, as does our disk controller. I guess I'll find out... Kevin
Re: Is RangeQuery more efficient than DateFilter?
Erik Hatcher wrote: One more point... caching is done by the IndexReader used for the search, so you will need to keep that instance (i.e. the IndexSearcher) around to benefit from the caching. Great... Damn... I looked at the source of CachingWrapperFilter and it makes sense. Thanks for the pointer. The results were pretty amazing. Here are the numbers before and after for the query "Jakarta"; times are in milliseconds:

Before caching the filter: 2238 1910 1899 1901 1904 1906
After caching the filter:  2253 10 6 8 6 6

That's a HUGE difference :) I'm very happy :)
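The pattern, roughly (a sketch assuming the 1.4-era classes; rawFilter would be the DateFilter or QueryFilter being discussed, and the class name here is made up):

import java.io.IOException;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class CachedFilterSearch {
  // The cached bits are keyed by IndexReader, so both the searcher and the
  // wrapped filter must be long-lived for later searches to hit the cache.
  private final IndexSearcher searcher;
  private final Filter cachedFilter;

  public CachedFilterSearch(IndexSearcher searcher, Filter rawFilter) {
    this.searcher = searcher;
    this.cachedFilter = new CachingWrapperFilter(rawFilter);
  }

  public Hits search(Query query) throws IOException {
    // The first call pays the full cost of building the filter's bit set;
    // subsequent calls on the same reader reuse it.
    return searcher.search(query, cachedFilter);
  }
}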
Re: Tracking/Monitoring Search Terms in Lucene
Katie Lord wrote: I am trying to figure out how to track the search terms that visitors are using on our site on a monthly basis. Do you all have any suggestions? Don't use Lucene for this... just have your search form record the terms as they are submitted. Kevin
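That is, capture the term where the query string arrives, before it ever reaches Lucene. A minimal sketch with a servlet (the class name and parameter name are made up; any logging mechanism would do):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SearchServlet extends HttpServlet {
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    String queryString = request.getParameter("q");
    // Record the raw term before searching; roll the log up monthly offline.
    getServletContext().log("search-term: " + queryString);
    // ... parse with QueryParser and run the Lucene search as usual ...
  }
}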
Re: Lucene optimization with one large index and numerous small indexes.
Doug Cutting wrote: One way to force larger read-aheads might be to pump up Lucene's input buffer size. As an experiment, try increasing InputStream.BUFFER_SIZE to 1024*1024 or larger. You'll want to do this just for the merge process and not for searching and indexing. That should help you spend more time doing transfers and waste less on seeks. If that helps, then perhaps we ought to make this settable via a system property or some such. Good suggestion... it seems about 10%-15% faster in a few strawman benchmarks I ran. Note that right now this var is final and not public... so that will probably need to change. Does it make sense to also increase OutputStream.BUFFER_SIZE? This would seem to make sense, since an optimize is a large number of reads and writes. I'm obviously willing to throw memory at the problem.
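Since the constant is final and not public, the experiment means editing the Lucene source and rebuilding for the merge-only run. A sketch of the change, assuming the 1.4-era declaration in org.apache.lucene.store.InputStream looks roughly like this (OutputStream.java carries the matching write-side constant mentioned above):

Index: src/java/org/apache/lucene/store/InputStream.java
-  static final int BUFFER_SIZE = 1024;
+  static final int BUFFER_SIZE = 1024 * 1024; // 1 MB read buffer, merge-only build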