[jira] Updated: (LUCENE-2189) Simple9 (de)compression
[ https://issues.apache.org/jira/browse/LUCENE-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Elschot updated LUCENE-2189:
---------------------------------
    Attachment: LUCENE-2189a.patch

Simple9 encoder/decoder and passing tests. This 2189a patch still has a FIXME at the encoder to not use more elements than given.

Simple9 (de)compression
-----------------------
                Key: LUCENE-2189
                URL: https://issues.apache.org/jira/browse/LUCENE-2189
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
           Reporter: Paul Elschot
           Priority: Minor
        Attachments: LUCENE-2189a.patch

Simple9 is an alternative for VInt.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
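For readers new to the scheme, a minimal self-contained sketch of Simple9 word packing follows. This is not the code in LUCENE-2189a.patch; the class and method names are illustrative, and the 9-case selector table used is the commonly cited layout (28x1-bit through 1x28-bit values per 32-bit word).

```java
// Illustrative sketch of Simple9, not the LUCENE-2189a.patch code.
// Each 32-bit word holds a 4-bit selector ("status") plus 28 data bits;
// the 9 selectors trade count against width:
// 28x1, 14x2, 9x3, 7x4, 5x5, 4x7, 3x9, 2x14, 1x28 bits.
public class Simple9Sketch {
    static final int[] NUM  = {28, 14, 9, 7, 5, 4, 3, 2, 1};
    static final int[] BITS = { 1,  2, 3, 4, 5, 7, 9, 14, 28};

    // Pack NUM[selector] values from in[offset..] into one word; callers
    // must ensure each value fits in BITS[selector] bits.
    static int pack(int selector, int[] in, int offset) {
        int word = selector << 28;
        for (int i = 0; i < NUM[selector]; i++) {
            word |= in[offset + i] << (i * BITS[selector]);
        }
        return word;
    }

    static int[] unpack(int word) {
        int selector = word >>> 28;           // status bits, shifted down
        int bits = BITS[selector];
        int mask = (int) ((1L << bits) - 1);
        int[] out = new int[NUM[selector]];
        for (int i = 0; i < out.length; i++) {
            out[i] = (word >>> (i * bits)) & mask;
        }
        return out;
    }
}
```

A real encoder additionally has to pick, for each run of input values, the smallest selector whose bit width covers them all, which is where the FIXME about not using more elements than given lives.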
[jira] Commented: (LUCENE-2189) Simple9 (de)compression
[ https://issues.apache.org/jira/browse/LUCENE-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796111#action_12796111 ]

Uwe Schindler commented on LUCENE-2189:
---------------------------------------

Just a comment on the switch: as far as I know, Java switch statements are very fast if there are few cases and these cases are near together, and therefore small numbers. I would suggest not switching on the raw ANDed status, but instead shifting the status >>> 28 (and removing the AND), and then only listing the raw status values 0..9 in the switch statement.
[jira] Issue Comment Edited: (LUCENE-2189) Simple9 (de)compression
[ https://issues.apache.org/jira/browse/LUCENE-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796111#action_12796111 ]

Uwe Schindler edited comment on LUCENE-2189 at 1/4/10 8:11 AM:
---------------------------------------------------------------

Just a comment on the switch: as far as I know, Java switch statements are very fast if there are few cases and these cases are near together, and therefore small numbers. I would suggest not switching on the raw ANDed status, but instead shifting the status >>> 28 (and removing the AND), and then only listing the status values 0..9 in the switch statement.

was (Author: thetaphi): (same comment, reading "the raw status values" instead of "the status values")
[jira] Commented: (LUCENE-2189) Simple9 (de)compression
[ https://issues.apache.org/jira/browse/LUCENE-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796112#action_12796112 ]

Uwe Schindler commented on LUCENE-2189:
---------------------------------------

Here is the explanation: [http://java.sun.com/docs/books/jvms/first_edition/html/Compiling.doc.html]
[jira] Commented: (LUCENE-2189) Simple9 (de)compression
[ https://issues.apache.org/jira/browse/LUCENE-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796113#action_12796113 ]

Paul Elschot commented on LUCENE-2189:
--------------------------------------

About the switch: I had the shift down in there initially, but then I left it out for speed of decoding. I could move the status bits to the lower part so that the shift is not needed at all, if that does not affect data decoding. I'll have a look at it. Thanks.
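The point of Uwe's suggestion can be sketched with a hypothetical method (not from the patch): after shifting with `>>> 28`, the case labels form the small dense range 0..8, which javac compiles to an O(1) tableswitch instead of a lookupswitch over sparse 32-bit values.

```java
// Hypothetical illustration of the suggestion above: switch on the
// shifted-down status so the case labels are the dense range 0..8,
// which javac can compile to a tableswitch.
public class SelectorSwitch {
    static int valuesPerWord(int word) {
        switch (word >>> 28) {   // selector now in 0..8
            case 0: return 28;   // 28 x 1-bit values
            case 1: return 14;
            case 2: return 9;
            case 3: return 7;
            case 4: return 5;
            case 5: return 4;
            case 6: return 3;
            case 7: return 2;
            case 8: return 1;    // 1 x 28-bit value
            default: throw new IllegalArgumentException("bad selector");
        }
    }
}
```

Paul's alternative of keeping the status bits in the low 4 bits would replace `word >>> 28` with `word & 15`, trading the shift for a mask.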
[jira] Created: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
CustomScoreQuery (function query) is broken (due to per-segment searching)
--------------------------------------------------------------------------
                Key: LUCENE-2190
                URL: https://issues.apache.org/jira/browse/LUCENE-2190
            Project: Lucene - Java
         Issue Type: Bug
         Components: Search
   Affects Versions: 3.0, 2.9.1, 2.9, 3.0.1, 3.1
           Reporter: Michael McCandless
           Assignee: Michael McCandless
            Fix For: 2.9.2, 3.0.1, 3.1

Spinoff from here: http://lucene.markmail.org/message/psw2m3adzibaixbq

With the cutover to per-segment searching, CustomScoreQuery is not really usable anymore, because the per-doc custom scoring method (customScore) receives a per-segment docID, yet there is no way to figure out which segment you are currently searching. I think to fix this we must also notify the subclass whenever a new segment is switched to. I think if we copy Collector.setNextReader, that would be sufficient. It would by default do nothing in CustomScoreQuery, but a subclass could override.
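The proposed fix can be sketched without any Lucene types. The class and method names below are illustrative only (not Lucene's API): a setNextReader-style hook fires on each segment switch, handing the subclass the segment's doc base so per-segment docIDs become resolvable again.

```java
// API-free sketch of the fix described above: notify the scorer whenever a
// new segment is switched to, so it can rebase per-segment docIDs.
// SegmentAwareScoring, setNextSegment and toGlobalDocId are illustrative
// names, not Lucene's.
public class SegmentAwareScoring {
    private int docBase;  // first global docID of the current segment

    // analogous to Collector.setNextReader: the default just records the
    // base; a subclass could override to load per-segment state
    void setNextSegment(int docBase) { this.docBase = docBase; }

    // customScore receives a per-segment docID; with the hook above the
    // subclass can recover the global docID when it needs one
    int toGlobalDocId(int segmentDocId) {
        return docBase + segmentDocId;
    }
}
```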
[jira] Created: (LUCENE-2191) rename Tokenizer.reset(Reader) to Tokenizer.setReader(Reader)
rename Tokenizer.reset(Reader) to Tokenizer.setReader(Reader)
-------------------------------------------------------------
                Key: LUCENE-2191
                URL: https://issues.apache.org/jira/browse/LUCENE-2191
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Analysis
           Reporter: Robert Muir
           Priority: Minor

In TokenStream there is a reset() method, but the method in Tokenizer used to set a new Reader is called reset(Reader). In my opinion this name overloading creates a lot of confusion, and we see things like reset(Reader) calling reset() even in StandardTokenizer. So I think this would be some work to fulfill all the backwards compatibility, but worth it: when you look at the existing reset(Reader) and reset() code in various tokenizers, or the javadocs for Tokenizer, it's pretty confusing and inconsistent.
[jira] Reopened: (LUCENE-2079) Further improvements to contrib/benchmark for testing NRT
[ https://issues.apache.org/jira/browse/LUCENE-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reopened LUCENE-2079:
----------------------------------------

The BG thread priority is not finding its way down to the parallel threads, and is causing the nightly build to sometimes hang. I've disabled the test case for now...

Further improvements to contrib/benchmark for testing NRT
---------------------------------------------------------
                Key: LUCENE-2079
                URL: https://issues.apache.org/jira/browse/LUCENE-2079
            Project: Lucene - Java
         Issue Type: Improvement
         Components: contrib/benchmark
           Reporter: Michael McCandless
           Assignee: Michael McCandless
           Priority: Minor
            Fix For: 3.1
        Attachments: LUCENE-2079.patch

Some small changes:

* Allow specifying a priority for BG threads, after the & character; the priority increment is a + or - int that's added to the main thread's priority to set the child thread's. For my NRT tests I make the reopen thread +2, the indexing threads +1, and leave searching threads at their default.
* Added test case.
* NearRealTimeReopenTask now reports at the end the full array of msec of each reopen latency.
* Added optional breakout of counts by time steps. If you set log.time.step.msec to e.g. 1000, then reported counts for a serial task sequence are broken out by 1-second windows. E.g. you can use this to measure slowdown over time.
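The bug behind the reopen is that a priority increment applied to a parent BG thread must also be applied when that thread spawns parallel children. A generic sketch of the propagation (this is not the contrib/benchmark code; the class and method names are illustrative):

```java
// Generic sketch of propagating a priority increment to child threads,
// which is what the reopened issue says is not happening for the parallel
// threads. Not the actual contrib/benchmark code.
public class PrioritySketch {
    static Thread spawnWithIncrement(Runnable task, int increment) {
        Thread t = new Thread(task);
        int p = Thread.currentThread().getPriority() + increment;
        // clamp to the legal range so e.g. +2 on an already-high parent is safe
        t.setPriority(Math.max(Thread.MIN_PRIORITY,
                      Math.min(Thread.MAX_PRIORITY, p)));
        return t;
    }
}
```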
[jira] Updated: (LUCENE-2188) A handy utility class for tracking deprecated overridden methods
[ https://issues.apache.org/jira/browse/LUCENE-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2188:
----------------------------------
    Attachment:     (was: LUCENE-2188.patch)

A handy utility class for tracking deprecated overridden methods
----------------------------------------------------------------
                Key: LUCENE-2188
                URL: https://issues.apache.org/jira/browse/LUCENE-2188
            Project: Lucene - Java
         Issue Type: New Feature
           Reporter: Uwe Schindler
           Assignee: Uwe Schindler
            Fix For: 3.1
        Attachments: LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch, LUCENE-2188.patch

This issue provides a new handy utility class that keeps track of overridden deprecated methods in non-final subclasses. This class can be used in new deprecations. See the javadocs for an example.
[jira] Updated: (LUCENE-2188) A handy utility class for tracking deprecated overridden methods
[ https://issues.apache.org/jira/browse/LUCENE-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2188:
----------------------------------
    Attachment: LUCENE-2188.patch

New patch; the previous one had the compare method in the wrong order. Fixed docs, Analyzer, and tests. I always get totally confused when using compareTo() and compare() :-(
Build failed in Hudson: Lucene-trunk #1052
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1052/changes

Changes:

[rmuir] LUCENE-2185: add @Deprecated annotations
[rmuir] LUCENE-2084: remove Byte/CharBuffer wrapping for collation key generation
[rmuir] LUCENE-2034: Refactor analyzer reuse and stopword handling

------------------------------------------
[...truncated 29995 lines...]
      [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/regex/lucene-regex-2010-01-04_02-04-49-javadoc.jar
     [echo] Building remote...

javadocs:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-remote
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.lucene.search...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_14
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-remote/stylesheet.css...
  [javadoc] Note: Custom tags that could override future standard tags: @todo. To avoid potential overrides, use at least one period character (.) in custom tag names.
  [javadoc] Note: Custom tags that were not seen: @todo, @uml.property
      [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/remote/lucene-remote-2010-01-04_02-04-49-javadoc.jar
     [echo] Building snowball...

javadocs:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-snowball
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.lucene.analysis.snowball...
  [javadoc] Loading source files for package org.tartarus.snowball...
  [javadoc] Loading source files for package org.tartarus.snowball.ext...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_14
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-snowball/stylesheet.css...
  [javadoc] Note: Custom tags that could override future standard tags: @todo. To avoid potential overrides, use at least one period character (.) in custom tag names.
  [javadoc] Note: Custom tags that were not seen: @todo, @uml.property
      [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/snowball/lucene-snowball-2010-01-04_02-04-49-javadoc.jar
     [echo] Building spatial...

javadocs:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-spatial
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.lucene.spatial.geohash...
  [javadoc] Loading source files for package org.apache.lucene.spatial.geometry...
  [javadoc] Loading source files for package org.apache.lucene.spatial.geometry.shape...
  [javadoc] Loading source files for package org.apache.lucene.spatial.tier...
  [javadoc] Loading source files for package org.apache.lucene.spatial.tier.projections...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_14
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-spatial/stylesheet.css...
  [javadoc] Note: Custom tags that could override future standard tags: @todo. To avoid potential overrides, use at least one period character (.) in custom tag names.
  [javadoc] Note: Custom tags that were not seen: @todo, @uml.property
      [jar] Building jar: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/contrib/spatial/lucene-spatial-2010-01-04_02-04-49-javadoc.jar
     [echo] Building spellchecker...

javadocs:
    [mkdir] Created dir: http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build/docs/api/contrib-spellchecker
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package org.apache.lucene.search.spell...
  [javadoc] Constructing Javadoc information...
  [javadoc] Standard Doclet version 1.5.0_14
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...
  [javadoc] Generating
[jira] Created: (LUCENE-2192) Memory Leak
Memory Leak
-----------
                Key: LUCENE-2192
                URL: https://issues.apache.org/jira/browse/LUCENE-2192
            Project: Lucene - Java
         Issue Type: Bug
         Components: Index
   Affects Versions: 2.9
           Reporter: Ramazan VARLIKLI

Hi all,

I have been working on a problem with Lucene and have now given up after trying many different possibilities, which gives me a feeling that there is a bug here. The scenario: we have a CMS application into which we add new content every week. Instead of updating the index, which is a bit tricky, I prefer to delete all index documents and add them again, which is straightforward. The problem is that Lucene somehow doesn't delete the old data, and the index size increases every time during the update. I also profiled it with Java tools and saw that even after I close the IndexWriter and it becomes eligible for garbage collection, it holds all the docs in memory.

Here is the code I use:

    Directory directory = new SimpleFSDirectory(new File(path));
    writer = new IndexWriter(directory, analyzer, false, IndexWriter.MaxFieldLength.LIMITED);
    writer.deleteAll();
    // after adding docs, close the indexwriter
    writer.close();

The above code is invoked every time we need to update the index. I tried many different scenarios to overcome the problem, including physically removing the index directory (see how desperate I am), optimizing, flushing, committing the IndexWriter, the create=true parameter, and so on.

Here are the index file sizes during creation. If I shut down the application and restart it, the index size starts at 2,458 KB, which is the correct size.
Any help will be appreciated.

    _17.cfs 2,458 KB
    _18.cfs 3,990 KB
    _19.cfs 5,149 KB

Here are the Lucene logs during creation of the index files, 3 times in a row:

IFD [http-8080-1]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@6649
IW 0 [http-8080-1]: setInfoStream: dir=org.apache.lucene.store.simplefsdirect...@c:\Documents and Settings\rvarlikli\workspace\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\wtpwebapps\Clipbank3.5\lucene autoCommit=false mergepolicy=org.apache.lucene.index.logbytesizemergepol...@3b626c mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@baa6ba ramBufferSizeMB=16.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 maxFieldLength=1 index=
IW 0 [http-8080-1]: now flush at close
IW 0 [http-8080-1]: flush: segment=_17 docStoreSegment=_17 docStoreOffset=0 flushDocs=true flushDeletes=true flushDocStores=true numDocs=2765 numBufDelTerms=0
IW 0 [http-8080-1]: index before flush
IW 0 [http-8080-1]: DW: flush postings as segment _17 numDocs=2765
IW 0 [http-8080-1]: DW: closeDocStore: 2 files to flush to segment _17 numDocs=2765
IW 0 [http-8080-1]: DW: oldRAMSize=7485440 newFlushedSize=2472818 docs/MB=1,172.473 new/old=33.035%
IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = false]
IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = false]
IFD [http-8080-1]: delete _17.fdx
IFD [http-8080-1]: delete _17.tis
IFD [http-8080-1]: delete _17.frq
IFD [http-8080-1]: delete _17.nrm
IFD [http-8080-1]: delete _17.fdt
IFD [http-8080-1]: delete _17.fnm
IFD [http-8080-1]: delete _17.tii
IFD [http-8080-1]: delete _17.prx
IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = false]
IW 0 [http-8080-1]: LMP: findMerges: 1 segments
IW 0 [http-8080-1]: LMP: level 6.2247195 to 6.400742: 1 segments
IW 0 [http-8080-1]: CMS: now merge
IW 0 [http-8080-1]: CMS: index: _17:c2765
IW 0 [http-8080-1]: CMS: no more merges pending; now return
IW 0 [http-8080-1]: CMS: now merge
IW 0 [http-8080-1]: CMS: index: _17:c2765
IW 0 [http-8080-1]: CMS: no more merges pending; now return
IW 0 [http-8080-1]: now call final commit()
IW 0 [http-8080-1]: startCommit(): start sizeInBytes=0
IW 0 [http-8080-1]: startCommit index=_17:c2765 changeCount=5
IW 0 [http-8080-1]: now sync _17.cfs
IW 0 [http-8080-1]: done all syncs
IW 0 [http-8080-1]: commit: pendingCommit != null
IW 0 [http-8080-1]: commit: wrote segments file segments_1k
IFD [http-8080-1]: now checkpoint segments_1k [1 segments ; isCommit = true]
IFD [http-8080-1]: deleteCommits: now decRef commit segments_1j
IFD [http-8080-1]: delete _16.cfs
IFD [http-8080-1]: delete segments_1j
IW 0 [http-8080-1]: commit: done
IW 0 [http-8080-1]: at close: _17:c2765
IFD [http-8080-1]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@fb1ba7
IW 1 [http-8080-1]: setInfoStream: dir=org.apache.lucene.store.simplefsdirect...@c:\Documents and Settings\rvarlikli\workspace\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\wtpwebapps\Clipbank3.5\lucene autoCommit=false mergepolicy=org.apache.lucene.index.logbytesizemergepol...@1d49559 mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@1990e2d ramBufferSizeMB=16.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 maxFieldLength=1
[jira] Commented: (LUCENE-2181) benchmark for collation
[ https://issues.apache.org/jira/browse/LUCENE-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796181#action_12796181 ]

Steven Rowe commented on LUCENE-2181:
-------------------------------------

{quote}
bq. ... these four files don't have Apache2 license declarations in them. We should put a README (or something like it) with these files to indicate the license.

Are they really apache license? or derived from wikipedia content?... I don't think we should be putting apache license headers in these files
{quote}

Hmm, I just assumed that since these files were not (anything even close to) verbatim copies, they were independently licensable new works, but it's definitely more complicated than that... This looks like the place to start where licensing is concerned: http://en.wikipedia.org/wiki/Wikipedia_Copyright

My (way non-expert) reading of this is that Wikipedia-derived works (and I'm pretty sure these frequency lists qualify as such) must be licensed under the [Creative Commons Attribution-Share Alike 3.0 Unported license|http://creativecommons.org/licenses/by-sa/3.0/], which does not appear to me to be entirely compatible with the Apache2 license. So I agree with you :) - with the caveat that some form of attribution and a pointer to licensing info should be included with these files.

benchmark for collation
-----------------------
                Key: LUCENE-2181
                URL: https://issues.apache.org/jira/browse/LUCENE-2181
            Project: Lucene - Java
         Issue Type: New Feature
         Components: contrib/benchmark
           Reporter: Robert Muir
           Assignee: Robert Muir
        Attachments: LUCENE-2181.patch.zip

Steven Rowe attached a contrib/benchmark-based benchmark for collation (both JDK and ICU) under LUCENE-2084, along with some instructions to run it... I think it would be nice if we could turn this into a committable patch and add it to benchmark.
Re: Build failed in Hudson: Lucene-trunk #1052
This was the build I killed, because it was hung in contrib/benchmark's TestPerfTasksLogic.testBGSearchThreads.

Mike

On Mon, Jan 4, 2010 at 8:13 AM, Apache Hudson Server <hud...@hudson.zones.apache.org> wrote:
> See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1052/changes
> [quoted build log snipped; identical to the report above]
[jira] Updated: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Toke Eskildsen updated LUCENE-1990:
-----------------------------------
    Attachment: ba.zip

I made some small tweaks to improve performance and added long[]-backed versions of Packed (optimal space) and Aligned (no values span underlying blocks), then ran the performance tests on 5 different computers. It seems very clear that level 2 cache (and presumably RAM speed, but I do not know how to determine that without root access on a Linux box) plays a bigger role for access speed than mere CPU speed. One 3GHz machine with 1MB of level 2 cache was about half as fast as a 1.8GHz laptop with 2MB of level 2 cache. There is a whole lot of measurements and it is getting hard to digest; I've attached logs from the 5 computers, should anyone want to have a look. Some observations are:

1. The penalty of using long[] instead of int[] on my 32-bit laptop depends on the number of values in the array. For less than a million values it is severe: the long[] version is 30-60% slower, depending on whether packed or aligned values are used. Above that, it was 10% slower for Aligned, 25% slower for Packed. On the other hand, 64-bit machines do not seem to care that much whether int[] or long[] is used: there was a 10% win for arrays below 1M on one machine, a 50% win for arrays below 100K (8% for 1M, 6% for 10M) on another, and a small loss of below 1% for all lengths above 10K on a third.

2. There's a fast drop-off in speed when the array reaches a certain size that is correlated to level 2 cache size. After that, the speed does not decrease much when the array grows. This also affects direct writes to an int[] and has the interesting implication that a packed array out-performs the direct-access approach for writes in a number of cases. For reads, there's no contest: direct access to int[] is blazingly fast.

3. The access speed of the different implementations converges when the number of values in the array rises (think 10M+ values): the slow round-trip to main memory dwarfs the logic used for value extraction.

Observation #3 supports Mike McCandless's choice of going for the packed approach, and #1 suggests using int[] as the internal structure for now. Using int[] as the internal structure makes it unfeasible to accept longs as input (or rather: longs with more than 32 significant bits). I don't know if this is acceptable?

Add unsigned packed int impls in oal.util
-----------------------------------------
                Key: LUCENE-1990
                URL: https://issues.apache.org/jira/browse/LUCENE-1990
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
           Reporter: Michael McCandless
           Priority: Minor
        Attachments: ba.zip

There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. E.g. the terms dict index in the standard codec in LUCENE-1458 could substantially reduce its RAM usage. FieldCache.StringIndex could as well. And I think load-into-RAM codecs like the one in TestExternalCodecs could use this too. I'm picturing something very basic like:

{code}
interface PackedUnsignedLongs {
  long get(long index);
  void set(long index, long value);
}
{code}

Plus maybe an iterator for getting, and maybe also for setting. If it helps, most of the usages of this inside Lucene will be write-once, so e.g. the set could make that an assumption/requirement. And a factory somewhere:

{code}
PackedUnsignedLongs create(int count, long maxValue);
{code}

I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is a good existing impl that has a compatible license, that'd be great. I don't have time near-term to do this... so if anyone has the itch, please jump!
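The interface sketched in the issue can be fleshed out minimally with an int[] backing, matching observation #1 above. This is an illustrative sketch, not Toke's benchmark code: it assumes write-once semantics (set only ORs bits in) and limits values to at most 31 bits, exactly the int[] trade-off the comment discusses.

```java
// Minimal sketch of a packed unsigned-int store backed by int[] (so at
// most 31 bits per value), in the spirit of the PackedUnsignedLongs
// interface above. Write-once is assumed: set() ORs bits into place.
// Illustrative only, not the code in ba.zip.
public class PackedIntsSketch {
    private final int[] blocks;
    private final int bitsPerValue;

    PackedIntsSketch(int count, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.blocks = new int[(count * bitsPerValue + 31) / 32];
    }

    void set(int index, int value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 5), shift = (int) (bitPos & 31);
        blocks[block] |= value << shift;
        if (shift + bitsPerValue > 32) {        // value spans two blocks
            blocks[block + 1] |= value >>> (32 - shift);
        }
    }

    int get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 5), shift = (int) (bitPos & 31);
        int v = blocks[block] >>> shift;
        if (shift + bitsPerValue > 32) {        // pick up the spilled bits
            v |= blocks[block + 1] << (32 - shift);
        }
        return v & ((1 << bitsPerValue) - 1);
    }
}
```

The "Aligned" variant discussed above differs only in padding each value so the two-block spill branch never triggers, trading space for branch-free access.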
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
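The "Packed" layout discussed in this thread (values stored back to back, so a value may span two underlying blocks) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the attached ba.zip code: the class name PackedLongs is made up here, and the create(count, maxValue) factory follows the interface sketched in the issue description.

```java
// Minimal sketch of a packed unsigned-long store: n-bit values laid out
// contiguously in a long[], so a value may straddle two backing blocks.
public class PackedLongs {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long maskRight; // low bitsPerValue bits set

    public PackedLongs(int valueCount, int bitsPerValue) {
        this.bitsPerValue = bitsPerValue;
        this.maskRight = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        this.blocks = new long[(int) (((long) valueCount * bitsPerValue + 63) / 64)];
    }

    /** Factory in the spirit of the create(count, maxValue) sketch in the issue. */
    public static PackedLongs create(int valueCount, long maxValue) {
        int bits = maxValue == 0 ? 1 : 64 - Long.numberOfLeadingZeros(maxValue);
        return new PackedLongs(valueCount, bits);
    }

    public long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);  // which long holds the low bits
        int offset = (int) (bitPos & 63);  // bit offset within that long
        long value = blocks[block] >>> offset;
        int read = 64 - offset;
        if (read < bitsPerValue) {         // value spans into the next block
            value |= blocks[block + 1] << read;
        }
        return value & maskRight;
    }

    /** Assumes value fits in bitsPerValue bits. */
    public void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        blocks[block] = (blocks[block] & ~(maskRight << offset)) | (value << offset);
        int written = 64 - offset;
        if (written < bitsPerValue) {      // spill the high bits into the next block
            blocks[block + 1] = (blocks[block + 1] & ~(maskRight >>> written))
                              | (value >>> written);
        }
    }
}
```

Note that get/set go through a method call rather than direct array access, so callers never see how many bits each value occupies; that is what lets the impl choose an int[] or long[] backing per platform, as discussed above.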
[jira] Commented: (LUCENE-2192) Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796187#action_12796187 ] Michael McCandless commented on LUCENE-2192: Is there a reader open on the index when you run the above code (calling IndexWriter.deleteAll)?

Memory Leak Key: LUCENE-2192 URL: https://issues.apache.org/jira/browse/LUCENE-2192 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Ramazan VARLIKLI

Hi all, I have been working on a problem with Lucene and have now given up after trying many different possibilities, which gives me the feeling that there is a bug here. The scenario: we have a CMS application into which we add new content every week. Instead of updating the index, which is a bit tricky, I prefer to delete all index documents and add them again, which is straightforward. The problem is that Lucene somehow doesn't delete the old data, and the index size increases every time during the update. I also profiled it with Java tools and saw that even after I close the IndexWriter and let it be garbage collected, it holds all the docs in memory. Here is the code I use:

Directory directory = new SimpleFSDirectory(new File(path));
writer = new IndexWriter(directory, analyzer, false, IndexWriter.MaxFieldLength.LIMITED);
writer.deleteAll();
// after adding docs close the IndexWriter
writer.close();

The above code is invoked every time we need to update the index. I tried many different scenarios to overcome the problem, including physically removing the index directory (see how desperate I am), optimizing, flushing, committing the IndexWriter, the create=true parameter, and so on. Here is the index file size during creation. If I shut down the application and restart it, the index size starts at 2,458 KB, which is the correct size.
Any help will be appreciated.

_17.cfs 2,458 KB
_18.cfs 3,990 KB
_19.cfs 5,149 KB

Here are the Lucene logs during creation of the index files 3 times in a row:

IFD [http-8080-1]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@6649
IW 0 [http-8080-1]: setInfoStream: dir=org.apache.lucene.store.simplefsdirect...@c:\Documents and Settings\rvarlikli\workspace\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\wtpwebapps\Clipbank3.5\lucene autoCommit=false mergepolicy=org.apache.lucene.index.logbytesizemergepol...@3b626c mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@baa6ba ramBufferSizeMB=16.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 maxFieldLength=1 index=
IW 0 [http-8080-1]: now flush at close
IW 0 [http-8080-1]: flush: segment=_17 docStoreSegment=_17 docStoreOffset=0 flushDocs=true flushDeletes=true flushDocStores=true numDocs=2765 numBufDelTerms=0
IW 0 [http-8080-1]: index before flush
IW 0 [http-8080-1]: DW: flush postings as segment _17 numDocs=2765
IW 0 [http-8080-1]: DW: closeDocStore: 2 files to flush to segment _17 numDocs=2765
IW 0 [http-8080-1]: DW: oldRAMSize=7485440 newFlushedSize=2472818 docs/MB=1,172.473 new/old=33.035%
IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = false]
IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = false]
IFD [http-8080-1]: delete _17.fdx
IFD [http-8080-1]: delete _17.tis
IFD [http-8080-1]: delete _17.frq
IFD [http-8080-1]: delete _17.nrm
IFD [http-8080-1]: delete _17.fdt
IFD [http-8080-1]: delete _17.fnm
IFD [http-8080-1]: delete _17.tii
IFD [http-8080-1]: delete _17.prx
IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = false]
IW 0 [http-8080-1]: LMP: findMerges: 1 segments
IW 0 [http-8080-1]: LMP: level 6.2247195 to 6.400742: 1 segments
IW 0 [http-8080-1]: CMS: now merge
IW 0 [http-8080-1]: CMS: index: _17:c2765
IW 0 [http-8080-1]: CMS: no more merges pending; now return
IW 0 [http-8080-1]: CMS: now merge
IW 0 [http-8080-1]: CMS: index: _17:c2765
IW 0 [http-8080-1]: CMS: no more merges pending; now return
IW 0 [http-8080-1]: now call final commit()
IW 0 [http-8080-1]: startCommit(): start sizeInBytes=0
IW 0 [http-8080-1]: startCommit index=_17:c2765 changeCount=5
IW 0 [http-8080-1]: now sync _17.cfs
IW 0 [http-8080-1]: done all syncs
IW 0 [http-8080-1]: commit: pendingCommit != null
IW 0 [http-8080-1]: commit: wrote segments file segments_1k
IFD [http-8080-1]: now checkpoint segments_1k [1 segments ; isCommit = true]
IFD [http-8080-1]: deleteCommits: now decRef commit segments_1j
IFD [http-8080-1]: delete _16.cfs
IFD [http-8080-1]: delete segments_1j
IW 0 [http-8080-1]: commit: done
IW 0 [http-8080-1]: at close: _17:c2765
IFD [http-8080-1]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@fb1ba7
IW 1
[jira] Updated: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toke Eskildsen updated LUCENE-1990: --- Attachment: LUCENE-1990_PerformanceMeasurements20100104.zip
[jira] Updated: (LUCENE-1990) Add unsigned packed int impls in oal.util
[ https://issues.apache.org/jira/browse/LUCENE-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Toke Eskildsen updated LUCENE-1990: --- Attachment: (was: ba.zip)
[jira] Commented: (LUCENE-1313) Near Realtime Search (using a built in RAMDirectory)
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796191#action_12796191 ] Jingkei Ly commented on LUCENE-1313: I've just tried applying this patch to my checked-out version of trunk (revision 895585) but it appears that the PrefixSwitchDirectory class is missing - is there another patch that is needed to get this working? Near Realtime Search (using a built in RAMDirectory) Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 3.1 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Enable near realtime search in Lucene without external dependencies. When RAM NRT is enabled, the implementation adds a RAMDirectory to IndexWriter. Flushes go to the ramdir unless there is no available space. Merges are completed in the ram dir until there is no more available ram. IW.optimize and IW.commit flush the ramdir to the primary directory, all other operations try to keep segments in ram until there is no more space. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
[jira] Commented: (LUCENE-2192) Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796199#action_12796199 ] Ramazan VARLIKLI commented on LUCENE-2192: -- No. Would it affect anything if it were open?
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796200#action_12796200 ] Michael McCandless commented on LUCENE-2186:

bq. Is this patch for flex, as it contains CodecUtils and so on?

Actually it's intended for trunk; I was thinking this should land before flex (it's a much smaller change, and it's isolated from flex), and so I wrote the CodecUtil/BytesRef basic infrastructure, thinking flex would then cut over to them.

{quote} Hmm, so random access would obviously be the preferred approach for SSDs, but with conventional disks I think the performance would be poor? In 1231 I implemented the var-sized CSF with a skip list, similar to a posting list. I think we should add that here too, and we can still keep the additional index that stores the pointers? We could have two readers: one that allows random access and loads the pointers into RAM (or uses MMAP as you mentioned), and a second one that doesn't load anything into RAM, uses the skip lists, and only allows iterator-based access? {quote}

The intention here is for this (index values) to replace the field cache, but not aim (initially at least) to do much more. Ie, it's meant to be RAM resident (either via explicit slurping into RAM or via MMAP), so the SSD or spinning magnets should not be hit on retrieval. If we add an iterator API, I think it should be simpler than the postings API (ie, no seeking; dense iteration, where every doc is visited sequentially).

{quote} It looks like BytesRef is very similar to Payload? Could you use that instead and extend it with the new String constructor and compare methods? {quote}

Good point! I agree. Also, we should use BytesRef when reading the payload from TermsEnum. Actually I think Payload, BytesRef, and TermRef (in flex) should all eventually be merged; of the three names, I like BytesRef the best. With *Enum in flex we can switch to BytesRef. For analysis we should switch PayloadAttribute to BytesRef and deprecate the methods using Payload? Hmmm... but PayloadAttribute is an interface.

{quote} So it looks like with your approach you want to support certain primitive types out of the box, such as byte[], float, int, String? {quote}

Actually, all primitive types (ie, byte/short/int/long are included under int, as well as arbitrary bit precision between those primitive types). Because the API uses a method invocation (eg IntSource.get) instead of direct array access, we can hide how many bits are actually used under the impl. The same is true for float/double (except we can't [easily] do arbitrary bit precision here... just 4 or 8 bytes).

{quote} If someone has custom data types, then they have, similar as with payloads today, the byte[] indirection? {quote}

Right, byte[] is for String, but also for arbitrary (opaque to Lucene) extensibility. The six (separate, package-private) concrete impls should give good efficiency for the different use cases.

{quote} The code I initially wrote for 1231 exposed IndexOutput, so that one can call write*() directly, without having to convert to byte[] first. I think we will also want to do that for 2125 (store attributes in the index). So I'm wondering if this and 2125 should work similarly? {quote}

This is compelling (letting Attrs read/write directly), but I have some questions:

* How would the random-access API work? (Attrs are designed for iteration.) Eg, just providing IndexInput/Output to the Attr isn't quite enough -- the encoding is sometimes context dependent (like frq writes the delta between docIDs; the symbol table is needed when reading/writing deref/sorted). How would I build a random-access API on top of that? captureState-per-doc is too costly. What API would be used to write the shared state, ie, to tell the Attr we are now writing the segment, so you need to dump the symbol table?
* How would the packed ints work? Eg say my ints only need 5 bits. (Attrs are sort of designed for one-value-at-once.)
* How would the symbol-table-based encodings (deref, sorted) work? I guess the attr would need to have some state associated with it, and when I first create the attr I need to pass it the segment name, Directory, etc, so it opens the right files?
* I'm thinking we should still directly support native types, ie, Attrs are there for extensibility beyond native types?
* Exposing a single attr across a multi reader sounds tricky -- LUCENE-2154 (and we need this for flex, which is worrying me!). But it sounds like you and Uwe are making some progress on that (using some under-the-hood Java reflection magic)... and this doesn't directly affect this issue, assuming we don't expose this API at the MultiReader level.

{quote} Thinking out loud: Could we have then attributes with serialize/deserialize methods for
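Since the thread keeps coming back to what BytesRef should look like, here is a rough picture: a byte[] slice with the String constructor and compare method mentioned above. This is a hypothetical simplification for illustration, not the patch's actual class.

```java
import java.nio.charset.StandardCharsets;

// Sketch of a BytesRef-style slice: a byte[] plus offset/length, constructed
// from a String as UTF-8, and compared lexicographically on unsigned bytes
// (which keeps byte order consistent with UTF-8 code point order).
public class BytesRef implements Comparable<BytesRef> {
    public byte[] bytes;
    public int offset;
    public int length;

    public BytesRef(String text) {
        this.bytes = text.getBytes(StandardCharsets.UTF_8);
        this.offset = 0;
        this.length = bytes.length;
    }

    @Override
    public int compareTo(BytesRef other) {
        int end = Math.min(length, other.length);
        for (int i = 0; i < end; i++) {
            int a = bytes[offset + i] & 0xff;             // unsigned byte
            int b = other.bytes[other.offset + i] & 0xff; // unsigned byte
            if (a != b) {
                return a - b;
            }
        }
        return length - other.length; // a shorter prefix sorts first
    }
}
```

The same slice can then serve as a payload, a term, or an opaque custom value, which is why merging Payload, BytesRef, and TermRef into one type is attractive.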
[jira] Commented: (LUCENE-2192) Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796236#action_12796236 ] Michael McCandless commented on LUCENE-2192: An open reader would prevent deletion of the index files... but from your log above, it looks like that's not happening. It's curious, because from the log I can see that _17.cfs and _18.cfs are being deleted. Can you run the oal.index.CheckIndex tool on your 3-segment index and post the output?
[jira] Commented: (LUCENE-2192) Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796251#action_12796251 ] Ramazan VARLIKLI commented on LUCENE-2192: -- Now I am testing it with v3; the result is the same. The first time I create the index files, the output is as follows:

Segments file=segments_1r numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
1 of 1: name=_1e docCount=2765 compound=true hasProx=true numFiles=1 size (MB)=2.348
diagnostics = {os.version=5.1, os=Windows XP, lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=flush, os.arch=x86, java.version=1.6.0_12, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [8 fields]
test: field norms.........OK [8 fields]
test: terms, freq, prox...OK [55843 terms; 505243 terms/docs pairs; 856135 tokens]
test: stored fields.......OK [2765 total field count; avg 1 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
No problems were detected with this index.

The second time:

Opening index @ C:\Documents and Settings\rvarlikli\workspace\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\wtpwebapps\Clipbank3.5\lucene
Segments file=segments_1s numSegments=1 version=FORMAT_DIAGNOSTICS [Lucene 2.9]
1 of 1: name=_1f docCount=2765 compound=true hasProx=true numFiles=1 size (MB)=3.821
diagnostics = {os.version=5.1, os=Windows XP, lucene.version=3.0.0 883080 - 2009-11-22 15:43:58, source=flush, os.arch=x86, java.version=1.6.0_12, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: fields..............OK [8 fields]
test: field norms.........OK [8 fields]
test: terms, freq, prox...OK [55843 terms; 505243 terms/docs pairs; 1712270 tokens]
test: stored fields.......OK [2765 total field count; avg 1 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
No problems were detected with this index.
[jira] Commented: (LUCENE-2192) Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796268#action_12796268 ] Michael McCandless commented on LUCENE-2192: So there seem to be two problems:

* The old _X.cfs files are not getting removed
* Each _X.cfs file is growing in size, even though you sent it exactly the same docs

Is that right?
[jira] Commented: (LUCENE-2192) Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796286#action_12796286 ] Ramazan VARLIKLI commented on LUCENE-2192: -- No, the old _X.cfs files are removed correctly, but the new _X.cfs file grows in size each time. I actually tried to remove all _X.cfs files with java.io calls, but it didn't work: Lucene keeps everything in memory and adds the new documents to it. I just want to point out that this problem only happens within one JVM instance. If you shut it down, it starts from scratch. Memory Leak Key: LUCENE-2192 URL: https://issues.apache.org/jira/browse/LUCENE-2192 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Ramazan VARLIKLI Hi all, I have been working on a problem with Lucene and have now given up after trying many different possibilities, which gives me the feeling that there is a bug here. The scenario: we have a CMS application into which we add new content every week. Instead of updating the index, which is a bit tricky, I prefer to delete all index documents and add them again, which is straightforward. The problem is that Lucene somehow doesn't delete the old data, and the index size grows with every update. I also profiled it with Java tools and saw that even after I close the IndexWriter and let it be garbage collected, it holds all the docs in memory. Here is the code I use:
{code}
Directory directory = new SimpleFSDirectory(new File(path));
writer = new IndexWriter(directory, analyzer, false, IndexWriter.MaxFieldLength.LIMITED);
writer.deleteAll();
// after adding docs, close the IndexWriter
writer.close();
{code}
The above code is invoked every time we need to update the index. I tried many different approaches to overcome the problem, including physically removing the index directory (see how desperate I am), optimizing, flushing, committing the IndexWriter, the create=true parameter, and so on.
Here is the index file size during creation. If I shut down the application and restart it, the index size starts at 2,458 KB, which is the correct size. Any help will be appreciated.
_17.cfs 2,458 KB
_18.cfs 3,990 KB
_19.cfs 5,149 KB
Here are the Lucene logs during creation of the index files, 3 times in a row:
IFD [http-8080-1]: setInfoStream deletionpolicy=org.apache.lucene.index.keeponlylastcommitdeletionpol...@6649
IW 0 [http-8080-1]: setInfoStream: dir=org.apache.lucene.store.simplefsdirect...@c:\Documents and Settings\rvarlikli\workspace\.metadata\.plugins\org.eclipse.wst.server.core\tmp0\wtpwebapps\Clipbank3.5\lucene autoCommit=false mergepolicy=org.apache.lucene.index.logbytesizemergepol...@3b626c mergescheduler=org.apache.lucene.index.concurrentmergeschedu...@baa6ba ramBufferSizeMB=16.0 maxBufferedDocs=-1 maxBuffereDeleteTerms=-1 maxFieldLength=1 index=
IW 0 [http-8080-1]: now flush at close
IW 0 [http-8080-1]: flush: segment=_17 docStoreSegment=_17 docStoreOffset=0 flushDocs=true flushDeletes=true flushDocStores=true numDocs=2765 numBufDelTerms=0
IW 0 [http-8080-1]: index before flush
IW 0 [http-8080-1]: DW: flush postings as segment _17 numDocs=2765
IW 0 [http-8080-1]: DW: closeDocStore: 2 files to flush to segment _17 numDocs=2765
IW 0 [http-8080-1]: DW: oldRAMSize=7485440 newFlushedSize=2472818 docs/MB=1,172.473 new/old=33.035%
IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = false]
IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = false]
IFD [http-8080-1]: delete _17.fdx
IFD [http-8080-1]: delete _17.tis
IFD [http-8080-1]: delete _17.frq
IFD [http-8080-1]: delete _17.nrm
IFD [http-8080-1]: delete _17.fdt
IFD [http-8080-1]: delete _17.fnm
IFD [http-8080-1]: delete _17.tii
IFD [http-8080-1]: delete _17.prx
IFD [http-8080-1]: now checkpoint segments_1j [1 segments ; isCommit = false]
IW 0 [http-8080-1]: LMP: findMerges: 1 segments
IW 0 [http-8080-1]: LMP: level 6.2247195 to 6.400742: 1 segments
IW 0 [http-8080-1]: CMS: now merge
IW 0 [http-8080-1]: CMS: index: _17:c2765
IW 0 [http-8080-1]: CMS: no more merges pending; now return
IW 0 [http-8080-1]: CMS: now merge
IW 0 [http-8080-1]: CMS: index: _17:c2765
IW 0 [http-8080-1]: CMS: no more merges pending; now return
IW 0 [http-8080-1]: now call final commit()
IW 0 [http-8080-1]: startCommit(): start sizeInBytes=0
IW 0 [http-8080-1]: startCommit index=_17:c2765 changeCount=5
IW 0 [http-8080-1]: now sync _17.cfs
IW 0 [http-8080-1]: done all syncs
IW 0 [http-8080-1]: commit: pendingCommit != null
IW 0 [http-8080-1]: commit: wrote segments file segments_1k
IFD [http-8080-1]: now checkpoint segments_1k [1 segments ; isCommit = true]
IFD [http-8080-1]: deleteCommits: now decRef commit segments_1j
IFD [http-8080-1]: delete _16.cfs
IFD [http-8080-1]: delete segments_1j
IW 0 [http-8080-1]: commit: done
IW 0 [http-8080-1]: at close: _17:c2765
IFD [http-8080-1]: setInfoStream
[jira] Commented: (LUCENE-2192) Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796312#action_12796312 ] Michael McCandless commented on LUCENE-2192: OK, so it's only the 2nd problem. From your CheckIndex output, the 2nd segment has precisely 2X the number of tokens as the first segment (and the same number of documents and the same number of unique terms). Can you double-check how you create the Document that you pass to Lucene? Is it possible the field in the Document is just getting added twice? Can you post the code that constructs the document? Memory Leak Key: LUCENE-2192 URL: https://issues.apache.org/jira/browse/LUCENE-2192 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9 Reporter: Ramazan VARLIKLI
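McCandless's double-counting hypothesis can be illustrated without Lucene. Below is a self-contained, hypothetical simulation (whitespace tokenization stands in for an analyzer, and all names are invented for the sketch): adding the same field value to a document twice doubles the token count while leaving the document count and unique-term count unchanged, which matches the CheckIndex symptom described above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Lucene-free simulation: each "document" is a list of field values,
// each field value is tokenized on whitespace.
public class DoubleFieldSketch {

    // Gather stats over all documents:
    // { number of documents, total tokens, unique terms }.
    static int[] stats(List<List<String>> docs) {
        int tokens = 0;
        Set<String> terms = new HashSet<String>();
        for (List<String> doc : docs) {
            for (String fieldValue : doc) {
                for (String tok : fieldValue.split("\\s+")) {
                    tokens++;
                    terms.add(tok);
                }
            }
        }
        return new int[] { docs.size(), tokens, terms.size() };
    }

    public static void main(String[] args) {
        String text = "lucene index memory leak";
        // Correct: the field value is added once per document.
        List<List<String>> once = Arrays.asList(Arrays.asList(text));
        // Buggy: the same field value is added twice to the same document.
        List<List<String>> twice = Arrays.asList(Arrays.asList(text, text));
        System.out.println(Arrays.toString(stats(once)));  // [1, 4, 4]
        System.out.println(Arrays.toString(stats(twice))); // [1, 8, 4]
    }
}
```

Same document count, same unique terms, exactly 2X the tokens: the fingerprint McCandless is asking the reporter to check for.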
[jira] Resolved: (LUCENE-2079) Further improvements to contrib/benchmark for testing NRT
[ https://issues.apache.org/jira/browse/LUCENE-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-2079. Resolution: Fixed Fixed, again. Hopefully nightly build no longer hangs on this test! Further improvements to contrib/benchmark for testing NRT - Key: LUCENE-2079 URL: https://issues.apache.org/jira/browse/LUCENE-2079 Project: Lucene - Java Issue Type: Improvement Components: contrib/benchmark Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 3.1 Attachments: LUCENE-2079.patch Some small changes: * Allow specifying a priority for BG threads, after the character; priority increment is + or - int that's added to main thread's priority to set child thread's. For my NRT tests I make the reopen thread +2, the indexing threads +1, and leave searching threads at their default. * Added test case * NearRealTimeReopenTask now reports @ the end the full array of msec of each reopen latency * Added optional breakout of counts by time steps. If you set log.time.step.msec to eg 1000 then reported counts for serial task sequence is broken out by 1 second windows. EG you can use this to measure slowdown over time. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
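The priority-increment scheme described above (a signed int added to the main thread's priority to set a background thread's priority) can be sketched standalone. This is an illustrative sketch, not code from the patch; in particular, the clamping to Java's legal 1..10 priority range is my assumption:

```java
public class PrioritySketch {

    // Child priority = main thread's priority plus a "+2"/"-1"-style
    // increment, clamped to Java's legal range (assumption of this sketch).
    static int childPriority(int mainPriority, int increment) {
        return Math.min(Thread.MAX_PRIORITY,
                Math.max(Thread.MIN_PRIORITY, mainPriority + increment));
    }

    public static void main(String[] args) throws InterruptedException {
        int reopenIncrement = 2;   // "+2" for the reopen thread in the NRT tests
        int indexingIncrement = 1; // "+1" for the indexing threads

        // Apply the derived priority to an actual background thread.
        Thread reopen = new Thread(() -> {});
        reopen.setPriority(childPriority(Thread.currentThread().getPriority(),
                reopenIncrement));
        reopen.start();
        reopen.join();

        System.out.println(childPriority(5, reopenIncrement));   // 7
        System.out.println(childPriority(5, indexingIncrement)); // 6
        System.out.println(childPriority(10, 2));                // clamped to 10
    }
}
```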
[jira] Assigned: (LUCENE-2127) Improved large result handling
[ https://issues.apache.org/jira/browse/LUCENE-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned LUCENE-2127: --- Assignee: Grant Ingersoll Improved large result handling -- Key: LUCENE-2127 URL: https://issues.apache.org/jira/browse/LUCENE-2127 Project: Lucene - Java Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Per http://search.lucidimagination.com/search/document/350c54fc90d257ed/lots_of_results#fbb84bd297d15dd5, it would be nice to offer some other Collectors that are better at handling really large number of results. This could be implemented in a variety of ways via Collectors. For instance, we could have a raw collector that does no sorting and just returns the ScoreDocs, or we could do as Mike suggests and have Collectors that have heuristics about memory tradeoffs and only heapify when appropriate.
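The "raw collector that does no sorting" idea can be sketched without Lucene. All names below are hypothetical; a real implementation would extend Lucene's Collector and receive doc IDs from a scorer, but the core point, appending each hit in O(1) amortized time instead of maintaining a priority queue, is the same:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a collector that accumulates every hit unsorted.
public class RawCollectorSketch {

    // Minimal stand-in for Lucene's ScoreDoc.
    static final class ScoreDoc {
        final int doc;
        final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final List<ScoreDoc> hits = new ArrayList<ScoreDoc>();

    // Called once per matching document; no heap maintenance, no sorting.
    void collect(int doc, float score) {
        hits.add(new ScoreDoc(doc, score));
    }

    // Hits come back in collection (docID) order, not score order;
    // the caller sorts later, only if and when it needs to.
    List<ScoreDoc> hits() {
        return hits;
    }

    public static void main(String[] args) {
        RawCollectorSketch c = new RawCollectorSketch();
        c.collect(0, 1.5f);
        c.collect(3, 0.7f);
        c.collect(7, 2.1f);
        System.out.println(c.hits().size()); // 3
    }
}
```

A heuristic variant, as suggested in the issue, could start in this append-only mode and switch to a bounded heap once the hit count crosses a memory threshold.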
[jira] Updated: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-2190: --- Attachment: LUCENE-2190.patch Patch attached, adding setNextReader to CustomScoreQuery, and a test case. Also fixed a couple latent test bugs, when run on indexes with more than one segment. CustomScoreQuery (function query) is broken (due to per-segment searching) -- Key: LUCENE-2190 URL: https://issues.apache.org/jira/browse/LUCENE-2190 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.9, 2.9.1, 3.0, 3.0.1, 3.1 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.9.2, 3.0.1, 3.1 Attachments: LUCENE-2190.patch Spinoff from here: http://lucene.markmail.org/message/psw2m3adzibaixbq With the cutover to per-segment searching, CustomScoreQuery is not really usable anymore, because the per-doc custom scoring method (customScore) receives a per-segment docID, yet there is no way to figure out which segment you are currently searching. I think to fix this we must also notify the subclass whenever a new segment is switched to. I think if we copy Collector.setNextReader, that would be sufficient. It would by default do nothing in CustomScoreQuery, but a subclass could override.
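The per-segment docID problem is easy to see in isolation: every segment numbers its documents from 0, so a customScore callback needs the current segment's doc base, which is exactly the kind of information a setNextReader-style hook would deliver. A Lucene-free sketch (names hypothetical):

```java
public class DocBaseSketch {

    // Compute each segment's doc base from the per-segment document counts,
    // e.g. segments of 100, 50, and 25 docs have bases 0, 100, and 150.
    static int[] docBases(int[] segmentSizes) {
        int[] bases = new int[segmentSizes.length];
        int base = 0;
        for (int i = 0; i < segmentSizes.length; i++) {
            bases[i] = base;
            base += segmentSizes[i];
        }
        return bases;
    }

    // What a per-segment scoring callback would do with the doc base it was
    // handed when the search switched to the current segment.
    static int toGlobal(int segmentDoc, int docBase) {
        return docBase + segmentDoc;
    }

    public static void main(String[] args) {
        int[] bases = docBases(new int[] { 100, 50, 25 });
        // Doc 7 of the 3rd segment is index-wide doc 150 + 7 = 157.
        System.out.println(toGlobal(7, bases[2])); // 157
    }
}
```

Without the doc base, a subclass that looks up external per-document data (the common use of CustomScoreQuery) would resolve doc 7 of every segment to the same entry, which is the breakage the issue describes.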
[jira] Commented: (LUCENE-2190) CustomScoreQuery (function query) is broken (due to per-segment searching)
[ https://issues.apache.org/jira/browse/LUCENE-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796343#action_12796343 ] Michael McCandless commented on LUCENE-2190: Patch applies to 2.9.x.
[jira] Updated: (LUCENE-2187) improve lucene's similarity algorithm defaults
[ https://issues.apache.org/jira/browse/LUCENE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2187: Attachment: scoring.pdf Attaching an updated document with results for a 4th test collection, on English. For this one, BM25 did not fare so well. For the lazy, here are the MAP values:
StandardAnalyzer: Default Scoring: 0.3837, BM25 Scoring: 0.3580, Improved Scoring: 0.3994
StandardAnalyzer + Porter: Default Scoring: 0.4333, BM25 Scoring: 0.4131, Improved Scoring: 0.4515
StandardAnalyzer + Porter + MoreLikeThis (top 5 docs): Default Scoring: 0.5234, BM25 Scoring: 0.5087, Improved Scoring: 0.5474
Note that 0.5572 was the highest-performing MAP on this corpus (Microsoft Research) in FIRE 2008: http://www.isical.ac.in/~fire/paper/Udupa-mls-fire2008.pdf improve lucene's similarity algorithm defaults -- Key: LUCENE-2187 URL: https://issues.apache.org/jira/browse/LUCENE-2187 Project: Lucene - Java Issue Type: Improvement Components: Query/Scoring Reporter: Robert Muir Fix For: Flex Branch Attachments: scoring.pdf, scoring.pdf First things first: I am not an IR guy. The goal of this issue is to make 'surgical' tweaks to Lucene's formula to bring its performance up to that of more modern algorithms such as BM25. In my opinion, the concept of having some 'flexible' scoring with good speed across the board is an interesting goal, but not practical in the short term. Instead, here I propose incorporating some work similar to lnu.ltc and friends, but slightly different. I noticed this seems to be in line with the paper published before about the TREC Million Queries track...
Here is what I propose in pseudocode (overriding DefaultSimilarity): {code} @Override public float tf(float freq) { return 1 + (float) Math.log(freq); } @Override public float lengthNorm(String fieldName, int numTerms) { return (float) (1 / ((1 - slope) * pivot + slope * numTerms)); } {code} Where slope is a constant (I used 0.25 for all relevance evaluations: the goal is to have a better default), and pivot is the average field length. Obviously we shouldn't make the user provide this but instead have the system provide it. These two pieces do not improve Lucene much independently, but together they are competitive with BM25 scoring on the test collections I have run so far. The idea here is that this logarithmic tf normalization is independent of the tf / mean TF that you see in some of these algorithms; in fact, I implemented lnu.ltc with cosine pivoted length normalization and the log(tf)/log(mean TF) stuff, and it did not fare as well as this method. This is also simpler: we do not need to calculate the mean TF at all. The BM25-like binary pivot here works better on the test collections I have run, but of course only with the tf modification. I am uploading a document with results from 3 test collections (Persian, Hindi, and Indonesian). I will test at least 3 more languages (yes, including English) across more collections and upload those results also, but I need to process these corpora to run the tests with the benchmark package, so this will take some time (maybe weeks). So please rip it apart with scoring theory etc., but keep in mind that 2 of these 3 test collections are in the openrelevance svn, so if you think you have a great idea, don't hesitate to test it and upload results; this is what it is for. Also keep in mind, again, that I am not a scoring or IR guy; the only thing I can really bring to the table here is the willingness to do a lot of relevance testing!
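The proposed tf and lengthNorm can be tried standalone. The sketch below uses the slope of 0.25 from the proposal; the pivot value is an arbitrary example, since in the proposal it would be the system-supplied average field length, and the fieldName parameter is dropped for brevity:

```java
public class PivotedNormSketch {
    static final float SLOPE = 0.25f; // constant from the proposal
    static final float PIVOT = 10.0f; // example average field length (assumed)

    // 1 + ln(freq): sublinear term-frequency saturation.
    static float tf(float freq) {
        return 1 + (float) Math.log(freq);
    }

    // Pivoted length normalization: a field exactly at the pivot length gets
    // norm 1/pivot; longer fields are penalized, shorter ones boosted, and
    // the slope controls how strongly length matters.
    static float lengthNorm(int numTerms) {
        return (float) (1 / ((1 - SLOPE) * PIVOT + SLOPE * numTerms));
    }

    public static void main(String[] args) {
        System.out.println(tf(1));                           // 1.0
        System.out.println(lengthNorm(10));                  // 0.1 (at the pivot)
        System.out.println(lengthNorm(50) < lengthNorm(10)); // true
    }
}
```

With slope = 0 the norm is a constant 1/pivot (length ignored); with slope = 1 it degenerates to 1/numTerms, so 0.25 sits deliberately between the two extremes.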
[jira] Updated: (LUCENE-2187) improve lucene's similarity algorithm defaults
[ https://issues.apache.org/jira/browse/LUCENE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2187: Attachment: scoring.pdf Sorry, corrected some transposition of axis labels and some grammatical mistakes :)
[jira] Updated: (LUCENE-2187) improve lucene's similarity algorithm defaults
[ https://issues.apache.org/jira/browse/LUCENE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-2187: Attachment: LUCENE-2187.patch Attached is a patch with the Similarity impl. Of course, you have to manually supply the pivot value (avg. doc length) for now.
back_compat folders in tags when I SVN update
Hi folks, Why do I see \java\tags\lucene_*_back_compat_tests_2009*\ directories (well over 100 so far) when I SVN update? Thanks. -- George
Hudson build is back to normal: Lucene-trunk #1053
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/1053/changes