Re: lucene and solr trunk
what other libraries do is have a 'core' or a 'common' bit, which is what the lucene library really is. looking at http://svn.apache.org/repos/asf/lucene/ today I see nearly that, but it's called 'java'. maybe just renaming 'java' to 'core' or 'common' (hadoop uses 'common') would make sense, and let ivy or maven be responsible for pulling in the other parts. as a weekend developer, I would just pull the bit I care about and let ivy or maven get the other bits for me.

btw, having a master 'pom.xml' in http://svn.apache.org/repos/asf/lucene/ could just include the module poms and build them, without having to have nightly jars etc.

as for the goal of doing single commits: I've noticed that most of the discussion has been in the format of /lucene/XYZ/trunk/... and /lucene/ABC/trunk. if this is one code base, would it make sense to have it as /lucene/trunk/ABC and /lucene/trunk/XYZ ?

On 3/18/10 11:33 AM, Chris Hostetter wrote:

: build and nicely gets all dependencies to Lucene and Tika whenever I build
: or release, no problem there and certainly no need to have it merged into
: Lucene's svn!

The key distinction is that Solr is already in Lucene's svn -- the question is how to reorganize things in a way that makes it easier to build Solr and Lucene-Java all at once, while still making it easy to build just Lucene-Java.

: Professionally i work on a (world-class) geocoder that also nicely depends
: on Lucene by using maven, no problems there at all and no need to merge
: that code in Lucene's svn!

Unless maven has some features i'm not aware of, your "nicely depends" works by pulling Lucene jars from a repository -- changing Solr to do that (instead of having committed jars) would be fairly simple (with or w/o maven), but that's not the goal. The goal is to make it easy to build both at once, have patches that update both, and (make it easy to) have atomic svn commits that touch both.
-Hoss - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
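For the master 'pom.xml' idea above, a minimal aggregator pom might look like this. This is only a sketch: the module names assume the 'java'-to-'core' rename suggested in the mail, and the groupId/version are placeholders.

```xml
<!-- Hypothetical aggregator pom for http://svn.apache.org/repos/asf/lucene/ -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-parent</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>pom</packaging>
  <modules>
    <module>core</module>   <!-- today's 'java' directory -->
    <module>solr</module>
  </modules>
</project>
```

With `pom` packaging, a single `mvn install` at the top builds every listed module in dependency order, which is what would let a weekend developer check out only the bit they care about.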
[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals
[ https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846778#action_12846778 ]

Uwe Schindler commented on LUCENE-2326:
---

What did you do for this to happen? You can only reproduce this (and this was also possible with your previous setup) if you go into the data folder and update there. If you update from the top level (outside the data folder), it always works. Maybe the problem lies in the fact that you had the data already checked out before our reorganisation (from previous test runs). Can you simply delete the data folder with your OS's rm and update again? Or maybe it was a problem with the svn server?

Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals
---
Key: LUCENE-2326
URL: https://issues.apache.org/jira/browse/LUCENE-2326
Project: Lucene - Java
Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Fix For: Flex Branch, 3.1
Attachments: LUCENE-2326.patch, LUCENE-2326.patch

As we often need to update backwards tests together with trunk, and always have to update the branch first, record the rev no, and update build.xml, I would simply like to do an svn copy/move of the backwards branch. After a release, this is simply done:
{code}
svn rm backwards
svn cp releasebranch backwards
{code}
By this we can simply commit in one pass and create patches in one pass. The snowball tests are currently downloaded by svn.exe, too. These need a fixed version for checkout. I would like to change this to use svn:externals. Will provide a patch soon.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
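The svn:externals change mentioned in the issue could be wired up along these lines. A sketch only: the snowball repository URL is a guess, and only the -r500 pin is taken from this thread.

```
# Pin the snowball test data to a fixed revision via svn:externals
# (external URL is illustrative, not the actual one used):
svn propset svn:externals \
    'data -r500 https://svn.tartarus.org/snowball/trunk/data' \
    contrib/analyzers/common/src/test/org/apache/lucene/analysis/snowball
svn commit -m "link snowball test data via svn:externals"
```

After this, a plain `svn up` from the top level fetches the external at the pinned revision, with no svn.exe call from build.xml needed.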
[jira] Commented: (LUCENE-2323) reorganize contrib modules
[ https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846798#action_12846798 ]

Michael McCandless commented on LUCENE-2323:

{quote} Until code in contrib is to a certain degree of maturity, I feel we should organize it by functionality. It's easy for the users, and it invites the sort of refactoring and cleanup that some of this code needs. {quote} +1

reorganize contrib modules
--
Key: LUCENE-2323
URL: https://issues.apache.org/jira/browse/LUCENE-2323
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/*
Reporter: Robert Muir

it would be nice to reorganize contrib modules, so that they are bundled together by functionality. For example:
* the wikipedia contrib is a tokenizer; i think it really belongs in contrib/analyzers
* there are two highlighters; i think they could be one highlighters package
* there are many queryparsers and queries in different places in contrib
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846804#action_12846804 ]

Michael McCandless commented on LUCENE-2312:

{quote} Can't we simply throw away the doc writer after a successful segment flush (the IRs would refer to it, however once they're closed, the DW would close as well)? {quote}

I think that should be our first approach. It means no pooling whatsoever. And it means that an app that doesn't aggressively close its old NRT readers will consume more RAM. Though... the NRT readers will be able to search an active DW, right? Ie, it's only when that DW needs to flush that the NRT readers would be tying up the RAM. So, when a flush happens, existing NRT readers will hold a reference to that now-flushed DW, but when they reopen they will cut over to the on-disk segment. I think this will be an OK limitation in practice. Once NRT readers can search a live (still being written) DW, flushing of a DW will be a relatively rare event (unlike today, where we must flush every time an NRT reader is opened).

Search on IndexWriter's RAM Buffer
--
Key: LUCENE-2312
URL: https://issues.apache.org/jira/browse/LUCENE-2312
Project: Lucene - Java
Issue Type: New Feature
Components: Search
Affects Versions: 3.0.1
Reporter: Jason Rutherglen
Assignee: Michael Busch
Fix For: 3.1

In order to offer users near-realtime search, without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Today's Lucene-based NRT systems must incur the cost of merging segments, which can slow indexing. Michael Busch has good suggestions regarding how to handle deletes using max doc ids.
https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923

The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here:

https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915
Re: lucene and solr trunk
Unless maven has some features i'm not aware of, your nicely depends works by pulling Lucene jars from a repository

The 'missing feature' is called multi-module projects.

On Thu, Mar 18, 2010 at 03:33, Chris Hostetter hossman_luc...@fucit.org wrote:

: build and nicely gets all dependencies to Lucene and Tika whenever I build
: or release, no problem there and certainly no need to have it merged into
: Lucene's svn!

The key distinction is that Solr is already in Lucene's svn -- the question is how to reorganize things in a way that makes it easier to build Solr and Lucene-Java all at once, while still making it easy to build just Lucene-Java.

: Professionally i work on a (world-class) geocoder that also nicely depends
: on Lucene by using maven, no problems there at all and no need to merge
: that code in Lucene's svn!

Unless maven has some features i'm not aware of, your "nicely depends" works by pulling Lucene jars from a repository -- changing Solr to do that (instead of having committed jars) would be fairly simple (with or w/o maven), but that's not the goal. The goal is to make it easy to build both at once, have patches that update both, and (make it easy to) have atomic svn commits that touch both.

-Hoss

--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846807#action_12846807 ]

Michael McCandless commented on LUCENE-2329:

This would be great! But, note that term vectors today do not store the term char[] again -- they piggyback on the term char[] already stored for the postings. Though, I believe they store int textStart (increments by term length per unique term), which is less compact than the termID would be (increments +1 per unique term), so if eg we someday use packed ints we'd be more RAM efficient by storing termIDs...

Use parallel arrays instead of PostingList objects
--
Key: LUCENE-2329
URL: https://issues.apache.org/jira/browse/LUCENE-2329
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 3.1

This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324. In order to avoid having very many long-living PostingList objects in TermsHashPerField we want to switch to parallel arrays. The termsHash will simply be an int[] which maps each term to dense termIDs. All data that the PostingList classes currently hold will then be placed in parallel arrays, where the termID is the index into the arrays. This will avoid the need for object pooling and will remove the overhead of object initialization and garbage collection. Especially garbage collection should benefit significantly when the JVM runs out of memory, because in such a situation the gc mark times can get very long if there is a large number of long-living objects in memory.

Another benefit could be to build more efficient TermVectors. We could avoid the need of having to store the term string per document in the TermVector. Instead we could just store the segment-wide termIDs. This would reduce the size and also make it easier to implement efficient algorithms that use TermVectors, because no term mapping across documents in a segment would be necessary. Though we can make this improvement in a separate jira issue.
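As a rough, self-contained illustration of the parallel-arrays idea (all names here are invented for the sketch; the real change lives in TermsHashPerField and is not this simple):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch only: map each unique term to a dense termID, and keep per-term
// state in parallel primitive arrays instead of one PostingList object
// per term. The termID is the index into every array.
class ParallelPostings {
    private final Map<String, Integer> termToID = new HashMap<String, Integer>();
    int[] freqs = new int[8];       // per-term frequency
    int[] lastDocIDs = new int[8];  // last docID seen for the term

    /** Returns the dense termID, assigning the next free one for a new term. */
    int idFor(String term) {
        Integer id = termToID.get(term);
        if (id == null) {
            id = termToID.size();
            termToID.put(term, id);
            if (id >= freqs.length) {   // grow all parallel arrays together
                freqs = Arrays.copyOf(freqs, freqs.length * 2);
                lastDocIDs = Arrays.copyOf(lastDocIDs, lastDocIDs.length * 2);
            }
        }
        return id;
    }

    void addOccurrence(String term, int docID) {
        int id = idFor(term);
        freqs[id]++;
        lastDocIDs[id] = docID;
    }
}
```

The point of the layout is that the GC only ever sees a handful of large arrays rather than one long-living object per term, which is exactly the mark-time problem the issue describes.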
[jira] Updated: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2328:
---
Fix Version/s: 3.1

Anyone wanna cons up a patch here...?

IndexWriter.synced field accumulates data leading to a Memory Leak
---
Key: LUCENE-2328
URL: https://issues.apache.org/jira/browse/LUCENE-2328
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1
Environment: all
Reporter: Gregor Kaczor
Priority: Minor
Fix For: 3.1
Original Estimate: 1h
Remaining Estimate: 1h

I am running into a strange OutOfMemoryError. My small test application indexes and deletes a few files; this is repeated 60k times, and optimization is run after every 2k files indexed. Index size is 50KB. I analyzed the heap dump file and realized that the IndexWriter.synced field occupied more than half of the heap. That field is a private HashSet without a getter. Its task is to hold files which have been synced already. There are two calls to addAll and one call to add on synced, but no remove or clear throughout the lifecycle of the IndexWriter instance. According to the Eclipse Memory Analyzer, synced contains 32618 entries which look like file names, e.g. _e065_1.del or _e067.cfs. The index directory contains only 10 files. I guess synced is holding obsolete data.
Re: IndexWriter.synced field accumulates data
Thanks! Mike

On Wed, Mar 17, 2010 at 3:16 PM, Gregor Kaczor gkac...@gmx.de wrote:

followup in https://issues.apache.org/jira/browse/LUCENE-2328

Original message
Date: Wed, 17 Mar 2010 14:30:25 -0500
From: Michael McCandless luc...@mikemccandless.com
To: java-dev@lucene.apache.org
Subject: Re: IndexWriter.synced field accumulates data

You're right! Really we should delete from sync'd when we delete the files. We need to tie into IndexFileDeleter for that, maybe moving this set into there. Though in practice the amount of actual RAM used should rarely be an issue? But we should fix it... Can you open an issue? Mike

On Wed, Mar 17, 2010 at 1:15 PM, Gregor Kaczor gkac...@gmx.de wrote:

I am running into a strange OutOfMemoryError. My small test application indexes and deletes a few files; this is repeated 60k times, and optimization is run after every 2k files indexed. Index size is 50KB. I analyzed the heap dump file and realized that the IndexWriter.synced field occupied more than half of the heap. That field is a private HashSet without a getter. Its task is to hold files which have been synced already. There are two calls to addAll and one call to add on synced, but no remove or clear throughout the lifecycle of the IndexWriter instance. According to the Eclipse Memory Analyzer, synced contains 32618 entries which look like file names, e.g. _e065_1.del or _e067.cfs. The index directory contains only 10 files. I guess synced is holding obsolete data.
[jira] Updated: (LUCENE-2320) Add MergePolicy to IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-2320:
---
Attachment: LUCENE-2320.patch

Fixed a copy-paste comment error in IndexWriter (introduced in LUCENE-2294).

Add MergePolicy to IndexWriterConfig
Key: LUCENE-2320
URL: https://issues.apache.org/jira/browse/LUCENE-2320
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 3.1
Attachments: LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch

Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as well. The change is not straightforward and so I've kept it for a separate issue. MergePolicy requires an IndexWriter in its ctor, however none can be passed to it before an IndexWriter actually exists. And today IW may create an MP just for it to be overridden by the application one line afterwards. I don't want to make the iw member of MP non-final, or settable by extending classes, however it needs to remain protected so they can access it directly. So the proposed changes are:

* Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized set(T)* and *T get()*. The field of type T will be declared volatile, so that get() won't be synchronized.
* MP will define a *protected final SetOnce<IndexWriter> writer* instead of the current writer. *NOTE: this is a bw break*. Any suggestions are welcomed.
* MP will offer a public default ctor, together with a set(IndexWriter).
* IndexWriter will set itself on MP using set(this). Note that if set is called more than once, it will throw an exception (AlreadySetException - or does someone have a better suggestion, preferably an already existing Java exception?).

That's the core idea. I'd like to post a patch soon, so I'd appreciate your review and proposals.
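The SetOnce described above could be sketched like this. Since the issue deliberately leaves the exception type open, this sketch uses IllegalStateException as a stand-in:

```java
// Sketch of the proposed SetOnce<T>: the value may be set exactly once;
// get() needs no synchronization because the field is volatile.
class SetOnce<T> {
    private volatile T obj;
    private boolean alreadySet = false;

    public synchronized void set(T obj) {
        if (alreadySet) {
            // stand-in for the proposed AlreadySetException
            throw new IllegalStateException("The object cannot be set twice!");
        }
        this.obj = obj;
        alreadySet = true;
    }

    public T get() {
        return obj;
    }
}
```

The volatile read in get() is what keeps the hot path unsynchronized while still guaranteeing that, once set(...) returns, every thread sees the assigned value.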
[jira] Commented: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java
[ https://issues.apache.org/jira/browse/LUCENE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846813#action_12846813 ]

Michael McCandless commented on LUCENE-2327:

This exception looks like index corruption... it would be good to get to the root cause of how this happened. Your terms dict, which records the field number and character data for each term, has somehow recorded a field number of 52 when in fact this segment appears to only have 4 fields. Can you run CheckIndex on the index and post the result back? Any prior exceptions when creating this index? I don't think adding a bounds check to FieldInfos makes sense -- the best we could do is throw a FieldNumberOutOfBounds exception.

IndexOutOfBoundsException in FieldInfos.java
Key: LUCENE-2327
URL: https://issues.apache.org/jira/browse/LUCENE-2327
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 3.0.1
Environment: Fedora 12
Reporter: Shane
Priority: Minor

When retrieving the scoreDocs from a MultiSearcher, the following exception is thrown:

java.lang.IndexOutOfBoundsException: Index: 52, Size: 4
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:285)
at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:274)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179)
at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:911)
at org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:644)

The error is caused when the fieldNumber passed to FieldInfos.fieldInfo() is greater than the size of the array list containing the FieldInfo values. I am not sure what the field number represents or why it would be larger than the array list's size. The quick fix would be to validate the bounds, but there may be a bigger underlying problem. The issue does appear to be directly related to LUCENE-939. I've only been able to duplicate this in my production environment and so can't give a good test case.
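For anyone following along, CheckIndex is a command-line tool shipped in the core jar; an invocation looks roughly like this (the jar name and index path are placeholders for your setup):

```
java -ea:org.apache.lucene... -cp lucene-core-3.0.1.jar \
    org.apache.lucene.index.CheckIndex /path/to/index
```

The -ea flag enables assertions for the Lucene packages, which CheckIndex recommends so that additional corruption checks run during the scan.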
Changes in SVN (backwards-compatibility branch removed, Snowball test data)
Hi all,

Yesterday morning I committed https://issues.apache.org/jira/browse/LUCENE-2326 - if you currently have checked-out Lucene repositories, I recommend doing the following:

(1) Check if you have changes in your backwards folder; if yes, create a patch (use svn diff inside the branch checkout, i.e. inside backwards/lucene_3_0_back_compatibility_tests).

(2a) If you have not updated svn to HEAD:
- run ant clean-backwards; if this fails you are already on HEAD and this task is gone, use (2b)
- rm -rf contrib/analyzers/common/src/test/org/apache/lucene/analysis/snowball/data
- svn up

(2b) If you are already on HEAD:
- rm -rf backwards/lucene*
- rm -rf contrib/analyzers/common/src/test/org/apache/lucene/analysis/snowball/data
- svn up

(3) If applicable, apply the patch of your changes using patch -p0 inside backwards, not backwards/src.

(4) Check everything is correct:
- backwards should only contain a readme and a src/ folder
- during svn up it should also print a message that the external snowball data is updated to rev 500 (currently).

In future, there is no need to have revision numbers or separate commits for changing backwards tests. Just edit in your local checkout and commit in one go. It's also possible for changes to be included in patches, as it's now only one checkout. After releasing a new Lucene version, proceed as described in the ReleaseToDo on the Wiki to update the backwards folder.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846821#action_12846821 ]

Shai Erera commented on LUCENE-2328:

Would that mean removing files from synced whenever 'deleter' (which is an IndexFileDeleter) calls delete*? Are there other places to look for?

IndexWriter.synced field accumulates data leading to a Memory Leak
---
Key: LUCENE-2328
URL: https://issues.apache.org/jira/browse/LUCENE-2328
Re: Baby steps towards making Lucene's scoring more flexible...
On Mon, Mar 15, 2010 at 7:49 PM, Marvin Humphrey mar...@rectangular.com wrote:

On Mon, Mar 15, 2010 at 05:28:33AM -0500, Michael McCandless wrote: I mean specifically one should not have to commit to the precise scoring model they will use for a given field, when they index that field.

Yeah, I've never seen committing to a precise scoring model at index-time via Sim choice as a big deal. In Lucy, per-field Similarity assignments are part of the Schema, which has to be set at index-time. And index-time Sim choice is the way things have always been done in Lucene.

OK. It's new territory -- I haven't heard of users doing lots of scoring experimentation with Lucene. But, then, it's not easy to do now, so... chicken and egg. Also, will Lucy store the original stats? Ie so the chosen Sim can properly recompute all boost bytes (if it uses those), for scoring models that pivot based on avgs of these stats?

In any case, the proposal to start delaying Sim choice to search-time -- while a nice feature for Lucene -- is a non-starter for Lucy. We can't do that because it would kill the cheap-Searcher model to generate boost bytes at Searcher construction time and cache them within the object. We need those boost bytes written to disk so we can mmap them and share them amongst many cheap Searchers.

It'd seem like Lucy could re-gen the boost bytes if a different Sim were selected, or the current Sim hadn't yet computed/cached its bytes? But then logically this means a reader needs write permission to the index dir, which is not good...

So... you're proposing shrinking Similarity's public API by removing functionality that Lucy can't live without. If indeed that works out for Lucene, the role of Similarity within the two libraries will have to diverge. In Lucene, Similarity will get smaller; in Lucy it will expand a bit.

Yes.

To my mind, these are all related data reduction tasks:
* Omit doc-boost and field-boost, replacing them with a single float docXfield multiplier -- because you never need doc-boost on its own.
* Omit length-in-tokens, term-cardinality, doc-boost, and field-boost, replacing them all with a single boost byte -- because for the kind of scoring you want to do, you don't need all those raw stats.
* Omit the boost byte, because you don't need to do scoring at all.
* Omit positions because you don't need PhraseQueries, etc. to match.

I wouldn't group this one with the others -- I mean, technically it is data reduction -- but omitting positions means certain queries (PhraseQuery) won't work even in match-only searching. Whereas the rest of these examples affect how scoring is done (or whether it's done).

* Omit everything except doc-id, because you only need binary matching.

What all those tasks have in common is that we can determine which stats are disposable based on how the user describes how they are going to use the field. For Lucy, the user is going to have to commit to a precise scoring model at index-time by specifying a Sim choice anyway.

Right.

If that Sim turns out to be a MatchSimilarity, why on earth should we keep around the boost bytes?

Well, maybe some queries do scoring on the field and some don't...

And what class other than Similarity knows enough about the scoring algorithm to perform these data reduction tasks? If it's not going to be Similarity itself, it has to be something that knows absolutely everything about the Similarity implementation's scoring model.

I don't follow this... It will be Sim that computes norm bytes.

I meant that if you're writing out boost bytes, there's no sensible way to execute the lossy data reduction and reduce the index size other than having Sim do it.

Right, Sim is the right class to do this. Heck, one could even use boost nibbles... or use floats. This is an impl detail of the Sim class.

class MySim extends Similarity {
  public PostingCodec makePostingCodec() {
    StandardPostingCodec codec = new StandardPostingCodec();
    codec.setOmitBoostBytes(true);
    codec.setOmitPositions(true);
    return (PostingCodec) codec;
  }
}

This still feels like you are mixing two very different concepts -- what's being written (boost bytes, positions, docTermFreqs) vs how it's encoded (codec).

So StandardPostingCodec shouldn't have methods like setOmitBoostBytes()? Maybe that's right. Guess I'll watch to see how flex pans out and what methods you put on those PostingCodec classes.

Yeah, I see that (setOmitBoostBytes) as part of the field's type. It's like precisionStep for a numeric field, or omitTF/P. Any codec should respect these.

For now, I just want to make the no-boost-bytes and doc-id-only index optimizations available, and to achieve that, it's sufficient to implement format-follows-sim and publish MatchSimilarity and MinimalSimilarity. The PostingCodec API can remain a private implementation detail until a later
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846825#action_12846825 ]

Michael McCandless commented on LUCENE-2328:

Yes I think that's it.

IndexWriter.synced field accumulates data leading to a Memory Leak
---
Key: LUCENE-2328
URL: https://issues.apache.org/jira/browse/LUCENE-2328
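A self-contained sketch of the fix being agreed on here (the class and method names are invented for illustration; the real change would hook the removal into IndexFileDeleter):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: prune the 'synced' set whenever an index file is deleted,
// so the set can no longer accumulate names of files that are gone.
class SyncTracker {
    private final Set<String> synced = new HashSet<String>();

    /** Called after a file has been fsync'd. */
    void fileSynced(String name) { synced.add(name); }

    /** Called by the deleter whenever it removes an index file. */
    void fileDeleted(String name) { synced.remove(name); }

    boolean isSynced(String name) { return synced.contains(name); }

    int size() { return synced.size(); }
}
```

With this pruning in place, the set's size stays bounded by the number of live index files (10 in the report) instead of growing to the 32618 stale entries seen in the heap dump.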
Re: How can I use QueryScorer() to find only perfect matches??
Hi Erick, I did as recommended and changed the query approprietly. But the result is still the same. On page 78 in the book lucene in action it is explained how scoring is working. Therefore I get more results than the exact match I was expecting. But how can I highlight in a large document only the results identified by a certain query like +contents:term +contents:query? Are there any alternatives to the QueryScore method? any examples? any papers to read first? thx christian Erick Erickson wrote: Try +contents:term +contents:query. By misplacing the '+' you're getting the default OR operator and the '+' is probably being thrown away by the analyzer. Luke will help here a lot. HTH Erick On Mon, Mar 15, 2010 at 9:46 AM, christian stadler stadler.christ...@web.de wrote: Hi there, I have an issue with the QueryScorer(query) method at the moment and I need some assistance. I was indexing my e-book lucene in action and based on this index-db I started to play around with some boolean queries like: (contents:+term contents:+query) As a result I'm expecting as a perfect match for the phrase term query four hits. But when I run my sample to highlight this phrase in the context then I get a lot more results. It also finds all the matches for term and query independently. I think the problem is the QueryScorer() which softens the former exact boolean query. 
Then I was trying the following:

private static Highlighter GetHits(Query query, Formatter formatter)
{
    string field = "contents";
    BooleanQuery termsQuery = new BooleanQuery();
    WeightedTerm[] terms = QueryTermExtractor.GetTerms(query, true, field);
    foreach (WeightedTerm term in terms)
    {
        TermQuery termQuery = new TermQuery(new Term(field, term.GetTerm()));
        termsQuery.Add(termQuery, BooleanClause.Occur.MUST);
    }
    // create query scorer based on term queries (field specific)
    QueryScorer scorer = new QueryScorer(termsQuery);
    Highlighter highlighter = new Highlighter(formatter, scorer);
    highlighter.SetTextFragmenter(new SimpleFragmenter(20));
    return highlighter;
}

to rewrite the query and set each term clause from SHOULD to MUST, but the result was the same. Do you have any example of how I can use the QueryScorer() to exactly mimic a boolean search? thanks in advance Christian

-- View this message in context: http://old.nabble.com/How-can-I-use-QueryScorer%28%29-to-find-only-perfect-matches---tp27904831p27943914.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846835#action_12846835 ] Earwin Burrfoot commented on LUCENE-2328: A shot in the dark (I didn't delve deep into the problem, so I could definitely be missing stuff): what about tracking 'syncedness' from within Directory? There shouldn't be more than one writer anyway (unless your locking is broken), so there is a single set of 'files-to-be-synced' at any given moment. We might as well keep track of it inside the directory and have a syncAllUnsyncedGuys() on it. This would also remove the need to transfer that list around when transferring the write lock (IR hell). All round, this sounds quite logical, as the need for and method of syncing depends solely on the directory. If you're working with RAMDirectory, you don't need to keep track of these files at all; probably the same for some of the DB impls. Also, some filesystems sync everything when you ask to sync a single file, so if you're syncing a batch of them in a row, that's overhead you could theoretically work around with a special flag to FSDir.
[jira] Updated: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)
[ https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2302: Attachment: LUCENE-2302.patch

Updated patch for current flex HEAD. Backwards still needs to be fixed. How do we want to proceed? - Name of the new attribute? - Is the new CharSequence/Appendable API fine - setEmpty()? Thanks for reviewing!

Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)
Key: LUCENE-2302
URL: https://issues.apache.org/jira/browse/LUCENE-2302
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
Affects Versions: Flex Branch
Reporter: Uwe Schindler
Fix For: Flex Branch
Attachments: LUCENE-2302.patch, LUCENE-2302.patch, LUCENE-2302.patch, LUCENE-2302.patch, LUCENE-2302.patch

For flexible indexing, terms can be simple byte[] arrays, while the current TermAttribute only supports char[]. This is fine for plain text, but e.g. NumericTokenStream should work directly on the byte[] array. TermAttribute also lacks some interfaces that would make it simpler for users to work with: Appendable and CharSequence. I propose to create a new interface, CharTermAttribute, with a clean new API that concentrates on CharSequence and Appendable. The implementation class will simply support both the old and new interfaces working on the same term buffer. DEFAULT_ATTRIBUTE_FACTORY will take care of this, so if somebody adds a TermAttribute, he will get an implementation class that can also be used as a CharTermAttribute. As both attributes create the same impl instance, both calls to addAttribute are equal, so a TokenFilter that adds CharTermAttribute to the source will work with the same instance as the Tokenizer that requested the (deprecated) TermAttribute. To also support byte[]-only terms, as Collation or NumericField needs, a separate getter-only interface will be added that returns a reusable BytesRef, e.g. BytesRefGetterAttribute.
The default implementation class will also support this interface. For backwards compatibility with old self-made TermAttribute implementations, the indexer will check with hasAttribute() whether the BytesRef getter interface is there, and if not will wrap an old-style TermAttribute (a deprecated wrapper class will be provided): new BytesRefGetterAttributeWrapper(TermAttribute), which is then used by the indexer.
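The shared-implementation trick described above can be sketched in plain Java. This is a hypothetical, heavily simplified model (toy interfaces and a plain map, not the real Lucene AttributeSource/AttributeFactory): one impl class implements both the old and the new attribute interface, and the source registers the single instance under both keys, so both addAttribute calls return the same object.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of "one impl class, two attribute interfaces" (invented names).
public class SharedAttributeDemo {
    interface TermAttr { String term(); }                            // "old" interface
    interface CharTermAttr extends CharSequence { void setEmpty(); } // "new" interface

    // Single implementation backing both interfaces on the same buffer.
    static class CharTermAttrImpl implements TermAttr, CharTermAttr {
        private final StringBuilder buf = new StringBuilder();
        public String term() { return buf.toString(); }
        public void setEmpty() { buf.setLength(0); }
        public int length() { return buf.length(); }
        public char charAt(int i) { return buf.charAt(i); }
        public CharSequence subSequence(int s, int e) { return buf.subSequence(s, e); }
        public CharTermAttrImpl append(CharSequence s) { buf.append(s); return this; }
    }

    private final Map<Class<?>, Object> attributes = new HashMap<>();

    // The "factory" registers one shared instance under every interface it serves,
    // so a filter asking for CharTermAttr and a tokenizer asking for TermAttr
    // end up with the very same object.
    @SuppressWarnings("unchecked")
    <T> T addAttribute(Class<T> iface) {
        Object existing = attributes.get(iface);
        if (existing != null) return (T) existing;
        CharTermAttrImpl impl = new CharTermAttrImpl();
        attributes.put(TermAttr.class, impl);
        attributes.put(CharTermAttr.class, impl);
        return (T) impl;
    }

    public static void main(String[] args) {
        SharedAttributeDemo src = new SharedAttributeDemo();
        TermAttr old = src.addAttribute(TermAttr.class);
        CharTermAttr neu = src.addAttribute(CharTermAttr.class);
        System.out.println(old == neu);  // both calls yield the shared instance
        ((CharTermAttrImpl) old).append("foo");
        System.out.println(old.term() + " " + neu.length());
    }
}
```

The point of the registration under both keys is exactly the backwards-compatibility property described in the issue: code written against the deprecated interface and code written against the new one mutate the same term buffer.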
[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals
[ https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846852#action_12846852 ] Robert Muir commented on LUCENE-2326: bq. What did you do for this to happen? Uwe, the problem happened to Mark... and this test data has *always* been rev 500. svn.exe simply got the wrong revision. It's probably a bug in svn; I don't think you did anything wrong. But at the same time, we don't want random test failures.

Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals
Key: LUCENE-2326
URL: https://issues.apache.org/jira/browse/LUCENE-2326
Project: Lucene - Java
Issue Type: Improvement
Reporter: Uwe Schindler
Assignee: Uwe Schindler
Fix For: Flex Branch, 3.1
Attachments: LUCENE-2326.patch, LUCENE-2326.patch

As we often need to update backwards tests together with trunk, and always have to update the branch first, record the revision number, and update build.xml, I would simply like to do an svn copy/move of the backwards branch. After a release, this is simply done:
{code}
svn rm backwards
svn cp releasebranch backwards
{code}
This way we can commit in one pass and create patches in one pass. The snowball tests are currently downloaded by svn.exe, too; these need a fixed revision for checkout. I would like to change this to use svn:externals. Will provide a patch soon.
[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals
[ https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846860#action_12846860 ] Uwe Schindler commented on LUCENE-2326: Man, I reverted the snowball part. Let's change to a zip file, as the tests will never change. This svn in build.xml is too dependent on your local installation of the svn tools. I don't like it.
Re: svn commit: r924731 - in /lucene/java/trunk/contrib/analyzers/common: build.xml src/test/org/apache/lucene/analysis/snowball/ src/test/org/apache/lucene/analysis/snowball/TestSnowballVocab.java
E, let's strive for slightly better commit messages ;-) -Yonik

On Thu, Mar 18, 2010 at 7:48 AM, uschind...@apache.org wrote: Author: uschindler Date: Thu Mar 18 11:48:11 2010 New Revision: 924731 URL: http://svn.apache.org/viewvc?rev=924731&view=rev Log: LUCENE-2326: As rmuir seems to bug me about that, i reverted the externals def here. In future, lets use a zip file.
[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals
[ https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846863#action_12846863 ] Robert Muir commented on LUCENE-2326: bq. Lets change to a zip file as the tests will never change I agree, but this zip file will be pretty large! Thanks for temporarily changing it to do the checkout instead.
RE: svn commit: r924731 - in /lucene/java/trunk/contrib/analyzers/common: build.xml src/test/org/apache/lucene/analysis/snowball/ src/test/org/apache/lucene/analysis/snowball/TestSnowballVocab.java
I am currently unhappy with lucene because of: - LuSolr - communication differences

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-Original Message- From: Yonik Seeley [mailto:ysee...@gmail.com] Sent: Thursday, March 18, 2010 12:51 PM To: java-dev@lucene.apache.org Subject: Re: svn commit: r924731 - in /lucene/java/trunk/contrib/analyzers/common: build.xml src/test/org/apache/lucene/analysis/snowball/ src/test/org/apache/lucene/analysis/snowball/TestSnowballVocab.java

E, let's strive for slightly better commit messages ;-) -Yonik
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846872#action_12846872 ] Michael McCandless commented on LUCENE-2328: I like this idea! But we don't want to simply sync all new files: when IW commits, it's possibly a subset of all new files. E.g. running merges (or any still-open files) should not be sync'd. Not necessarily all closed files should be sync'd either -- e.g. any files that were opened and closed while we were syncing (since syncing can take some time) should not then be sync'd. Maybe we change Dir.sync to take a Collection<String>? Then dir would be the one place that keeps track of what's already been sync'd and what hasn't. Or... I wonder if calling sync on a file that's already been sync'd is really that wasteful. I mean, it's technically a no-op, so it's just the overhead of a no-op system call from way up in javaland.
[jira] Commented: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)
[ https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846877#action_12846877 ] Michael McCandless commented on LUCENE-2302: I like the name CharTermAttribute. How about instead of TermToBytesRefAttribute we name it TermBytesAttribute? (Ie, drop the To and Ref).
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846880#action_12846880 ] Earwin Burrfoot commented on LUCENE-2328: bq. E.g. running merges (or any still-open files) should not be sync'd. Files that are still being written should not be synced; that's kinda obvious. bq. Not necessarily all closed files should be sync'd either - e.g. any files that were opened and closed while we were syncing (since syncing can take some time) should not then be sync'd. This one is not so obvious. I assume that on calling syncEveryoneAndHisDog() you should sync all files that have been written to and closed, and not yet deleted. bq. Maybe we change Dir.sync to take a Collection<String>? What does that alone give us over the current situation? You can call Dir.sync() repeatedly; it's all the same. bq. Or... I wonder if calling sync on a file that's already been sync'd is really that wasteful... It can be, on those systems that just sync down everything. I don't believe in people writing good software : }
Re: How can I use QueryScorer() to find only perfect matches??
Unfortunately, the highlighter (and I think also the fast vector highlighter) can return a set of fragments which do not match the query (e.g., they only show one of the two required terms). I really don't like that they do this. Ideally (to me) the entire excerpt (i.e., all fragments appended together) should match the original query, meaning I see at least one occurrence of each required term (the occurrences could fall in different fragments). Progress has been made in general -- e.g. it used to be the case that if you highlighted a phrase query, e.g. "president obama", you could see excerpts that had only one of the words. That's been fixed by defaulting to QueryScorer. To really fix this for all queries is not easy... there was a long discussion here: https://issues.apache.org/jira/browse/LUCENE-1522 I think we should improve the Scorer API so that it can optionally provide positional details of all matches, probably by absorbing Span*Query back into their non-span counterparts and enriching the API. But this is a biggish change. Maybe as a stopgap you could pull many fragments from the highlighter and then pick a set of fragments that covers the most unique terms...? Sort of like a coord factor, but for highlighting instead of BooleanQuery. Is it only required clauses you need to fix? Mike

On Thu, Mar 18, 2010 at 5:43 AM, chris.stodola stadler.christ...@web.de wrote: Hi Erick, I did as recommended and changed the query appropriately. But the result is still the same.
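The stopgap suggested above (pull many candidate fragments, then pick a set covering the most unique terms) can be sketched as a greedy covering pass. This is hypothetical helper code, not a Lucene API, and naive lowercase substring matching stands in for real analyzed-token matching:

```java
import java.util.*;

// Greedy sketch: repeatedly pick the fragment that covers the most
// still-uncovered required terms, until all terms are covered or no
// fragment adds coverage.
public class FragmentPicker {
    static List<String> pick(List<String> fragments, Set<String> requiredTerms) {
        Set<String> uncovered = new HashSet<>(requiredTerms);
        List<String> chosen = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            String best = null;
            int bestGain = 0;
            for (String frag : fragments) {
                int gain = 0;
                for (String t : uncovered) {
                    if (frag.toLowerCase().contains(t.toLowerCase())) gain++;
                }
                if (gain > bestGain) { bestGain = gain; best = frag; }
            }
            if (best == null) break;  // remaining terms appear in no fragment
            chosen.add(best);
            for (Iterator<String> it = uncovered.iterator(); it.hasNext(); ) {
                if (best.toLowerCase().contains(it.next().toLowerCase())) it.remove();
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> frags = Arrays.asList(
            "a term appears here", "nothing relevant", "a query appears here");
        System.out.println(pick(frags, new HashSet<>(Arrays.asList("term", "query"))));
    }
}
```

A caller could feed in the fragments the highlighter produced and the required clause terms; if pick() cannot cover every term, the document arguably should not be shown as a "perfect" match, which is exactly the coord-like behavior Mike describes.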
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846890#action_12846890 ] Shai Erera commented on LUCENE-2328: OK, so let me see if I understand this. Before Earwin suggested adding synced to Directory, the approach (as I understood it) was: whenever the deleter deletes a file, remove it from synced as well. After Earwin's suggestion, which I like very much as it moves more stuff out of IW (which could use some simplification), I initially thought we should do this: when dir.sync is called, add that file to dir.synced; when dir.delete is called, remove it from there; when dir.commit is called, add all changed/synced files to the set (probably all of them). Something very straightforward and simple. However, the last two posts seem to complicate it, and I don't understand why, so I'd appreciate it if you can explain what I am missing.
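The straightforward bookkeeping described in this comment (sync adds to a per-directory set, delete removes from it, already-synced files are skipped) can be modeled with a toy class. This is plain Java with invented names, not actual Lucene code:

```java
import java.util.*;

// Toy model of directory-level sync tracking: the directory itself remembers
// which files are synced, forgets them on delete, and sync() only performs
// the (simulated) expensive fsync for files not yet synced.
public class TrackingDirectory {
    private final Set<String> files = new HashSet<>();
    private final Set<String> synced = new HashSet<>();
    int physicalSyncs = 0;  // counts simulated fsync calls

    void createFile(String name) { files.add(name); synced.remove(name); }

    void deleteFile(String name) { files.remove(name); synced.remove(name); }

    // Mirrors the "Dir.sync(Collection<String>)" idea from the thread:
    // only files that exist and are not yet synced trigger a physical sync.
    void sync(Collection<String> names) {
        for (String name : names) {
            if (files.contains(name) && synced.add(name)) {
                physicalSyncs++;  // the expensive fsync would happen here
            }
        }
    }

    int trackedCount() { return synced.size(); }

    public static void main(String[] args) {
        TrackingDirectory dir = new TrackingDirectory();
        dir.createFile("_1.cfs");
        dir.createFile("_1.frq");
        dir.sync(Arrays.asList("_1.cfs", "_1.frq"));  // two real syncs
        dir.sync(Arrays.asList("_1.cfs"));            // skipped: already synced
        dir.deleteFile("_1.cfs");                     // drops the tracking entry too
        System.out.println(dir.physicalSyncs + " " + dir.trackedCount());
    }
}
```

Because deleteFile() removes the tracking entry, the set can never outgrow the set of live files, which is precisely what fixes the unbounded growth reported in this issue.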
[jira] Created: (LUCENE-2330) Allow easy extension of IndexWriter
Allow easy extension of IndexWriter
Key: LUCENE-2330
URL: https://issues.apache.org/jira/browse/LUCENE-2330
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Shai Erera
Fix For: 3.1

IndexWriter is not so easy to extend. It hides a lot of useful methods from extending classes, as well as useful members (like infoStream). Most of this stuff is very straightforward, and I believe there is no particular reason it isn't exposed. Over in LUCENE-1879 I plan to extend IndexWriter to provide a ParallelWriter which will support the parallel indexing requirements. For that I'll need access to several methods and members. I plan to include in this issue some simple hooks, nothing fancy (and hopefully nothing controversial); I'll leave the rest to specific issues. For now:
# Introduce a protected default constructor and init(Directory, IndexWriterConfig). That's required because ParallelWriter does not itself index anything, but instead delegates to its slices. So that ctor is for convenience only, and I'll make it clear (through javadocs) that if one uses it, one needs to call init(). PQ has the same pattern.
# Expose some members and methods that are useful for extensions (such as config, infoStream etc.). Some candidates are package-private methods, but these will be reviewed and converted on a case-by-case basis.
I don't plan to do anything drastic here, just prepare IW for easier extendability. I'll post a patch after LUCENE-2320 is committed.
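The protected-default-constructor-plus-init() pattern from point 1 can be sketched with illustrative names (nothing here is the real IndexWriter API): the subclass uses the protected constructor so it never opens anything itself, and simply delegates to child writers.

```java
// Minimal sketch of the extension pattern: a normal constructor that calls
// init(), plus a protected no-arg constructor for subclasses that delegate
// instead of opening a resource themselves.
public class ExtensionPatternDemo {
    static class Writer {
        protected String dir;
        Writer(String dir) { init(dir); }  // normal construction path
        protected Writer() { }             // subclasses must call init() themselves
        protected final void init(String dir) { this.dir = dir; }
        String describe() { return "writer on " + dir; }
    }

    // Delegating subclass: uses the protected ctor, never opens a dir itself.
    static class ParallelWriter extends Writer {
        private final java.util.List<Writer> slices = new java.util.ArrayList<>();
        ParallelWriter() { super(); }
        void addSlice(Writer w) { slices.add(w); }
        @Override String describe() { return "parallel over " + slices.size() + " slices"; }
    }

    public static void main(String[] args) {
        ParallelWriter pw = new ParallelWriter();
        pw.addSlice(new Writer("dirA"));
        pw.addSlice(new Writer("dirB"));
        System.out.println(pw.describe());
    }
}
```

This mirrors the PriorityQueue precedent mentioned in the issue: the no-arg constructor is safe only under the documented contract that init() is called before use.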
[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException
[ https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846892#action_12846892 ] Ritesh Nigam commented on LUCENE-2280: I installed a test setup with lucene 3.0.0 and tried to reproduce the NPE scenario. There, after the exception is thrown, the main index file is not deleted; only optimize fails, and I can see some small index files (.cfs) alongside the main index file. One more thing: I am not using commit() yet, only close(). Does close() do the same thing as commit()? Given the above behavior, is there a bug in the 2.3.2 version where this kind of situation is not handled properly? Can you please have a look at the log I got after turning on the infoStream for IndexWriter (for lucene 2.3.2)? Attached as lucene.zip.

IndexWriter.optimize() throws NullPointerException
Key: LUCENE-2280
URL: https://issues.apache.org/jira/browse/LUCENE-2280
Project: Lucene - Java
Issue Type: Bug
Components: Index
Affects Versions: 2.3.2
Environment: Win 2003, lucene version 2.3.2, IBM JRE 1.6
Reporter: Ritesh Nigam
Attachments: lucene.jar, lucene.zip

I am using lucene 2.3.2 search APIs in my application. I am indexing a 45GB database, which creates an approximately 200MB index file. After finishing the indexing, while running optimize(), I see a NullPointerException in my log and the index file gets corrupted. The log says:

Caused by: java.lang.NullPointerException
at org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:49)
at org.apache.lucene.store.IndexOutput.writeBytes(IndexOutput.java:40)
at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:566)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:135)
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3273)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2968)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240)

This is happening quite frequently, although I am not able to reproduce it on demand. I saw an issue logged which is somewhat related to mine (http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200809.mbox/%3c6e4a40db-5efc-42da-a857-d59f4ec34...@mikemccandless.com%3e), but the only difference is that I am not using Store.Compress for my fields; I am using Store.NO instead. Please note that I am using the IBM JRE for my application. Is this an issue with lucene? If yes, in which version is it fixed?
[jira] Updated: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals
[ https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-2326:
--
Attachment: TestVnowballVocabData.zip
            LUCENE-2326-snowball-try2.patch

Here is the patch without external references. The data dir was cleaned up (removed the large, unneeded diff.txt files) and the zip was compressed with -9.

Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals
---
Key: LUCENE-2326 URL: https://issues.apache.org/jira/browse/LUCENE-2326 Project: Lucene - Java Issue Type: Improvement Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: Flex Branch, 3.1 Attachments: LUCENE-2326-snowball-try2.patch, LUCENE-2326.patch, LUCENE-2326.patch, TestVnowballVocabData.zip

Since we often need to update the backwards tests together with trunk, and currently always have to update the branch first, record the revision number, and update build.xml, I would simply like to svn copy/move the backwards branch. After a release this is just as simple:

{code}
svn rm backwards
svn cp releasebranch backwards
{code}

This lets us commit in one pass and create patches in one pass. The snowball tests are currently also downloaded with svn.exe and need a fixed revision for checkout; I would like to change this to use svn:externals. Will provide a patch soon.
[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals
[ https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846897#action_12846897 ] Uwe Schindler commented on LUCENE-2326:
---
Sorry, the ZIP file has the wrong name. Fixed here locally (test + zip).
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846899#action_12846899 ] Earwin Burrfoot commented on LUCENE-2328:
-
I'm proposing something even more dead simple:
1. We remove Directory.sync(String) completely.
2. Each time you call IndexOutput.close(), the Directory adds that file to its internal set (if it cares about syncing at all).
3. If you call Directory.delete(), it also removes the file from the set (though that is not strictly necessary).
4. When you commit at IW, it calls Directory.sync() and everything in the internal set gets synced.

IndexWriter.synced field accumulates data leading to a Memory Leak
---
Key: LUCENE-2328 URL: https://issues.apache.org/jira/browse/LUCENE-2328 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1 Environment: all Reporter: Gregor Kaczor Priority: Minor Fix For: 3.1 Original Estimate: 1h Remaining Estimate: 1h

I am running into a strange OutOfMemoryError. My small test application indexes and deletes a few files, repeated 60,000 times; optimization is run after every 2,000 files indexed. The index size is 50KB. I analyzed the heap dump and realized that the IndexWriter.synced field occupied more than half of the heap. That field is a private HashSet without a getter; its task is to hold files which have already been synced. There are two calls to addAll and one call to add on synced, but no remove or clear throughout the lifecycle of the IndexWriter instance. According to the Eclipse Memory Analyzer, synced contains 32618 entries which look like file names (_e065_1.del, _e067.cfs), while the index directory contains only 10 files. I guess synced is holding obsolete data.
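The four-step proposal above can be sketched with a minimal, hypothetical TrackingDirectory (invented names for illustration only; this is not the actual Lucene Directory API):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposal: the directory itself remembers
// which files were written (closed) but not yet fsync'ed, and syncs
// them all in one call at commit time.
class TrackingDirectory {
    private final Set<String> unsynced = new HashSet<>();

    // Step 2: each IndexOutput.close() registers its file.
    void onOutputClosed(String fileName) {
        unsynced.add(fileName);
    }

    // Step 3: deleting a file also drops it from the set.
    void deleteFile(String fileName) {
        unsynced.remove(fileName);
    }

    // Step 4: commit syncs everything pending, then clears the set,
    // so it never accumulates entries across commits.
    Set<String> syncAll() {
        Set<String> synced = new HashSet<>(unsynced);
        // ... fsync each file here ...
        unsynced.clear();
        return synced;
    }

    int pendingCount() {
        return unsynced.size();
    }
}
```

Because the set is cleared on every commit, it cannot grow without bound the way the IndexWriter.synced field described in this issue does.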
[jira] Commented: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals
[ https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846900#action_12846900 ] Robert Muir commented on LUCENE-2326:
-
Thanks Uwe, this simplifies our tests. It's nice to remove a network connection (it seems reliable so far, but...).
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846902#action_12846902 ] Earwin Burrfoot commented on LUCENE-2328:
-
By the way, the initial problem stems from the fact that IW/IR keeps track of the files it *has already* synced, instead of the files it *has not yet* synced. That is kind of upside down and requires upkeep, unlike the straightforward approach in which the set is cleared anew after each commit call. I can conjure up a patch in a day or two.
[jira] Commented: (LUCENE-2330) Allow easy extension of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846903#action_12846903 ] Earwin Burrfoot commented on LUCENE-2330:
-
Please, only open something up if you decorate it with @experimental / @will.change.without.single.warning annotations like a Christmas tree. With Lucene's freakish back-compat policy you want to have as few things public as possible :)

Allow easy extension of IndexWriter
---
Key: LUCENE-2330 URL: https://issues.apache.org/jira/browse/LUCENE-2330 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Fix For: 3.1

IndexWriter is not so easy to extend. It hides a lot of useful methods and members (like infoStream) from extending classes. Most of this stuff is very straightforward, and I believe there is no particular reason it isn't exposed. Over in LUCENE-1879 I plan to extend IndexWriter to provide a ParallelWriter which will support the parallel indexing requirements; for that I'll need access to several methods and members. I plan to contain in this issue some simple hooks, nothing fancy (and hopefully uncontroversial); I'll leave the rest to specific issues. For now:
# Introduce a protected default constructor and init(Directory, IndexWriterConfig). That's required because ParallelWriter does not itself index anything, but instead delegates to its slices. That ctor is for convenience only, and I'll make it clear (through javadocs) that if one uses it, one needs to call init(). PQ has the same pattern.
# Expose some members and methods that are useful for extensions (such as config, infoStream etc.). Some candidates are package-private methods, but these will be reviewed and converted on a case-by-case basis.
I don't plan to do anything drastic here, just prepare IW for easier extensibility. I'll post a patch after LUCENE-2320 is committed.
[jira] Updated: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java
[ https://issues.apache.org/jira/browse/LUCENE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane updated LUCENE-2327:
--
Attachment: CheckIndex.txt

CheckIndex output generated by Luke v1.0.0.

IndexOutOfBoundsException in FieldInfos.java
Key: LUCENE-2327 URL: https://issues.apache.org/jira/browse/LUCENE-2327 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.1 Environment: Fedora 12 Reporter: Shane Priority: Minor Attachments: CheckIndex.txt

When retrieving the scoreDocs from a MultiSearcher, the following exception is thrown:

java.lang.IndexOutOfBoundsException: Index: 52, Size: 4
 at java.util.ArrayList.rangeCheck(ArrayList.java:571)
 at java.util.ArrayList.get(ArrayList.java:349)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:285)
 at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:274)
 at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86)
 at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
 at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
 at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232)
 at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179)
 at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:911)
 at org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:644)

The error occurs when the fieldNumber passed to FieldInfos.fieldInfo() is greater than the size of the array list containing the FieldInfo values. I am not sure what the field number represents or why it would be larger than the array list's size. The quick fix would be to validate the bounds, but there may be a bigger underlying problem. The issue does appear to be directly related to LUCENE-939. I've only been able to duplicate this in my production environment and so can't give a good test case.
[jira] Commented: (LUCENE-2330) Allow easy extension of IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846906#action_12846906 ] Shai Erera commented on LUCENE-2330:
Sure, I'll annotate whatever is needed for PI (e.g. protected/public but still for internal use) as @lucene.experimental. After we see more than one extension of IW, we can decide whether those APIs need to be made 'public' in essence (i.e. without the annotation). I've been burned plenty of times with the bw policy :).
[jira] Commented: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java
[ https://issues.apache.org/jira/browse/LUCENE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846912#action_12846912 ] Shane commented on LUCENE-2327:
---
The index is relatively old and doesn't appear to have been modified for a number of years. I can't say for certain about prior exceptions. If the CheckIndex results provide any more details, then great. Regardless, I'm willing to chalk this up to a system-specific error and close the ticket. I was able to fix the index using Luke.
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846913#action_12846913 ] Shai Erera commented on LUCENE-2328:
How would IndexOutput report back to the Directory when its close() is called? I've checked a couple of Directories, and when they openOutput, they don't pass themselves to the IndexOutput. I think what you say makes sense, but I don't see how it can be implemented with the current implementations (and without relying on broken Directory impls out there - broken in the sense that they don't expect to get any notification from IndexOutput.close()). Other than that, I like the approach. Also, about what you wrote on IW keeping track of already-synced files: I guess you'll change that when it moves into Directory, so that it tracks the files it hasn't synced yet?
[jira] Resolved: (LUCENE-2326) Remove SVN.exe and revision numbers from build.xml by svn-copy the backwards branch and linking snowball tests by svn:externals
[ https://issues.apache.org/jira/browse/LUCENE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-2326.
---
Resolution: Fixed
Committed revision: 924781 (with correct zip file name)
[jira] Created: (LUCENE-2331) Add NoOpMergePolicy
Add NoOpMergePolicy
---
Key: LUCENE-2331 URL: https://issues.apache.org/jira/browse/LUCENE-2331 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Shai Erera Fix For: 3.1

I'd like to add a simple and useful MergePolicy implementation which does nothing! :) I've come across many places where the following is either documented or implemented: if you want to prevent merges, set mergeFactor to a high enough value. I think a NoOpMergePolicy is just as good, and can REALLY disable merges (rather than, say, setting mergeFactor to Integer.MAX_VALUE). As such, NoOpMergePolicy will be introduced as a singleton and can be used for convenience purposes only. It's also important for Parallel Index, because I'd like the slices to never do any merges unless ParallelWriter decides so; they should therefore be set up with that MP. I have a patch ready; I'm waiting for LUCENE-2320 to go in so that I don't need to change it afterwards. About the name: I like it, but suggestions are welcome. I thought of NullMergePolicy, but I don't like 'Null' used for a no-op.
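The proposed singleton could look roughly like the following sketch. The SimpleMergePolicy interface here is invented for illustration; the real Lucene MergePolicy API has a different, richer signature:

```java
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified merge-policy interface just for illustration.
interface SimpleMergePolicy {
    // Given the current segment names, return the groups of segments to merge.
    List<List<String>> findMerges(List<String> segments);
}

// A policy of "no merges": always returns an empty merge spec. It is
// stateless, so a single shared instance suffices, as the issue proposes.
final class NoOpMergePolicy implements SimpleMergePolicy {
    static final NoOpMergePolicy INSTANCE = new NoOpMergePolicy();
    private NoOpMergePolicy() {}

    @Override
    public List<List<String>> findMerges(List<String> segments) {
        return Collections.emptyList();  // never select anything to merge
    }
}
```

Compared with setting mergeFactor to a huge value, this makes the "never merge" intent explicit in the type rather than relying on a tuning parameter.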
[jira] Commented: (LUCENE-2331) Add NoOpMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846926#action_12846926 ] Earwin Burrfoot commented on LUCENE-2331:
-
NoMergesPolicy - that's exactly what it is, a policy of no merges.
[jira] Commented: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)
[ https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846933#action_12846933 ] Uwe Schindler commented on LUCENE-2302:
---
bq. How about instead of TermToBytesRefAttribute we name it TermBytesAttribute? (Ie, drop the To and Ref).
This attribute is special: it only has this getter for the BytesRef. If we need a real BytesTermAttribute, it should be explicitly defined. Still open are NumericTokenStream and so on.

Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)
Key: LUCENE-2302 URL: https://issues.apache.org/jira/browse/LUCENE-2302 Project: Lucene - Java Issue Type: Improvement Components: Analysis Affects Versions: Flex Branch Reporter: Uwe Schindler Fix For: Flex Branch Attachments: LUCENE-2302.patch, LUCENE-2302.patch, LUCENE-2302.patch, LUCENE-2302.patch, LUCENE-2302.patch

For flexible indexing, terms can be simple byte[] arrays, while the current TermAttribute only supports char[]. This is fine for plain text, but e.g. NumericTokenStream should work directly on the byte[] array. TermAttribute also lacks some interfaces that would make it simpler for users to work with: Appendable and CharSequence. I propose to create a new interface, CharTermAttribute, with a clean new API that concentrates on CharSequence and Appendable. The implementation class will simply support both the old and new interfaces, working on the same term buffer; DEFAULT_ATTRIBUTE_FACTORY will take care of this. So if somebody adds a TermAttribute, he will get an implementation class that can also be used as a CharTermAttribute. As both attributes create the same impl instance, both calls to addAttribute are equal, so a TokenFilter that adds CharTermAttribute to the source will work with the same instance as the Tokenizer that requested the (deprecated) TermAttribute. To also support byte[]-only terms, as Collation or NumericField need, a separate getter-only interface will be added that returns a reusable BytesRef, e.g. BytesRefGetterAttribute. The default implementation class will also support this interface. For backwards compatibility with old self-made TermAttribute implementations, the indexer will check with hasAttribute() whether the BytesRef getter interface is there; if not, it will wrap an old-style TermAttribute (a deprecated wrapper class will be provided): new BytesRefGetterAttributeWrapper(TermAttribute), which is then used by the indexer.
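The CharSequence/Appendable idea can be sketched with a tiny, self-contained term buffer. CharTermBuffer and its internals are invented for illustration; the real CharTermAttribute implementation differs:

```java
// Hypothetical sketch of a term attribute that exposes its char[] buffer
// through the standard CharSequence and Appendable interfaces, as the
// issue proposes. Not the real Lucene CharTermAttribute implementation.
class CharTermBuffer implements CharSequence, Appendable {
    private char[] buf = new char[16];
    private int len = 0;

    // Grow the internal buffer so it can hold at least `needed` chars.
    private void grow(int needed) {
        if (needed > buf.length) {
            char[] bigger = new char[Math.max(needed, buf.length * 2)];
            System.arraycopy(buf, 0, bigger, 0, len);
            buf = bigger;
        }
    }

    @Override public CharTermBuffer append(CharSequence csq) {
        grow(len + csq.length());
        for (int i = 0; i < csq.length(); i++) buf[len++] = csq.charAt(i);
        return this;  // chainable, per the Appendable contract
    }

    @Override public CharTermBuffer append(CharSequence csq, int start, int end) {
        return append(csq.subSequence(start, end));
    }

    @Override public CharTermBuffer append(char c) {
        grow(len + 1);
        buf[len++] = c;
        return this;
    }

    @Override public int length() { return len; }
    @Override public char charAt(int i) { return buf[i]; }
    @Override public CharSequence subSequence(int s, int e) {
        return new String(buf, s, e - s);
    }
    @Override public String toString() { return new String(buf, 0, len); }
}
```

A TokenFilter could then write `term.append("suffix")` or pass the attribute straight to any API accepting CharSequence, instead of copying char arrays around by hand.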
[jira] Resolved: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)
[ https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-2302.
---
Resolution: Fixed
Was accidentally committed with a merge. Sorry. Revision: 924791
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846936#action_12846936 ] Andrzej Bialecki commented on LUCENE-2329: --- Slightly off-topic ... Having a facility to obtain termIDs per segment (or better yet, per index) would greatly benefit Solr's UnInverted field creation, which currently needs to assign term ids by linear scanning. Use parallel arrays instead of PostingList objects -- Key: LUCENE-2329 URL: https://issues.apache.org/jira/browse/LUCENE-2329 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.1 This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324. In order to avoid having very many long-living PostingList objects in TermsHashPerField, we want to switch to parallel arrays. The termsHash will simply be an int[] which maps each term to dense termIDs. All data that the PostingList classes currently hold will then be placed in parallel arrays, where the termID is the index into the arrays. This will avoid the need for object pooling and will remove the overhead of object initialization and garbage collection. Garbage collection especially should benefit significantly when the JVM runs low on memory, because in such a situation the GC mark times can get very long if there is a large number of long-living objects in memory. Another benefit could be to build more efficient TermVectors. We could avoid having to store the term string per document in the TermVector; instead we could just store the segment-wide termIDs. This would reduce the size and also make it easier to implement efficient algorithms that use TermVectors, because no term mapping across documents in a segment would be necessary. We can make that improvement in a separate JIRA issue, though.
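The parallel-arrays idea in the issue above can be sketched as follows: instead of allocating one posting object per term, the per-term fields become int arrays indexed by a dense termID, grown together. All names here are illustrative stand-ins; the real change lives inside TermsHashPerField.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch: per-term data held in parallel int arrays keyed by a dense termID,
// instead of one long-living PostingList object per term.
class ParallelPostings {
    private final Map<String, Integer> termToId = new HashMap<>(); // stands in for the terms hash
    private int[] docFreq = new int[8];    // parallel array #1
    private int[] lastDocID = new int[8];  // parallel array #2
    private int nextId = 0;

    int termId(String term) {
        Integer id = termToId.get(term);
        if (id == null) {
            if (nextId == docFreq.length) {            // grow all arrays together
                docFreq = Arrays.copyOf(docFreq, 2 * nextId);
                lastDocID = Arrays.copyOf(lastDocID, 2 * nextId);
            }
            id = nextId++;
            termToId.put(term, id);
        }
        return id;
    }

    void addOccurrence(String term, int docID) {
        int id = termId(term);
        if (docFreq[id] == 0 || lastDocID[id] != docID) {
            docFreq[id]++;                             // first occurrence in this doc
            lastDocID[id] = docID;
        }
    }

    int docFreq(String term) { return docFreq[termId(term)]; }
}
```

No objects are pooled or collected per term; the GC only ever sees a handful of arrays, which is the point of the proposal.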
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846938#action_12846938 ] Earwin Burrfoot commented on LUCENE-2328: - How would IndexInput report back to the Directory when its close() was called? I've checked a couple of Directories, and when they openInput, they don't pass themselves to the IndexInput. Hmm. I guess I have to change the IndexOutput impls so that the Directory tracks the files it hasn't synced yet? Sure IndexWriter.synced field accumulates data leading to a Memory Leak --- Key: LUCENE-2328 URL: https://issues.apache.org/jira/browse/LUCENE-2328 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1 Environment: all Reporter: Gregor Kaczor Priority: Minor Fix For: 3.1 Original Estimate: 1h Remaining Estimate: 1h I am running into a strange OutOfMemoryError. My small test application indexes and deletes a few files. This is repeated 60k times. Optimization is run every 2k files indexed. Index size is 50KB. I analyzed the heap dump file and realized that the IndexWriter.synced field occupied more than half of the heap. That field is a private HashSet without a getter. Its task is to hold files which have been synced already. There are two calls to addAll and one call to add on synced, but no remove or clear, throughout the lifecycle of the IndexWriter instance. According to the Eclipse Memory Analyzer, synced contains 32618 entries which look like file names (_e065_1.del or _e067.cfs). The index directory contains only 10 files. I guess synced is holding obsolete data.
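The leak pattern described in the report boils down to an add-only set: every synced file name is recorded, but nothing is removed when the file is deleted. A minimal, hypothetical illustration (not Lucene's actual code):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the reported leak: a set that records every file ever synced,
// with no removal on delete. Over 60k index/delete cycles the set grows
// without bound even though the directory holds only a handful of live files.
class SyncedSetLeak {
    private final Set<String> synced = new HashSet<>();

    void onSync(String file) { synced.add(file); }           // add-only: the bug
    void onDelete(String file) { /* missing: synced.remove(file); */ }

    int trackedFiles() { return synced.size(); }
}
```

Running the reporter's workload against this sketch shows the set retaining one entry per file ever created, matching the 32618 stale entries seen in the heap dump versus 10 live files on disk.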
[jira] Created: (LUCENE-2332) Merge CharTermAttribute and deprecations to trunk
Merge CharTermAttribute and deprecations to trunk Key: LUCENE-2332 URL: https://issues.apache.org/jira/browse/LUCENE-2332 Project: Lucene - Java Issue Type: New Feature Affects Versions: 3.1 Reporter: Uwe Schindler Assignee: Uwe Schindler This should be merged to trunk before flex lands, so the analyzers can be ported to the new API.
[jira] Commented: (LUCENE-2332) Merge CharTermAttribute and deprecations to trunk
[ https://issues.apache.org/jira/browse/LUCENE-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846943#action_12846943 ] Robert Muir commented on LUCENE-2332: - I agree. This gives us a chance to make sure it really works the way we want, by porting all our own analyzers to the attribute. Also, we can hopefully simplify/improve some code (e.g. PatternReplaceFilter) with the new capabilities.
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846944#action_12846944 ] Michael McCandless commented on LUCENE-2328: Keeping track of not-yet-sync'd files instead of sync'd files is better, but it still requires upkeep (i.e. when a file is deleted you have to remove it), because files can be opened, written to, closed, and deleted without ever being sync'd. And I like moving this tracking under Dir -- that's where it belongs. bq. I assume that on calling syncEveryoneAndHisDog() you should sync all files that have been written to, and were closed, and not yet deleted. This will over-sync in some situations, i.e., causing commit to take longer than it should. E.g. say a merge has finished with the first set of files (say _X.fdx/t, since it merges fields first) but is still working on postings when the user calls commit. We should not then sync _X.fdx/t because they are unreferenced by the segments_N we are committing. Or the merge has finished (so _X.* has been created) but is now off building the _X.cfs file -- we don't want to sync _X.*, only _X.cfs when it's done. Another example: we don't do this today, but addIndexes should really run fully outside of IW's normal segments file, merging away, and then only on final success alter IW's segmentInfos. If we switch to that, we don't want to sync all the files that addIndexes is temporarily writing... The knowledge of which files make up the transaction lives above the Directory... so I think we should retain the per-file control. I proposed the bulk-sync API so that Dir impls could choose to do a system-wide sync; or, more generally, to help any Dir that can be more efficient when it knows the precise set of files that must be sync'd right now. If we stick with a file-by-file API, doing a system-wide sync is somewhat trickier... because you can't assume from one call to the next that nothing has changed. 
Also, bulk sync better matches the semantics IW/IR require: these consumers don't care about the order in which the files are sync'd; they just care that the requested set is sync'd. So it exposes a degree of freedom to the Dir impls that's otherwise hidden today.
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846946#action_12846946 ] Michael McCandless commented on LUCENE-2329: This issue is just about how IndexWriter's RAM buffer stores its terms... But, the flex API adds long ord() and seek(long ord) to the TermsEnum API.
[jira] Commented: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java
[ https://issues.apache.org/jira/browse/LUCENE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846947#action_12846947 ] Michael McCandless commented on LUCENE-2327: Yikes -- you had 10 corrupted segments (of 23), and there are at least 4 different flavors of corruption across those segments! Curious... What storage device did you store the index on? ;) Note that the fix just drops those segments from the index, so any docs that were in them are lost. IndexOutOfBoundsException in FieldInfos.java Key: LUCENE-2327 URL: https://issues.apache.org/jira/browse/LUCENE-2327 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 3.0.1 Environment: Fedora 12 Reporter: Shane Priority: Minor Attachments: CheckIndex.txt When retrieving the scoreDocs from a MultiSearcher, the following exception is thrown: java.lang.IndexOutOfBoundsException: Index: 52, Size: 4 at java.util.ArrayList.rangeCheck(ArrayList.java:571) at java.util.ArrayList.get(ArrayList.java:349) at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:285) at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:274) at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:86) at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131) at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162) at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:232) at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:179) at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:911) at org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:644) The error is caused when the fieldNumber passed to FieldInfos.fieldInfo() is greater than the size of the array list containing the FieldInfo values. I am not sure what the field number represents or why it would be larger than the array list's size. 
The quick fix would be to validate the bounds, but there may be a bigger underlying problem. The issue does appear to be directly related to LUCENE-939. I've only been able to duplicate this in my production environment and so can't give a good test case.
[jira] Reopened: (LUCENE-2302) Replacement for TermAttribute+Impl with extended capabilities (byte[] support, CharSequence, Appendable)
[ https://issues.apache.org/jira/browse/LUCENE-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler reopened LUCENE-2302: --- Assignee: Uwe Schindler Add a note to the backwards compatibility section: - TermAttribute's toString() behaviour has changed - Token now implements CharSequence but violates its contract
[jira] Commented: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java
[ https://issues.apache.org/jira/browse/LUCENE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846954#action_12846954 ] Shane commented on LUCENE-2327: --- I believe at the time we were storing on a NAS via NFS. If my memory serves me well, there were known issues with running Lucene over NFS at the time. We were experiencing issues with the file system, so we have since moved to a different architecture. Also, I was aware that the fix drops the segments, but thanks anyway. :)
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846956#action_12846956 ] Earwin Burrfoot commented on LUCENE-2328: - bq. Keeping track of not-yet-sync'd files instead of sync'd files is better, but it still requires upkeep (ie when file is deleted you have to remove it) because files can be opened, written to, closed, deleted without ever being sync'd. You can just skip this and handle the FileNotFound exception when syncing. You have to handle it anyway; there are no guarantees some file won't be snatched from under your nose. bq. This will over-sync in some situations. I don't feel this is a serious problem. If you over-sync (in fact, sync some files a little earlier than strictly required), in a few seconds you will under-sync, so the total time is still the same. But I feel you're somewhat missing the point. System-wide sync is not the original aim; it's just a possible byproduct of what is the original aim - to move the sync-tracking code from IW to Directory. And I don't see at all how adding batch-syncs achieves this. If you're calling sync(Collection&lt;String&gt;), damn, you should keep that collection somewhere :) and it is supposed to be inside!
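Earwin's "just handle FileNotFound" suggestion can be sketched concretely: rather than pruning deleted files from a tracking set, attempt the sync and swallow the exception for files deleted out from under us. The class and method names below are hypothetical, not Lucene's Directory API.

```java
import java.io.FileNotFoundException;
import java.util.Collection;

// Sketch: a sync that tolerates files deleted between the sync request and
// the actual fsync, instead of requiring upkeep of the tracking set on delete.
abstract class TolerantSyncDirectory {
    // stand-in for the per-file fsync a concrete Directory would perform
    protected abstract void fsync(String name) throws FileNotFoundException;

    public void sync(Collection<String> names) {
        for (String name : names) {
            try {
                fsync(name);
            } catch (FileNotFoundException e) {
                // file was deleted before we got to it: nothing left to sync
            }
        }
    }
}
```

Note Mike's counterpoint in the follow-up: IW/IR actually guarantee they never ask for a deleted file to be sync'd, so this tolerance is a safety net rather than a required behavior.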
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846960#action_12846960 ] Michael McCandless commented on LUCENE-2328: {quote} bq. Keeping track of not-yet-sync'd files instead of sync'd files is better, but it still requires upkeep (ie when file is deleted you have to remove it) because files can be opened, written to, closed, deleted without ever being sync'd. You can just skip this and handle FileNotFound exception when syncing. Have to handle it anyway, no guarantees some file won't be snatched from under your nose. {quote} IW/IR do in fact guarantee they will never ask for a deleted file to be sync'd. If they ever do that, we have more serious problems ;) {quote} bq. This will over-sync in some situations. Don't feel this is a serious problem. If you over-sync (in fact sync some files a little bit earlier than strictly required), in a few seconds you will under-sync, so total time is still the same. {quote} I think this is important -- commit is already slow enough -- why make it slower? Further, the extra files you sync'd may never have needed to be sync'd at all (they will be merged away). My examples above include such cases. Turning this around... what's so bad about keeping the sync tracking per file? bq. System-wide sync is not the original aim, it's just a possible byproduct of what is the original aim I know this is not the aim of this issue, rather just a nice by-product if we switch to a global sync method. bq. to move sync tracking code from IW to Directory. Right, this is a great step forward, as long as we don't slow commit by dumbing down the API :) bq. And I don't see at all how adding batch-syncs achieves this. You're right: this doesn't achieve / is not required for moving sync'd-file tracking down to Dir. It's orthogonal, but it is another way that we could allow Dir impls to do a global sync. I'm proposing this as a separate change, to make the API better match the needs of its consumers. In fact, the OS really ought to allow for this as well (but I know of none that do), since it'd give the IO scheduler more freedom on which bytes need to be moved to disk. We can open this one as a separate issue...
[jira] Resolved: (LUCENE-2327) IndexOutOfBoundsException in FieldInfos.java
[ https://issues.apache.org/jira/browse/LUCENE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-2327. Resolution: Invalid OK I'm resolving as optimistically invalid :)
[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException
[ https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846984#action_12846984 ] Michael McCandless commented on LUCENE-2280: From the log I can see that you run fine for a long time: opening IW, indexing a few docs, optimizing, then closing. Then suddenly the exceptions start happening on many (but not all) merges, and on merges involving different segments. A JRE bug seems most likely, I guess... Since you see this only on Windows (not e.g. on Linux), I think this is likely not a bug in Lucene but rather something particular about your Windows env -- a virus checker, maybe? Is there anything in the Windows event log that correlates to when the exceptions start? Or it could be a JRE bug -- you really should try a different (Sun) JRE. IndexWriter.optimize() throws NullPointerException -- Key: LUCENE-2280 URL: https://issues.apache.org/jira/browse/LUCENE-2280 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.3.2 Environment: Win 2003, lucene version 2.3.2, IBM JRE 1.6 Reporter: Ritesh Nigam Attachments: lucene.jar, lucene.zip I am using the Lucene 2.3.2 search APIs for my application. I am indexing a 45GB database which creates an approx. 200MB index file. After finishing the indexing, and while running optimize(), I can see a NullPointerException thrown in my log, and the index file is getting corrupted. The log says: Caused by: java.lang.NullPointerException at org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:49) at org.apache.lucene.store.IndexOutput.writeBytes(IndexOutput.java:40) at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:566) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:135) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3273) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2968) at 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240) and this is happening quite frequently, although I am not able to reproduce it on demand. I saw an issue logged which is somewhat related to my issue (http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200809.mbox/%3c6e4a40db-5efc-42da-a857-d59f4ec34...@mikemccandless.com%3e), but the only difference here is that I am not using Store.Compress for my fields; I am using Store.NO instead. Please note that I am using the IBM JRE for my application. Is this an issue with Lucene? If yes, in which version is it fixed?
[jira] Commented: (LUCENE-2280) IndexWriter.optimize() throws NullPointerException
[ https://issues.apache.org/jira/browse/LUCENE-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846985#action_12846985 ] Michael McCandless commented on LUCENE-2280: Yes close() does commit() internally. Are you saying you see the same exception on 3.0, using the IBM JRE? Can you try with the Sun JRE? IndexWriter.optimize() throws NullPointerException -- Key: LUCENE-2280 URL: https://issues.apache.org/jira/browse/LUCENE-2280 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.3.2 Environment: Win 2003, lucene version 2.3.2, IBM JRE 1.6 Reporter: Ritesh Nigam Attachments: lucene.jar, lucene.zip I am using lucene 2.3.2 search APIs for my application, i am indexing 45GB database which creates approax 200MB index file, after finishing the indexing and while running optimize() i can see NullPointerExcception thrown in my log and index file is getting corrupted, log says Caused by: java.lang.NullPointerException at org.apache.lucene.store.BufferedIndexOutput.writeBytes(BufferedIndexOutput.java:49) at org.apache.lucene.store.IndexOutput.writeBytes(IndexOutput.java:40) at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:566) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:135) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3273) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2968) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:240) and this is happening quite frequently, although I am not able to reproduce it on demand, I saw an issue logged which is some what related to mine issue (http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200809.mbox/%3c6e4a40db-5efc-42da-a857-d59f4ec34...@mikemccandless.com%3e) but the only difference here is I am not using Store.Compress for my fields, i am using Store.NO instead. please note that I am using IBM JRE for my application. 
Is this an issue with Lucene? If yes, in which version is it fixed? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846991#action_12846991 ] Earwin Burrfoot commented on LUCENE-2328: - Okay, summing up.

1. Directory gets a new method - sync(Collection<String>); it will become abstract in 4.0, but for now it delegates by default to the current sync(String), which is deprecated.
2. FSDirectory tracks newly written, closed, and not-yet-deleted files, by changing FSD.IndexOutput accordingly.
3. sync() semantics change from "sync this now" to "sync this now, if you think it's needed". No-op sync() impls like RAMDir stay no-ops; FSDir syncs only those files that exist in its tracking set and ignores all others.
4. IW/IR stop tracking synced files completely (lots of garbage code gone from IW), and instead call sync(Collection) on commit with a list of all files that constitute said commit.

These steps preserve back-compatibility (except for custom Directory impls in which calling sync on the same file repeatedly is costly; they will suffer performance degradation), ensure that for each commit only the strictly requested subset of files is synced (the thing Mike insisted on), and completely remove sync-tracking code from IW and IR.

5. We open another issue to experiment with batch syncing and various filesystems. Some relevant fun data: http://www.humboldt.co.uk/2009/03/fsync-across-platforms.html

IndexWriter.synced field accumulates data leading to a Memory Leak --- Key: LUCENE-2328 URL: https://issues.apache.org/jira/browse/LUCENE-2328 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1 Environment: all Reporter: Gregor Kaczor Priority: Minor Fix For: 3.1 Original Estimate: 1h Remaining Estimate: 1h I am running into a strange OutOfMemoryError. My small test application indexes and deletes a few files. This is repeated 60k times. Optimization is run every 2k files indexed.
Index size is 50KB. I analyzed the heap dump file and realized that the IndexWriter.synced field occupied more than half of the heap. That field is a private HashSet without a getter. Its task is to hold files which have been synced already. There are two calls to addAll and one call to add on synced, but no remove or clear throughout the lifecycle of the IndexWriter instance. According to the Eclipse Memory Analyzer, synced contains 32618 entries which look like file names (_e065_1.del or _e067.cfs), while the index directory contains only 10 files. I guess synced is holding obsolete data.
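The back-compatible default in point 1 of the plan above could be sketched roughly as below. The class shape is a simplified stand-in for illustration only, not Lucene's actual Directory API:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Simplified stand-in for Lucene's Directory, for illustration only.
abstract class Directory {
    /** Old per-file sync, to be deprecated. */
    @Deprecated
    public abstract void sync(String name);

    /** Proposed bulk sync. The default delegates to the deprecated
        per-file method, so existing subclasses keep working unchanged
        until the method becomes abstract. */
    public void sync(Collection<String> names) {
        for (String name : names) {
            sync(name);
        }
    }
}

// Example subclass that records which files it was asked to sync.
class RecordingDirectory extends Directory {
    final List<String> synced = new ArrayList<>();

    @Override
    public void sync(String name) {
        synced.add(name);
    }
}
```

With this default in place, IndexWriter can simply hand the whole commit file set to sync(Collection) and drop its own synced-file bookkeeping.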
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846996#action_12846996 ] Shai Erera commented on LUCENE-2328: bq. changing FSD.IndexOutput accordingly This worries me a bit. If only FSD.IndexOutput will do that, I'm afraid other Directory implementations won't realize that they should do so as well (NIO?). I'd prefer if IndexOutput, by contract, is supposed to call back on Directory upon close ... not sure - maybe just put some heavy documentation around createOutput? If we could enforce this API-wise, and let the Dirs that don't care simply ignore it, then it'd be better. It'll also allow someone to extend FSD.createOutput, return his own IndexOutput, and not worry (or do, but knowingly) about calling back to the Dir. Other than that - this looks great.
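One way to enforce the callback API-wise, as suggested above, is to make the base output's close() final and have it notify its directory. A hypothetical sketch (these are not Lucene's real classes; names are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: an output that always reports back to its
// directory on close, so the directory can track sync-eligible files.
abstract class TrackingIndexOutput {
    private final TrackingDirectory dir;
    private final String name;

    TrackingIndexOutput(TrackingDirectory dir, String name) {
        this.dir = dir;
        this.name = name;
    }

    /** Subclasses flush and release resources here. */
    protected abstract void closeInternal();

    /** Final, so a subclass cannot forget the callback. */
    public final void close() {
        closeInternal();
        dir.onOutputClosed(name);
    }
}

class TrackingDirectory {
    final Set<String> syncEligible = new HashSet<>();

    void onOutputClosed(String name) {
        syncEligible.add(name);
    }
}
```

A subclass that overrides createOutput then gets the tracking behavior for free, instead of having to remember to wire it up, which is the concern raised above.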
[jira] Commented: (LUCENE-2331) Add NoOpMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846998#action_12846998 ] Shai Erera commented on LUCENE-2331: I like NoMergesPolicy ... perhaps, like NoLockFactory, we can call it NoMergePolicy? so MP is preserved in the name (not that it's critical)? Add NoOpMergePolicy --- Key: LUCENE-2331 URL: https://issues.apache.org/jira/browse/LUCENE-2331 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Shai Erera Fix For: 3.1 I'd like to add a simple and useful MP implementation which does nothing! :). I've come across many places where the following is either documented or implemented: if you want to prevent merges, set mergeFactor to a high enough value. I think a NoOpMergePolicy is just as good, and can REALLY let you disable merges (rather than, say, setting mergeFactor to Int.MAX_VAL). As such, NoOpMergePolicy will be introduced as a singleton, and can be used for convenience purposes only. Also, for Parallel Index it's important, because I'd like the slices to never do any merges, unless ParallelWriter decides so. So they should be set w/ that MP. I have a patch ready. Waiting for LUCENE-2320 to go in, so that I don't need to change it afterwards. About the name - I like the name, but suggestions are welcome. I thought of a NullMergePolicy, but I don't like 'Null' used for a NoOp.
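The proposed singleton no-op shape is trivial to sketch. The interface below is a simplified stand-in for Lucene's MergePolicy (whose real methods take SegmentInfos and return merge specifications); only the "never select a merge" singleton pattern is the point:

```java
import java.util.Collections;
import java.util.List;

// Simplified stand-in for MergePolicy; the real Lucene class has a
// different, richer signature. Illustrative only.
interface SimpleMergePolicy {
    /** Returns the groups of segments that should be merged now. */
    List<List<String>> findMerges(List<String> segments);
}

// A merge policy that never selects any merge, exposed as a singleton.
final class NoMergePolicy implements SimpleMergePolicy {
    public static final NoMergePolicy INSTANCE = new NoMergePolicy();

    private NoMergePolicy() {}  // singleton: no public construction

    @Override
    public List<List<String>> findMerges(List<String> segments) {
        return Collections.emptyList();  // never merge anything
    }
}
```

Compared with setting mergeFactor very high, this genuinely guarantees no merge is ever selected, which is exactly what the Parallel Index slices described above need.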
[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12846997#action_12846997 ] Shai Erera commented on LUCENE-2320: Mike - are you reviewing it? I think I fixed all mentioned comments. Add MergePolicy to IndexWriterConfig Key: LUCENE-2320 URL: https://issues.apache.org/jira/browse/LUCENE-2320 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Shai Erera Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch, LUCENE-2320.patch Now that IndexWriterConfig is in place, I'd like to move MergePolicy to it as well. The change is not straightforward, so I've kept it for a separate issue. MergePolicy requires an IndexWriter in its ctor, however none can be passed to it before an IndexWriter actually exists. And today IW may create an MP just for it to be overridden by the application one line afterwards. I don't want to make the iw member of MP non-final, or settable by extending classes; however, it needs to remain protected so they can access it directly. So the proposed changes are: * Add a SetOnce object (to o.a.l.util), or Immutable, which can only be set once (hence its name). It'll have the signature SetOnce<T> w/ *synchronized set(T)* and *T get()*. T will be declared volatile, so that get() won't be synchronized. * MP will define a *protected final SetOnce<IndexWriter> writer* instead of the current writer. *NOTE: this is a bw break*. Any suggestions are welcome. * MP will offer a public default ctor, together with a set(IndexWriter). * IndexWriter will set itself on MP using set(this). Note that if set is called more than once, it will throw an exception (AlreadySetException - or does someone have a better suggestion, preferably an already existing Java exception?). That's the core idea. I'd like to post a patch soon, so I'd appreciate your review and proposals.
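The SetOnce proposed above could be sketched as follows, matching the description (synchronized set, volatile field so get() needs no lock). The exception name is still an open question, so a placeholder is used here:

```java
// Sketch of the proposed SetOnce: the wrapped value can be set exactly once.
final class SetOnce<T> {
    private volatile T obj = null;  // volatile, so get() needs no lock
    private boolean set = false;    // guarded by synchronized set()

    /** Sets the value; a second call throws. IllegalStateException is a
        placeholder for the proposed AlreadySetException. */
    public synchronized void set(T obj) {
        if (set) {
            throw new IllegalStateException("The object cannot be set twice!");
        }
        this.obj = obj;
        set = true;
    }

    /** Unsynchronized read; safe because the field is volatile. */
    public T get() {
        return obj;
    }
}
```

With this, MergePolicy can hold a protected final SetOnce<IndexWriter> that IndexWriter fills in via set(this), and any second assignment fails loudly.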
[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847008#action_12847008 ] Michael McCandless commented on LUCENE-2320: The patch looks great Shai -- I plan to commit in a day or two. I added @lucene.experimental to SetOnce's jdocs, and also removed stale javadoc in MP and MS saying that you need access to package-private APIs (unrelated to this issue, but I spotted it ;) ).
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847010#action_12847010 ] Earwin Burrfoot commented on LUCENE-2328: - Every Directory implementation decides how to handle sync() calls on its own. The fact that FSDir (and descendants) do this performance optimization is their implementation detail. I don't want to bind this somehow into the base class. But I will note in the javadocs for sync() that clients may pass the same file over and over again, so you might want to optimize for this.
Re: lucene and solr trunk
Alright, so we have implemented Hoss' suggestion here on the lucene/solr merged dev branch at lucene/solr/branches/newtrunk. Feel free to check it out and give some feedback. We also roughly have Solr running on Lucene trunk - e.g. compiling Solr will first compile Lucene and run off those compiled class files. Running dist or example in Solr will grab Lucene's jars and put them in the war. This still needs further love, but it works. There is also a top-level build.xml with two targets: clean and test. Clean will clean both Lucene and Solr, and test will run tests for both Lucene and Solr. Thanks to everyone that contributed to getting all this working! -- - Mark http://www.lucidimagination.com

On 03/17/2010 12:40 PM, Mark Miller wrote: Okay, so this looks good to me (a few others seemed to like it - though Lucene-Dev was somehow dropped earlier) - let's try this out on the branch? (then we can get rid of that horrible branch name ;) ) Anyone on the current branch object to having to do a quick svn switch?

On 03/16/2010 06:46 PM, Chris Hostetter wrote:
: Otis, yes, I think so, eventually. But that's gonna take much more discussion.
:
: I don't think this initial cutover should try to solve how modules
: will be organized, yet... we'll get there, eventually.

But we should at least consider it, and not move in a direction that's distinct from the ultimate goal of better refactoring (especially since that was one of the main goals of unifying development efforts). Here's my concrete suggestion that could be done today (for simplicity: $svn = https://svn.apache.org/repos/asf/lucene)...

svn mv $svn/java/trunk $svn/java/tmp-migration
svn mkdir $svn/java/trunk
svn mv $svn/solr/trunk $svn/java/trunk/solr
svn mv $svn/java/tmp-migration $svn/java/trunk/core

At which point:

0. People who want to work only on Lucene-Java can start checking out $svn/java/trunk/core (I'm pretty sure existing checkouts will continue to work w/o any changes; the svn info should just update itself)
1. build files can be added to (the new) $svn/java/trunk to build ./core followed by ./solr
2. the build files in $svn/java/trunk/solr can be modified to look at ../core/ to find Lucene jars
3. people who care about Solr (including all committers) should start checking out and building all of $svn/java/trunk
4. Long term, we could choose to branch all of $svn/java/trunk for releases ... AND/OR we could choose to branch specific modules (ie: solr) independently (with modifications to the build files on those branches to pull in their dependencies from alternate locations)
5. Long term, we can start refactoring additional modules out of $svn/java/trunk/solr and $svn/java/trunk/core (like $svn/java/trunk/core/contrib) into their own directory in $svn/java/trunk
6. Long term, people who want to work on more than just core but don't care about certain modules (like solr) can do a simple non-recursive checkout of $svn/java/trunk and then do full checkouts of whatever modules they care about

(Please note: I'm just trying to list things we *could* do if we go this route; I'm not advocating that we *should* do any of these things.) I can't think of any objections people have raised to any of the previous suggestions which apply to this suggestion. Is there anything people can think of that would be useful, but not possible, if we go this route?

-Hoss
Re: lucene and solr trunk
All tests pass for me :)

Mike

On Thu, Mar 18, 2010 at 12:27 PM, Mark Miller markrmil...@gmail.com wrote:
[quoted message trimmed; same text as above]
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847015#action_12847015 ] Michael McCandless commented on LUCENE-2328: Must the Dir insist the file is closed in order to sync it? Why not enroll newly created files in the to-be-sync'd set?
[jira] Commented: (LUCENE-2331) Add NoOpMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847018#action_12847018 ] Michael McCandless commented on LUCENE-2331: +1 for NoMergePolicy
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847024#action_12847024 ] Michael Busch commented on LUCENE-2329: --- bq. This issue is just about how IndexWriter's RAM buffer stores its terms... Actually, when I talked about the TermVectors I meant we should explore storing the termIDs on *disk*, rather than the strings. It would help things like similarity search and facet counting. {quote} But, note that term vectors today do not store the term char[] again - they piggyback on the term char[] already stored for the postings. {quote} Yeah, I think I'm familiar with that part (secondary entry point in TermsHashPerField, hashes based on termStart). I haven't looked much into how the rest of the TermVector in-memory data structures work. {quote} Though, I believe they store int textStart (increments by term length per unique term), which is less compact than the termID would be (increments +1 per unique term) {quote} Actually, we wouldn't need a second hashtable for the secondary TermsHash anymore, right? It would just have, like the primary TermsHash, a parallel array with the things that the TermVectorsTermsWriter.PostingList class currently contains (freq, lastOffset, lastPosition)? And the index into that array would be the termID, of course. This would be a nice simplification, because no hash collisions, no hash table resizing based on load factor, etc. would be necessary for non-primary TermsHashes. bq. so if eg we someday use packed ints we'd be more RAM efficient by storing termIDs... How does the read performance of packed ints compare to normal int[] arrays? I think nowadays RAM is less of an issue? And with a searchable RAM buffer we might want to sacrifice a bit more RAM for higher search performance? Oh man, will we need flexible indexing for the in-memory index too? :) Use parallel arrays instead of PostingList objects -- Key: LUCENE-2329 URL: https://issues.apache.org/jira/browse/LUCENE-2329 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.1 This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324. In order to avoid having very many long-living PostingList objects in TermsHashPerField, we want to switch to parallel arrays. The termsHash will simply be an int[] which maps each term to dense termIDs. All data that the PostingList classes currently hold will then be placed in parallel arrays, where the termID is the index into the arrays. This will avoid the need for object pooling and will remove the overhead of object initialization and garbage collection. Garbage collection especially should benefit significantly when the JVM runs out of memory, because in such a situation the gc mark times can get very long if there is a big number of long-living objects in memory. Another benefit could be to build more efficient TermVectors. We could avoid the need of having to store the term string per document in the TermVector. Instead we could just store the segment-wide termIDs. This would reduce the size and also make it easier to implement efficient algorithms that use TermVectors, because no term mapping across documents in a segment would be necessary. We can make that improvement in a separate JIRA issue, though.
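The core transformation can be sketched in miniature: instead of one long-lived posting object per term, each per-term field lives in an int[] column indexed by a dense termID. The names below are illustrative, not Lucene's actual internals:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Miniature sketch of the parallel-arrays idea: per-term data lives in
// int[] columns indexed by a dense termID, not in per-term objects.
class ParallelPostings {
    private final Map<String, Integer> termToId = new HashMap<>();
    private int[] freqs = new int[8];
    private int[] lastPositions = new int[8];

    /** Returns the dense termID, assigning the next ID to unseen terms. */
    int termId(String term) {
        return termToId.computeIfAbsent(term, t -> {
            int id = termToId.size();          // IDs are 0, 1, 2, ... (dense)
            if (id >= freqs.length) {          // grow all columns together
                freqs = Arrays.copyOf(freqs, id * 2);
                lastPositions = Arrays.copyOf(lastPositions, id * 2);
            }
            return id;
        });
    }

    void addOccurrence(String term, int position) {
        int id = termId(term);
        freqs[id]++;
        lastPositions[id] = position;
    }

    int freq(String term) {
        return freqs[termId(term)];
    }
}
```

Because the columns are plain primitive arrays, the garbage collector sees a handful of array objects instead of one object per unique term, which is exactly the mark-time win described in the issue.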
[jira] Commented: (LUCENE-2320) Add MergePolicy to IndexWriterConfig
[ https://issues.apache.org/jira/browse/LUCENE-2320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847034#action_12847034 ] Shai Erera commented on LUCENE-2320: Thanks Mike!
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847036#action_12847036 ] Shai Erera commented on LUCENE-2328: Yeah I guess I wasn't clear enough. So suppose someone sub-classes FSDir and overrides createOutput. How should he know his IndexOutput should call dir.sync()? How should he know he needs to pass the Dir to his IndexOutput? So I suggested to either mention it in the Javadocs, or somehow make all of FSDir's outputs know about that, API-wise ... So today a file is closed only upon commit (?), and it's then that it's synced? If so, why would you want to sync a file that is still open? I guess it cannot harm, but what's the use case? IndexWriter.synced field accumulates data leading to a Memory Leak --- Key: LUCENE-2328 URL: https://issues.apache.org/jira/browse/LUCENE-2328 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.9.1, 2.9.2, 3.0, 3.0.1 Environment: all Reporter: Gregor Kaczor Priority: Minor Fix For: 3.1 Original Estimate: 1h Remaining Estimate: 1h I am running into a strange OutOfMemoryError. My small test application does index and delete some few files. This is repeated for 60k times. Optimization is run from every 2k times a file is indexed. Index size is 50KB. I did analyze the HeapDumpFile and realized that IndexWriter.synced field occupied more than half of the heap. That field is a private HashSet without a getter. Its task is to hold files which have been synced already. There are two calls to addAll and one call to add on synced but no remove or clear throughout the lifecycle of the IndexWriter instance. According to the Eclipse Memory Analyzer synced contains 32618 entries which look like file names _e065_1.del or _e067.cfs The index directory contains 10 files only. I guess synced is holding obsolete data -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
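The add-only behavior Gregor describes is easy to reproduce in miniature. Below is a hedged sketch of the pattern (class and method names are invented for illustration and are not Lucene's actual internals): a set of synced file names that is only ever grown retains entries for files long since deleted, which is exactly the accumulation the heap dump showed.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the leak pattern described in LUCENE-2328: a set of
// synced file names that only ever grows, even though the index directory
// itself stays small. Names here are illustrative, not Lucene's code.
class SyncTracker {
    private final Set<String> synced = new HashSet<>();

    void markSynced(String fileName) {
        synced.add(fileName); // add-only: nothing ever removes entries
    }

    // The direction a fix could take: drop entries for files that no
    // longer exist, so the set only tracks live files.
    void pruneDeleted(Set<String> liveFiles) {
        synced.retainAll(liveFiles);
    }

    int size() {
        return synced.size();
    }
}
```

After pruning against the set of live files, only entries for files still in the directory survive.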
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847050#action_12847050 ] Michael McCandless commented on LUCENE-2328: In the current proposal, IndexOutput won't call dir.sync. All it will do is notify the dir when it is closed, so the dir will record that filename as eligible for commit. Lucene today never syncs a file until after it's closed, but conceivably some day it could. Or others who use the Dir API to write their own files could. At the OS level this is perfectly fine (in fact you have to pass an open fd to fsync). It seems presumptuous of the directory to silently ignore a call to sync just because the file hadn't been closed yet...
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847058#action_12847058 ] Michael McCandless commented on LUCENE-2329: bq. Actually, when I talked about the TermVectors I meant we should explore storing the termIDs on disk, rather than the strings. It would help things like similarity search and facet counting. Ah, that would be great! bq. Actually we wouldn't need a second hashtable for the secondary TermsHash anymore, right? It would, like the primary TermsHash, just have a parallel array with the things that the TermVectorsTermsWriter.PostingList class currently contains (freq, lastOffset, lastPosition)? And the index into that array would be the termID of course. Hmm, the challenge is that the tracking done for term vectors is just within a single doc. Ie the hash used for term vectors only holds the terms for that one doc (so it's much smaller), vs the primary hash that holds terms for all docs in the current RAM buffer. So we'd be burning up much more RAM if we also keyed into the term vector's parallel arrays using the primary term id? But I do think we should cut over to parallel arrays for TVTW, too. bq. How does the read performance of packed ints compare to normal int[] arrays? I think nowadays RAM is less of an issue? And with a searchable RAM buffer we might want to sacrifice a bit more RAM for higher search performance? It's definitely slower to read/write to/from packed ints, and I agree, indexing and searching speed trumps RAM efficiency. bq. Oh man, will we need flexible indexing for the in-memory index too? EG custom attrs appearing in the TokenStream? Yes we will need to... but hopefully once we get serialization working cleanly for the attrs this'll be easy? With ByteSliceWriter/Reader you just .writeBytes and .readBytes... I don't think we should allow Codecs to be used in the RAM buffer anytime soon though... ;)

Use parallel arrays instead of PostingList objects -- Key: LUCENE-2329 URL: https://issues.apache.org/jira/browse/LUCENE-2329 Project: Lucene - Java Issue Type: Improvement Components: Index Reporter: Michael Busch Assignee: Michael Busch Priority: Minor Fix For: 3.1 This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324. In order to avoid having very many long-lived PostingList objects in TermsHashPerField, we want to switch to parallel arrays. The terms hash will simply be an int[] which maps each term to a dense termID. All data that the PostingList classes currently hold will then be placed in parallel arrays, where the termID is the index into the arrays. This will avoid the need for object pooling and will remove the overhead of object initialization and garbage collection. Garbage collection especially should benefit significantly when the JVM runs out of memory, because in such a situation the GC mark times can get very long if there is a large number of long-lived objects in memory. Another benefit could be to build more efficient TermVectors. We could avoid having to store the term string per document in the TermVector; instead we could just store the segment-wide termIDs. This would reduce the size and also make it easier to implement efficient algorithms that use TermVectors, because no term mapping across documents in a segment would be necessary. This improvement can be made in a separate JIRA issue, though.
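The parallel-array layout discussed above can be sketched roughly as follows (names and layout are illustrative assumptions, not the actual patch): the terms hash yields a dense termID, and all per-term state becomes plain primitive-array lookups instead of per-term PostingList objects.

```java
// Sketch of the parallel-array idea from LUCENE-2329: instead of one
// PostingList object per term, per-term state lives in parallel primitive
// arrays indexed by a dense termID. Field and method names are invented
// for illustration.
class ParallelPostings {
    int[] freqs;         // document frequency per termID
    int[] lastDocIDs;    // last doc seen per termID (to detect new docs)
    int[] lastPositions; // last position seen per termID
    int numTerms;

    ParallelPostings(int capacity) {
        freqs = new int[capacity];
        lastDocIDs = new int[capacity];
        lastPositions = new int[capacity];
    }

    // The terms hash would map a term's bytes to this dense termID; no
    // per-term object is allocated, pooled, or garbage-collected.
    int addTerm() {
        int termID = numTerms++;
        lastDocIDs[termID] = -1; // no doc seen yet
        return termID;
    }

    void addOccurrence(int termID, int docID, int position) {
        if (lastDocIDs[termID] != docID) { // first occurrence in this doc
            lastDocIDs[termID] = docID;
            freqs[termID]++;
        }
        lastPositions[termID] = position;
    }
}
```

Growing the arrays (and the realloc policy) is where the real patch would spend its care; the sketch assumes a fixed capacity for brevity.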
[jira] Commented: (LUCENE-2328) IndexWriter.synced field accumulates data leading to a Memory Leak
[ https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847063#action_12847063 ] Michael McCandless commented on LUCENE-2328: Yes, please clean as you go Earwin -- those sound great. {quote} bq. Must the Dir insist the file is closed in order to sync it? Well, no, this can be relaxed. Because default Directory clients - IW+IR - will never call sync() on a file they didn't close yet. Also this client behaviour is guaranteed with the current implementation - if someone calls current sync() on an open file, it will fail on 'new RandomAccessFile'? {quote} I'd like to allow for this to work in the future, even if current FSDir impls cannot sync an open file. EG conceivably they could reach in and get the RAF that IndexOutput has open and sync it. So I think we just note this as a limitation of FSDir impls today, but the API allows for it?
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847068#action_12847068 ] Michael Busch commented on LUCENE-2329: --- bq. Hmm the challenge is that the tracking done for term vectors is just within a single doc. Duh! Of course you're right.
[jira] Commented: (LUCENE-2331) Add NoOpMergePolicy
[ https://issues.apache.org/jira/browse/LUCENE-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847086#action_12847086 ] Shai Erera commented on LUCENE-2331: In the process, I'll also add a NoMergeScheduler which will have empty implementations of MS. That's kind of redundant if one uses NoMP; however, it's nice to have for symmetry, as well as for not running any unnecessary code - like CMS and its threads - just to discover the MP returned nothing.

Add NoOpMergePolicy --- Key: LUCENE-2331 URL: https://issues.apache.org/jira/browse/LUCENE-2331 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Shai Erera Fix For: 3.1 I'd like to add a simple and useful MP implementation which does nothing! :) I've come across many places where the following is either documented or implemented: if you want to prevent merges, set mergeFactor to a high enough value. I think a NoOpMergePolicy is just as good, and can really allow you to disable merges (rather than, say, setting mergeFactor to Int.MAX_VAL). As such, NoOpMergePolicy will be introduced as a singleton, and can be used for convenience purposes only. Also, it's important for Parallel Index, because I'd like the slices to never do any merges unless ParallelWriter decides so - so they should be set w/ that MP. I have a patch ready; waiting for LUCENE-2320 to go in so that I don't need to change it afterwards. About the name - I like it, but suggestions are welcome. I thought of NullMergePolicy, but I don't like 'Null' used for a NoOp. -- This message is automatically generated by JIRA.
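A minimal sketch of what such a do-nothing policy could look like, written against a deliberately simplified interface (Lucene's real MergePolicy API is larger and different; every name here is illustrative only):

```java
import java.util.Collections;
import java.util.List;

// Simplified stand-in for Lucene's MergePolicy contract: a policy is
// asked which segments should be merged. (Illustrative only.)
interface SimpleMergePolicy {
    List<String> findMerges(List<String> segments);
}

// The proposed NoOp policy: a singleton that never selects any merges,
// which truly disables merging rather than merely making it unlikely
// (as a huge mergeFactor would).
final class NoMergePolicySketch implements SimpleMergePolicy {
    public static final NoMergePolicySketch INSTANCE = new NoMergePolicySketch();

    private NoMergePolicySketch() {} // singleton: use INSTANCE

    @Override
    public List<String> findMerges(List<String> segments) {
        return Collections.emptyList(); // never merge anything
    }
}
```

The singleton shape matches the convenience-only intent in the issue: there is no state, so one shared instance suffices.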
Contrib tests fail if core jar is not up to date
Hi, I've noticed that sometimes, after I run test-core and test-contrib and then change core code, test-contrib fails with NoSuchMethodError and the like. I noticed that core.jar exists under build, and I assumed it's used by test-contrib and probably is not recreated after core code has changed. I verified this by looking in contrib-build.xml, which defines a property lucene.jar.present that is set to true if the jar is ... well, present. I believe that is the reason for these failures. I've been thinking about how to resolve this, and I can think of two ways: (1) have test-core always delete that file, but that has two issues: (1.1) it's redundant if the code hasn't changed, and (1.2) it forces you to either jar-core or test-core before you test-contrib if you want to make sure you run with the latest jar; or (2) have test-contrib always call jar-core, which will first delete the file and then re-create it by compiling first. Compiling should not do anything if the code hasn't changed, so the only waste would be creating the .jar, but I think that's quite fast? Does anyone with more Ant skills than me know of a better way to detect from test-contrib that core code has changed and only then rebuild the jar? Shai
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847123#action_12847123 ] Jason Rutherglen commented on LUCENE-2312: -- For the skip list, we could reuse what we have (ie, DefaultSkipListReader), though we'd need to choose a default number of docs, pulled out of thin air, as there's no way to guesstimate per term beforehand. Or we can have a single-level skip list (more like an index) and binary search to find the value we're looking for in the skip list (assuming we have an int array instead of storing vints).

Search on IndexWriter's RAM Buffer -- Key: LUCENE-2312 URL: https://issues.apache.org/jira/browse/LUCENE-2312 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 3.0.1 Reporter: Jason Rutherglen Assignee: Michael Busch Fix For: 3.1 In order to offer users near-realtime search without incurring an indexing performance penalty, we can implement search on IndexWriter's RAM buffer. This is the buffer that is filled in RAM as documents are indexed. Currently the RAM buffer is flushed to the underlying directory (usually disk) before being made searchable. Today's Lucene-based NRT systems must incur the cost of merging segments, which can slow indexing. Michael Busch has good suggestions regarding how to handle deletes using max doc ids. https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 The area that isn't fully fleshed out is the terms dictionary, which needs to be sorted prior to queries executing. Currently IW implements a specialized hash table. Michael B has a suggestion here: https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. 
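The "single level skip list (more like an index) and binary search" idea from the comment above could look roughly like this - a sketch with invented names, assuming the skip data is a plain int array holding the last docID covered by each block of postings:

```java
import java.util.Arrays;

// Sketch of a single-level skip index: docIDs sampled once per block of
// postings, binary-searched to find the block where a scan for a target
// doc should begin. Names are illustrative, not Lucene's.
class SingleLevelSkipIndex {
    private final int[] lastDocPerBlock; // last docID covered by each block, ascending

    SingleLevelSkipIndex(int[] lastDocPerBlock) {
        this.lastDocPerBlock = lastDocPerBlock;
    }

    // Returns the index of the first block whose last docID >= target,
    // i.e. the block where a linear scan for `target` should start.
    int blockFor(int target) {
        int idx = Arrays.binarySearch(lastDocPerBlock, target);
        // binarySearch returns (-(insertionPoint) - 1) on a miss.
        return idx >= 0 ? idx : -idx - 1;
    }
}
```

With vints this binary search wouldn't be possible (entries aren't fixed-width), which is why the comment conditions the idea on storing plain ints.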
Re: Contrib tests fail if core jar is not up to date
In addition to what Shai mentioned, I wanted to say that there are other oddities about how the contrib tests run in ant. For example, I'm not sure why we create the junitfailed.flag files (I think it has something to do with detecting top-level that a single contrib failed). I noticed this when working on https://issues.apache.org/jira/browse/LUCENE-1709, as I guess we should really fix it before doing that issue. -- Robert Muir rcm...@gmail.com
RE: Contrib tests fail if core jar is not up to date
Hi Shai, there is no way to do this with ant (detecting code change). The ant script *always* builds the jar file. In this case, it is just missing the dependency on jar-core in test-contrib. Alternatively, test-contrib should not use the jar file at all and simply add build/classes/java to the classpath. The fix is simple; I can do that tomorrow. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de From: Shai Erera [mailto:ser...@gmail.com] Sent: Thursday, March 18, 2010 10:34 PM To: java-dev@lucene.apache.org Subject: Contrib tests fail if core jar is not up to date
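Uwe's two options could be sketched as Ant fragments along these lines (target and property names are assumed from the discussion, not verified against Lucene's actual build files):

```xml
<!-- Option 1 (sketch): make test-contrib depend on jar-core, so the
     core jar is rebuilt before contrib tests run. -->
<target name="test-contrib" depends="jar-core">
  <!-- run the contrib test suites here -->
</target>

<!-- Option 2 (sketch): skip the jar entirely and put the compiled core
     classes on the contrib test classpath. -->
<path id="contrib.test.classpath">
  <pathelement location="${common.dir}/build/classes/java"/>
</path>
```

Option 2 avoids jar-creation time entirely, at the cost of contrib tests no longer exercising the packaged artifact.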
Re: Contrib tests fail if core jar is not up to date
: In addition to what Shai mentioned, I wanted to say that there are : other oddities about how the contrib tests run in ant. For example, : I'm not sure why we create the junitfailed.flag files (I think it has : something to do with detecting top-level that a single contrib : failed). Correct ... even if one contrib fails, test-contrib attempts to run the tests for all the other contribs, and then fails if any junitfailed.flag files are found in any contribs. The assumption was that if you were specifically testing a single contrib you'd be using the contrib-specific build from its own directory, and it would still fail fast -- it's only if you run test-contrib from the top level that it ignores when ant test fails for individual contribs, and then reports the failure at the end. It's a hack, but it's a useful hack for getting nightly builds that can report on the tests for all contribs, even if the first one fails (it's less useful when one contrib depends on another, but that's a more complex issue). -Hoss
Re: Contrib tests fail if core jar is not up to date
On Thu, Mar 18, 2010 at 5:50 PM, Chris Hostetter hossman_luc...@fucit.org wrote: It's a hack, but it's a useful hack for getting nightly builds that can report on the tests for all contribs, even if the first one fails (it's less useful when one contrib depends on another, but that's a more complex issue) -Hoss Hoss, thanks, that makes sense. -- Robert Muir rcm...@gmail.com
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847140#action_12847140 ] Jason Rutherglen commented on LUCENE-2312: -- I'm hitting a weird error where, after executing a ram buf term docs iteration, adding some docs, then closing the DWs and the writer, there's an exception which indicates some unknown (to me) state was modified because of the term docs iteration. Or maybe it's obvious? :)
{code}
org.apache.lucene.index.CorruptIndexException: docs out of order (-2147483648 <= 2147483647 )
at org.apache.lucene.index.FormatPostingsDocsWriter.addDoc(FormatPostingsDocsWriter.java:76)
at org.apache.lucene.index.FreqProxTermsWriter.appendPostings(FreqProxTermsWriter.java:209)
at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:127)
at org.apache.lucene.index.TermsHash.flush(TermsHash.java:145)
at org.apache.lucene.index.DocInverter.flush(DocInverter.java:72)
at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:64)
at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:1185)
at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3824)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3733)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3712)
at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1807)
{code}
[jira] Commented: (LUCENE-2127) Improved large result handling
[ https://issues.apache.org/jira/browse/LUCENE-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847146#action_12847146 ] Jason Rutherglen commented on LUCENE-2127: -- What's the status of this one? I'm quasi-interested in getting it into Solr.

Improved large result handling -- Key: LUCENE-2127 URL: https://issues.apache.org/jira/browse/LUCENE-2127 Project: Lucene - Java Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: LUCENE-2127.patch, LUCENE-2127.patch Per http://search.lucidimagination.com/search/document/350c54fc90d257ed/lots_of_results#fbb84bd297d15dd5, it would be nice to offer some other Collectors that are better at handling a really large number of results. This could be implemented in a variety of ways via Collectors. For instance, we could have a raw collector that does no sorting and just returns the ScoreDocs, or we could do as Mike suggests and have Collectors that have heuristics about memory tradeoffs and only heapify when appropriate. -- This message is automatically generated by JIRA.
[jira] Updated: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-2312: - Comment: was deleted (was: I'm hitting a weird error where, after executing a ram buf term docs iteration, adding some docs, then closing the DWs and the writer, there's an exception which indicates some unknown (to me) state was modified because of the term docs iteration. Or maybe it's obvious? :)
{code}
org.apache.lucene.index.CorruptIndexException: docs out of order (-2147483648 <= 2147483647 )
at org.apache.lucene.index.FormatPostingsDocsWriter.addDoc(FormatPostingsDocsWriter.java:76)
at org.apache.lucene.index.FreqProxTermsWriter.appendPostings(FreqProxTermsWriter.java:209)
at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:127)
at org.apache.lucene.index.TermsHash.flush(TermsHash.java:145)
at org.apache.lucene.index.DocInverter.flush(DocInverter.java:72)
at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:64)
at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:1185)
at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:3824)
at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:3733)
at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:3712)
at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:1807)
{code})
[jira] Assigned: (LUCENE-2323) reorganize contrib modules
[ https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir reassigned LUCENE-2323: --- Assignee: Robert Muir

reorganize contrib modules -- Key: LUCENE-2323 URL: https://issues.apache.org/jira/browse/LUCENE-2323 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Reporter: Robert Muir Assignee: Robert Muir Attachments: LUCENE-2323.patch It would be nice to reorganize the contrib modules so that they are bundled together by functionality. For example: * the wikipedia contrib is a tokenizer; I think it really belongs in contrib/analyzers * there are two highlighters; I think they could be one highlighters package * there are many queryparsers and queries in different places in contrib -- This message is automatically generated by JIRA.
Re: Contrib tests fail if core jar is not up to date
Uwe, (1) The problem is not the missing dependency, but rather the use of lucene.jar.present, so you'll need to remove that as well. (2) Adding build/classes/java is not enough - you'll need to add a target dependency on compile-core or something. I guess you already know that; just pointing it out :). Thanks for taking care of this, Shai