[jira] Commented: (LUCENE-1016) TermVectorAccessor, transparent vector space access
[ https://issues.apache.org/jira/browse/LUCENE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558935#action_12558935 ] Karl Wettin commented on LUCENE-1016: - {quote} I'm curious if the build part of this would be faster than reanalyzing a document. {quote} It is a slow process on an index with many terms. Each one has to be iterated and matched against the document number. {quote} Just thinking out loud, but I have been wondering about a Highlighter that uses the new TermVectorMapper, but using that doesn't account for non-TermVector-based Documents that need to be analyzed. Was thinking this might account for both cases, all through the TermVectorMapper mechanism. Just doesn't seem like it would be very fast. {quote} This patch is mostly about when you don't have access to the source data. It was used together with a TermVectorMappingCachedTokenStreamFactory to extract re-indexable documents from any directory. If you think of this piece of code and the highlighter together, I would consider something else, perhaps a tool that could add the term vector to all documents missing one in a single iteration sweep of the index. I know very little about the file format and the highlighter, though. > TermVectorAccessor, transparent vector space access > > > Key: LUCENE-1016 > URL: https://issues.apache.org/jira/browse/LUCENE-1016 > Project: Lucene - Java > Issue Type: New Feature > Components: Term Vectors >Affects Versions: 2.2 >Reporter: Karl Wettin >Priority: Minor > Attachments: LUCENE-1016.txt > > > This class visits a TermVectorMapper and populates it with information > transparently, either by passing it down to the default term vector cache (documents > indexed with Field.TermVector) or by resolving the inverted index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
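The slow sweep Karl describes can be sketched as follows. This is an illustrative simulation only (the Map-based index, class, and method names are invented for the example, not the Lucene API); it shows why the cost grows with the total number of terms in the index rather than with the size of the one document being rebuilt:

```java
import java.util.*;

// Sketch (not Lucene API): rebuilding a term vector from the inverted index.
// A real TermEnum/TermDocs sweep has the same shape: visit EVERY term in the
// index and scan its postings for the one target document.
public class TermVectorRebuild {

    public static List<String> rebuild(SortedMap<String, int[]> invertedIndex, int targetDoc) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, int[]> posting : invertedIndex.entrySet()) { // every term
            for (int doc : posting.getValue()) {                            // every posting
                if (doc == targetDoc) {
                    terms.add(posting.getKey());
                    break;
                }
            }
        }
        return terms; // the document's terms, i.e. its reconstructed vector
    }

    // tiny hypothetical index: term -> sorted doc numbers
    static SortedMap<String, int[]> demoIndex() {
        SortedMap<String, int[]> idx = new TreeMap<>();
        idx.put("apache", new int[]{0, 2});
        idx.put("lucene", new int[]{0, 1, 2});
        idx.put("vector", new int[]{1});
        return idx;
    }

    public static void main(String[] args) {
        System.out.println(rebuild(demoIndex(), 1)); // terms of doc 1
    }
}
```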
[jira] Commented: (LUCENE-205) [PATCH] Patches for RussianAnalyzer
[ https://issues.apache.org/jira/browse/LUCENE-205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558922#action_12558922 ] Vladimir Yuryev commented on LUCENE-205: Hi! I agree with you that CP1251 is a small problem when you consider the shortcomings of RussianAnalyzer as a whole. For example, the grammatical analysis of Russian words is done incorrectly, or only approximately by analogy with English, and so on. Correct word analysis would give faster word search and other advantages in the analyzer's operation. So I also think your remark is right. Vladimir Yuryev. * "Grant Ingersoll (JIRA)" <[EMAIL PROTECTED]> [Sat, 12 Jan 2008 15:03:35 https://issues.apache.org/jira/browse/LUCENE-205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel https://issues.apache.org/jira/browse/LUCENE-205 -- Vladimir Yuryev. > [PATCH] Patches for RussianAnalyzer > --- > > Key: LUCENE-205 > URL: https://issues.apache.org/jira/browse/LUCENE-205 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: CVS Nightly - Specify date in submission > Environment: Operating System: other > Platform: Other >Reporter: Vladimir Yuryev >Priority: Minor > Attachments: RussianAnalyzer.patch.txt, > RussianLetterTokenizer.patch.txt, RussianLowerCaseFilter.patch.txt, > RussianStemFilter.patch.txt, TestRussianAnalyzer.patch.txt > > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries
On Jan 14, 2008, at 4:49 PM, Mark Miller wrote: While the overall framework of LUCENE-663 appears similar to the current contrib Highlighter, the code is actually quite different and I do not think it handles as many corner cases in its current state. LUCENE-663 supports PhraseQuerys by implementing 'special' search logic that inspects positional information to make sure the Tokens from a PhraseQuery are in order. I am not sure how exact this logic is compared to Lucene's PhraseQuery search logic, but a cursory look makes me think it's not complete. It almost looks to me like it only does in-order with simple slop (not edit distance)... I am too lazy to check further, though, and I may have missed something. Also, LUCENE-663 does not support Span queries. This patch differs in that it fits the current Highlighter framework without modifying it, and it uses Lucene's own internal search logic to identify Spans for highlighting. PhraseQueries are handled by a SpanQuery approximation. As far as PhraseQuery/SpanQuery highlighting, I don't think any of the other Highlighter packages offer much. I think that things could be done a little faster, but that would require abandoning the current framework, and with all of the corner cases it now supports, I'd hate to see that. The other Highlighter code that is worth consideration is LUCENE-644. It does abandon the current Highlighter framework and goes with an approach that is much more efficient for larger documents: instead of attacking the problem by spinning through all of the document tokens and comparing to query tokens, 644 just looks at the tokens from the query and grabs the original text using the offsets from those tokens. This is darn fast, but doesn't go well with positional highlighting, and I wonder how well it supports all of the corner cases that arise with overlapping tokens and whatnot. Hmm, I'm beginning to think that the performance issue may be overcome to some extent with the new TermVectorMapper stuff. 
Basic idea is that you construct a highlighter that does the appropriate highlighting as the TV is being loaded from disk through the Map function. This would save having to go back through all the tokens a second time, but probably has other issues. It's just a thought in my head at this point. At a minimum, I think the TVM could speed up the TokenSources part that creates the TokenStream based on the TermVector. At any rate, I am going to think some more on it. -Grant - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
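The offset-driven approach attributed to LUCENE-644 above can be sketched roughly like this. The class and method are hypothetical (in real code the offsets would come from term vectors stored with offsets, e.g. TermVectorOffsetInfo); the point is that highlighting becomes simple string splicing once you have the matched terms' offsets, with no second pass over the document's tokens:

```java
// Sketch: splice highlight tags straight into the original text using the
// (startOffset, endOffset) pairs recorded for the matched query terms.
// Offsets are assumed sorted and non-overlapping in this simple version.
public class OffsetHighlighter {

    public static String highlight(String text, int[][] offsets) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (int[] span : offsets) {
            out.append(text, pos, span[0]);                        // untouched prefix
            out.append("<B>").append(text, span[0], span[1]).append("</B>");
            pos = span[1];
        }
        out.append(text.substring(pos));                           // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "span queries in lucene";
        // offsets as a term vector with offsets would report them for "span" and "lucene"
        System.out.println(highlight(text, new int[][]{{0, 4}, {16, 22}}));
    }
}
```

This is fast precisely because nothing is re-tokenized, which matches the trade-off discussed above: without positions, in-order phrase constraints cannot be checked this way.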
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558856#action_12558856 ] Michael Busch commented on LUCENE-584: -- I think I understand now which problems you had when you wanted to change BooleanFilter and xml-query-parser to use the new Filter APIs. BooleanFilter is optimized to utilize BitSets for performing boolean operations fast. Now if we change BooleanFilter to use the new DocIdSetIterator, then it can't use the fast BitSet operations (e. g. union for or, intersect for and) anymore. Now we can introduce BitSetFilter as you suggested and what I did in the take4 patch. But here's the problem: Introducing subclasses of Filter doesn't play nicely with the caching mechanism in Lucene. For example: if we change BooleanFilter to only work with BitSetFilters, then it won't work with a CachingWrapperFilter anymore, because CachingWrapperFilter extends Filter. Then we would have to introduce new CachingWrapper***Filter, for the different kinds of Filter subclasses, which is a bad design as Mark pointed out in his comment: https://issues.apache.org/jira/browse/LUCENE-584?focusedCommentId=12547901#action_12547901 One solution would be to add a getBitSet() method to DocIdBitSet. DocIdBitSet is a new class that is basically just a wrapper around a Java BitSet and provides a DocIdSetIterator to access the BitSet. Then BooleanFilter could do something like this: {code:java} DocIdSet docIdSet = filter.getDocIdSet(); if (docIdSet instanceof DocIdBitSet) { BitSet bits = ((DocIdBitSet) docIdSet).getBitSet(); ... // existing code } else { throw new UnsupportedOperationException("BooleanFilter only supports Filters that use DocIdBitSet."); } {code} But then, changing the core filters to use OpenBitSets instead of Java BitSets is technically an API change, because BooleanFilter would not work anymore with the core filters. 
So if we took this approach we would have to wait until 3.0 to move the core from BitSet to OpenBitSet and also change BooleanFilter then to use OpenBitSets. BooleanFilter could then also work with either of the two BitSet implementations, but probably not with those two mixed. Any feedback about this is very welcome. I'll try to further think about how to marry the new Filter API, caching mechanism and Filter implementations like BooleanFilter nicely. > Decouple Filter from BitSet > --- > > Key: LUCENE-584 > URL: https://issues.apache.org/jira/browse/LUCENE-584 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.0.1 >Reporter: Peter Schäfer >Assignee: Michael Busch >Priority: Minor > Fix For: 2.4 > > Attachments: bench-diff.txt, bench-diff.txt, > ContribQueries20080111.patch, lucene-584-take2.patch, > lucene-584-take3-part1.patch, lucene-584-take3-part2.patch, > lucene-584-take4-part1.patch, lucene-584-take4-part2.patch, lucene-584.patch, > Matcher-20070905-2default.patch, Matcher-20070905-3core.patch, > Matcher-20071122-1ground.patch, Some Matchers.zip, Test20080111.patch > > > {code} > package org.apache.lucene.search; > public abstract class Filter implements java.io.Serializable > { > public abstract AbstractBitSet bits(IndexReader reader) throws IOException; > } > public interface AbstractBitSet > { > public boolean get(int index); > } > {code} > It would be useful if the method =Filter.bits()= returned an abstract > interface, instead of =java.util.BitSet=. > Use case: there is a very large index, and, depending on the user's > privileges, only a small portion of the index is actually visible. > Sparsely populated =java.util.BitSet=s are not efficient and waste lots of > memory. It would be desirable to have an alternative BitSet implementation > with smaller memory footprint. > Though it _is_ possible to derive classes from =java.util.BitSet=, it was > obviously not designed for that purpose. 
> That's why I propose to use an interface instead. The default implementation > could still delegate to =java.util.BitSet=. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
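The fast path Michael wants to preserve can be illustrated with plain java.util.BitSet (a stand-in for the bits a Filter would return; not Lucene code): boolean composition is a couple of word-wise operations, rather than a merge walk over two DocIdSetIterators.

```java
import java.util.BitSet;

// Sketch of the BitSet fast path BooleanFilter relies on: union for OR,
// intersection for AND, each a single bulk bit operation.
public class BitSetBoolean {

    public static BitSet and(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.and(b);          // intersection: docs matching both filters
        return r;
    }

    public static BitSet or(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.or(b);           // union: docs matching either filter
        return r;
    }

    public static void main(String[] args) {
        BitSet f1 = new BitSet(); f1.set(1); f1.set(3); f1.set(5);
        BitSet f2 = new BitSet(); f2.set(3); f2.set(4); f2.set(5);
        System.out.println(and(f1, f2)); // {3, 5}
        System.out.println(or(f1, f2));  // {1, 3, 4, 5}
    }
}
```

An iterator-only API has to advance two cursors in lockstep to compute the same intersection, which is why losing access to the underlying BitSet matters for BooleanFilter.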
[jira] Commented: (LUCENE-494) Analyzer for preventing overload of search service by queries with common terms in large indexes
[ https://issues.apache.org/jira/browse/LUCENE-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558854#action_12558854 ] Mark Harwood commented on LUCENE-494: - I personally don't use this but others may. It was easier to solve my particular problem by adding stop words to my XSL query templates (I added support to the XMLQueryParser for the "FuzzyLikeThisQuery" tag to take stop words). This was more about ease of configuration in my particular app. I know Nutch has something similar implemented elsewhere - maybe in the query parser. I also had the notion that wrapping IndexReader to auto-cache TermDocs for super-popular terms using a BitSet would be a good way to avoid the IO overhead. This BitSet wouldn't help resolve positional queries, e.g. phrase/span queries, which need a TermPositions implementation, but it would work for straight TermQueries. > Analyzer for preventing overload of search service by queries with common > terms in large indexes > > > Key: LUCENE-494 > URL: https://issues.apache.org/jira/browse/LUCENE-494 > Project: Lucene - Java > Issue Type: New Feature > Components: Analysis >Affects Versions: 2.4 >Reporter: Mark Harwood >Assignee: Grant Ingersoll >Priority: Minor > Attachments: QueryAutoStopWordAnalyzer.java, > QueryAutoStopWordAnalyzerTest.java > > > An analyzer used primarily at query time to wrap another analyzer and provide > a layer of protection > which prevents very common words from being passed into queries. For very > large indexes the cost > of reading TermDocs for a very common word can be high. This analyzer was > created after experience with > a 38 million doc index which had a term in around 50% of docs and was causing > TermQueries for > this term to take 2 seconds. > Use the various "addStopWords" methods in this class to automate the > identification and addition of > stop words found in an already existing index. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
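Mark's auto-caching notion might look roughly like the sketch below. Everything here is invented for illustration (the class name, the df threshold parameter, and the Map standing in for on-disk TermDocs): postings for terms whose document frequency crosses the threshold are materialized once into a BitSet, so later TermQuery-style lookups skip the IO entirely.

```java
import java.util.*;

// Sketch: cache the doc-id sets of super-popular terms as BitSets.
// Positions are NOT cached, so phrase/span queries would still need
// TermPositions, exactly as the comment notes.
public class PopularTermCache {
    private final Map<String, int[]> postings;        // simulated on-disk TermDocs
    private final Map<String, BitSet> cache = new HashMap<>();
    private final int dfThreshold;                    // "super-popular" cutoff
    public int diskReads = 0;                         // counts simulated IO

    public PopularTermCache(Map<String, int[]> postings, int dfThreshold) {
        this.postings = postings;
        this.dfThreshold = dfThreshold;
    }

    public BitSet termDocs(String term) {
        BitSet cached = cache.get(term);
        if (cached != null) return cached;            // cache hit: no IO
        diskReads++;                                  // simulated disk access
        int[] docs = postings.getOrDefault(term, new int[0]);
        BitSet bits = new BitSet();
        for (int d : docs) bits.set(d);
        if (docs.length >= dfThreshold) cache.put(term, bits); // only popular terms
        return bits;
    }

    public static void main(String[] args) {
        Map<String, int[]> p = new HashMap<>();
        p.put("the", new int[]{0, 1, 2, 3});
        PopularTermCache c = new PopularTermCache(p, 3);
        c.termDocs("the"); c.termDocs("the");
        System.out.println(c.diskReads); // second lookup came from the cache
    }
}
```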
[jira] Issue Comment Edited: (LUCENE-1016) TermVectorAccessor, transparent vector space access
[ https://issues.apache.org/jira/browse/LUCENE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558825#action_12558825 ] gsingers edited comment on LUCENE-1016 at 1/14/08 2:57 PM: -- I'm curious if the build part of this would be faster than reanalyzing a document. Just thinking out loud, but I have been wondering about a Highlighter that uses the new TermVectorMapper, but using that doesn't account for non-TermVector-based Documents that need to be analyzed. Was thinking this might account for both cases, all through the TermVectorMapper mechanism. Just doesn't seem like it would be very fast. was (Author: gsingers): I'm curious if the build part of this would be faster than reanalyzing a document. Just thinking out loud, but I have been wondering about a Highlighter that uses the new TermVectorMapper, but that doesn't account for non-TermVector-based Documents. Was thinking this might account for both cases. > TermVectorAccessor, transparent vector space access > > > Key: LUCENE-1016 > URL: https://issues.apache.org/jira/browse/LUCENE-1016 > Project: Lucene - Java > Issue Type: New Feature > Components: Term Vectors >Affects Versions: 2.2 >Reporter: Karl Wettin >Priority: Minor > Attachments: LUCENE-1016.txt > > > This class visits a TermVectorMapper and populates it with information > transparently, either by passing it down to the default term vector cache (documents > indexed with Field.TermVector) or by resolving the inverted index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1016) TermVectorAccessor, transparent vector space access
[ https://issues.apache.org/jira/browse/LUCENE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558825#action_12558825 ] Grant Ingersoll commented on LUCENE-1016: - I'm curious if the build part of this would be faster than reanalyzing a document. Just thinking out loud, but I have been wondering about a Highlighter that uses the new TermVectorMapper, but that doesn't account for non-TermVector-based Documents. Was thinking this might account for both cases. > TermVectorAccessor, transparent vector space access > > > Key: LUCENE-1016 > URL: https://issues.apache.org/jira/browse/LUCENE-1016 > Project: Lucene - Java > Issue Type: New Feature > Components: Term Vectors >Affects Versions: 2.2 >Reporter: Karl Wettin >Priority: Minor > Attachments: LUCENE-1016.txt > > > This class visits a TermVectorMapper and populates it with information > transparently, either by passing it down to the default term vector cache (documents > indexed with Field.TermVector) or by resolving the inverted index. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558819#action_12558819 ] Michael Goddard commented on LUCENE-794: Mark, I've still got a little work to do on it, but would like to also include support for highlighting of RangeQuery within SpanNearQuery. I have a new SpanQuery subclass which helps, and will post that to see if it merits inclusion within Lucene. In conjunction with that, I'd have one last "else if" clause to add to the patch covered by this issue. Basically, I'm trying to make a case for the work covered in this Jira issue being committed, since it's very useful to me. > Extend contrib Highlighter to properly support phrase queries and span queries > -- > > Key: LUCENE-794 > URL: https://issues.apache.org/jira/browse/LUCENE-794 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Mark Miller >Priority: Minor > Attachments: spanhighlighter.patch, spanhighlighter10.patch, > spanhighlighter11.patch, spanhighlighter12.patch, spanhighlighter2.patch, > spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter6.patch, > spanhighlighter7.patch, spanhighlighter8.patch, spanhighlighter9.patch, > spanhighlighter_patch_4.zip > > > This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter > package that scores just like QueryScorer, but scores a 0 for Terms that did > not cause the Query hit. This gives 'actual' hit highlighting for the range > of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts > to fragment without breaking up Spans. > See http://issues.apache.org/jira/browse/LUCENE-403 for some background. > There is a dependency on MemoryIndex. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries
While the overall framework of LUCENE-663 appears similar to the current contrib Highlighter, the code is actually quite different and I do not think it handles as many corner cases in its current state. LUCENE-663 supports PhraseQuerys by implementing 'special' search logic that inspects positional information to make sure the Tokens from a PhraseQuery are in order. I am not sure how exact this logic is compared to Lucene's PhraseQuery search logic, but a cursory look makes me think it's not complete. It almost looks to me like it only does in-order with simple slop (not edit distance)... I am too lazy to check further, though, and I may have missed something. Also, LUCENE-663 does not support Span queries. This patch differs in that it fits the current Highlighter framework without modifying it, and it uses Lucene's own internal search logic to identify Spans for highlighting. PhraseQueries are handled by a SpanQuery approximation. As far as PhraseQuery/SpanQuery highlighting, I don't think any of the other Highlighter packages offer much. I think that things could be done a little faster, but that would require abandoning the current framework, and with all of the corner cases it now supports, I'd hate to see that. The other Highlighter code that is worth consideration is LUCENE-644. It does abandon the current Highlighter framework and goes with an approach that is much more efficient for larger documents: instead of attacking the problem by spinning through all of the document tokens and comparing to query tokens, 644 just looks at the tokens from the query and grabs the original text using the offsets from those tokens. This is darn fast, but doesn't go well with positional highlighting, and I wonder how well it supports all of the corner cases that arise with overlapping tokens and whatnot. 
- Mark Grant Ingersoll (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558784#action_12558784 ] Grant Ingersoll commented on LUCENE-794: How should this relate to LUCENE-663? Seems like that one also covers other kinds of queries? I'm no expert in highlighting, but it seems like there is at least 3 different issues in JIRA for enabling things like phrase queries, etc. Should we try to consolidate these? Extend contrib Highlighter to properly support phrase queries and span queries -- Key: LUCENE-794 URL: https://issues.apache.org/jira/browse/LUCENE-794 Project: Lucene - Java Issue Type: Improvement Components: Other Reporter: Mark Miller Priority: Minor Attachments: spanhighlighter.patch, spanhighlighter10.patch, spanhighlighter11.patch, spanhighlighter12.patch, spanhighlighter2.patch, spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter6.patch, spanhighlighter7.patch, spanhighlighter8.patch, spanhighlighter9.patch, spanhighlighter_patch_4.zip This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter package that scores just like QueryScorer, but scores a 0 for Terms that did not cause the Query hit. This gives 'actual' hit highlighting for the range of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts to fragment without breaking up Spans. See http://issues.apache.org/jira/browse/LUCENE-403 for some background. There is a dependency on MemoryIndex. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558803#action_12558803 ] Grant Ingersoll commented on LUCENE-794: Never mind, I went back and read the thread at http://lucene.markmail.org/message/p4gfxewk6jcqfxxj?q=highlighter+list:org%2Eapache%2Elucene%2Ejava-user which I think accounts for this approach and makes sense to me. > Extend contrib Highlighter to properly support phrase queries and span queries > -- > > Key: LUCENE-794 > URL: https://issues.apache.org/jira/browse/LUCENE-794 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Mark Miller >Priority: Minor > Attachments: spanhighlighter.patch, spanhighlighter10.patch, > spanhighlighter11.patch, spanhighlighter12.patch, spanhighlighter2.patch, > spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter6.patch, > spanhighlighter7.patch, spanhighlighter8.patch, spanhighlighter9.patch, > spanhighlighter_patch_4.zip > > > This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter > package that scores just like QueryScorer, but scores a 0 for Terms that did > not cause the Query hit. This gives 'actual' hit highlighting for the range > of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts > to fragment without breaking up Spans. > See http://issues.apache.org/jira/browse/LUCENE-403 for some background. > There is a dependency on MemoryIndex. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558784#action_12558784 ] Grant Ingersoll commented on LUCENE-794: How should this relate to LUCENE-663? Seems like that one also covers other kinds of queries? I'm no expert in highlighting, but it seems like there are at least 3 different issues in JIRA for enabling things like phrase queries, etc. Should we try to consolidate these? > Extend contrib Highlighter to properly support phrase queries and span queries > -- > > Key: LUCENE-794 > URL: https://issues.apache.org/jira/browse/LUCENE-794 > Project: Lucene - Java > Issue Type: Improvement > Components: Other >Reporter: Mark Miller >Priority: Minor > Attachments: spanhighlighter.patch, spanhighlighter10.patch, > spanhighlighter11.patch, spanhighlighter12.patch, spanhighlighter2.patch, > spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter6.patch, > spanhighlighter7.patch, spanhighlighter8.patch, spanhighlighter9.patch, > spanhighlighter_patch_4.zip > > > This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter > package that scores just like QueryScorer, but scores a 0 for Terms that did > not cause the Query hit. This gives 'actual' hit highlighting for the range > of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts > to fragment without breaking up Spans. > See http://issues.apache.org/jira/browse/LUCENE-403 for some background. > There is a dependency on MemoryIndex. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Lucene 2.3 RC3 available for testing
Hi all, I just uploaded Lucene 2.3 RC3 to: http://people.apache.org/~buschmi/staging_area/lucene_2_3/ RC3 fixes a problem in the indexer that could cause it to hang after a disk full exception occurred. (see https://issues.apache.org/jira/browse/LUCENE-1130 for details). Please switch to RC3 and keep testing! -Michael - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Assigned: (LUCENE-1131) Add numDeletedDocs to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic reassigned LUCENE-1131: Assignee: Otis Gospodnetic > Add numDeletedDocs to IndexReader > - > > Key: LUCENE-1131 > URL: https://issues.apache.org/jira/browse/LUCENE-1131 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Shai Erera >Assignee: Otis Gospodnetic >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-1131.patch > > > Add numDeletedDocs to IndexReader. Basically, the implementation is as simple > as doing: > public int numDeletedDocs() { > return deletedDocs == null ? 0 : deletedDocs.count(); > } > in SegmentReader. > Patch to follow to include in all IndexReader extensions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-400) NGramFilter -- construct n-grams from a TokenStream
[ https://issues.apache.org/jira/browse/LUCENE-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Otis Gospodnetic updated LUCENE-400: Assignee: Otis Gospodnetic Thanks for bringing this up to date. I'll commit it after 2.3 is out. > NGramFilter -- construct n-grams from a TokenStream > --- > > Key: LUCENE-400 > URL: https://issues.apache.org/jira/browse/LUCENE-400 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: unspecified > Environment: Operating System: All > Platform: All >Reporter: Sebastian Kirsch >Assignee: Otis Gospodnetic >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-400.patch, NGramAnalyzerWrapper.java, > NGramAnalyzerWrapperTest.java, NGramFilter.java, NGramFilterTest.java > > > This filter constructs n-grams (token combinations up to a fixed size, > sometimes > called "shingles") from a token stream. > The filter sets start offsets, end offsets and position increments, so > highlighting and phrase queries should work. > Position increments > 1 in the input stream are replaced by filler tokens > (tokens with termText "_" and endOffset - startOffset = 0) in the output > n-grams. (Position increments > 1 in the input stream are usually caused by > removing some tokens, eg. stopwords, from a stream.) > The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache > Commons-Collections. > Filter, test case and an analyzer are attached. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1131) Add numDeletedDocs to IndexReader
[ https://issues.apache.org/jira/browse/LUCENE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558727#action_12558727 ] Otis Gospodnetic commented on LUCENE-1131: -- I think maxDoc() is a cheap call, so calling it twice won't be a performance killer, esp. since this is not something you'd call frequently, I imagine. However, I do agree about numDeletedDocs() being nice for hiding implementation details. > Add numDeletedDocs to IndexReader > - > > Key: LUCENE-1131 > URL: https://issues.apache.org/jira/browse/LUCENE-1131 > Project: Lucene - Java > Issue Type: New Feature >Reporter: Shai Erera >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-1131.patch > > > Add numDeletedDocs to IndexReader. Basically, the implementation is as simple > as doing: > public int numDeletedDocs() { > return deletedDocs == null ? 0 : deletedDocs.count(); > } > in SegmentReader. > Patch to follow to include in all IndexReader extensions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
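The two ways of counting deletions discussed here can be checked against each other in a small simulation. The class below is not the Lucene API (java.util.BitSet.cardinality() stands in for Lucene's BitVector.count()): it shows that deriving the count as maxDoc() - numDocs() and delegating to the deletions bitset, as the patch's numDeletedDocs() does, always agree.

```java
import java.util.BitSet;

// Sketch of a SegmentReader-like view of deletions (not Lucene code).
public class DeletedDocs {
    private final int maxDoc;
    private final BitSet deletedDocs;   // null when the segment has no deletions

    public DeletedDocs(int maxDoc, BitSet deletedDocs) {
        this.maxDoc = maxDoc;
        this.deletedDocs = deletedDocs;
    }

    public int maxDoc()  { return maxDoc; }

    public int numDocs() { return maxDoc - numDeletedDocs(); }

    // the patch's implementation, delegating to the deletions bitset
    public int numDeletedDocs() {
        return deletedDocs == null ? 0 : deletedDocs.cardinality();
    }

    public static void main(String[] args) {
        BitSet del = new BitSet(); del.set(2); del.set(7);
        DeletedDocs reader = new DeletedDocs(10, del);
        System.out.println(reader.numDeletedDocs()); // 2: docs 2 and 7 are deleted
    }
}
```

The arithmetic route costs two method calls; the dedicated method hides whether a deletions structure exists at all, which is the encapsulation benefit noted in the comment.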
[jira] Commented: (LUCENE-400) NGramFilter -- construct n-grams from a TokenStream
[ https://issues.apache.org/jira/browse/LUCENE-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558717#action_12558717 ] Steven Rowe commented on LUCENE-400: Removed the duplicate link (to LUCENE-759), since that issue is about character-level n-grams, and this issue is about word-level n-grams. > NGramFilter -- construct n-grams from a TokenStream > --- > > Key: LUCENE-400 > URL: https://issues.apache.org/jira/browse/LUCENE-400 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: unspecified > Environment: Operating System: All > Platform: All >Reporter: Sebastian Kirsch >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-400.patch, NGramAnalyzerWrapper.java, > NGramAnalyzerWrapperTest.java, NGramFilter.java, NGramFilterTest.java > > > This filter constructs n-grams (token combinations up to a fixed size, > sometimes > called "shingles") from a token stream. > The filter sets start offsets, end offsets and position increments, so > highlighting and phrase queries should work. > Position increments > 1 in the input stream are replaced by filler tokens > (tokens with termText "_" and endOffset - startOffset = 0) in the output > n-grams. (Position increments > 1 in the input stream are usually caused by > removing some tokens, eg. stopwords, from a stream.) > The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache > Commons-Collections. > Filter, test case and an analyzer are attached. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
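The filler-token behaviour described in the issue can be sketched over plain token lists (a hypothetical helper, not the Lucene TokenStream API): position increments > 1, e.g. from a removed stopword, leave holes that are filled with "_" tokens before the word n-grams are formed.

```java
import java.util.*;

// Sketch: build word 2-grams ("shingles") from a token stream, inserting a
// "_" filler wherever a position increment > 1 left a gap in positions.
public class WordNGrams {

    // tokens[i] is paired with posIncr[i]; posIncr > 1 marks a positional hole
    public static List<String> bigrams(String[] tokens, int[] posIncr) {
        List<String> positions = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            for (int gap = posIncr[i]; gap > 1; gap--) positions.add("_"); // fillers
            positions.add(tokens[i]);
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 1 < positions.size(); i++)
            grams.add(positions.get(i) + " " + positions.get(i + 1));
        return grams;
    }

    public static void main(String[] args) {
        // "please divide <stopword> sentence": increment 2 before "sentence"
        System.out.println(bigrams(
            new String[]{"please", "divide", "sentence"},
            new int[]{1, 1, 2}));
    }
}
```

The filler keeps "divide sentence" from being emitted as a bigram, so a phrase query for the two words with nothing between them will not falsely match, which is the point of preserving the gap.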
[jira] Updated: (LUCENE-1127) TokenSources.getTokenStream(Document...)
[ https://issues.apache.org/jira/browse/LUCENE-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1127: Attachment: LUCENE-1127.patch > TokenSources.getTokenStream(Document...) > - > > Key: LUCENE-1127 > URL: https://issues.apache.org/jira/browse/LUCENE-1127 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/* >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Attachments: LUCENE-1127.patch, LUCENE-1127.patch > > > Sometimes, one already has the Document, and just needs to generate a > TokenStream from it, so I am going to add a convenience method to > TokenSources. Sometimes, you also already have just the string, so I will > add a convenience method for that. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Resolved: (LUCENE-1130) Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang
[ https://issues.apache.org/jira/browse/LUCENE-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1130. Resolution: Fixed OK fixed & ported to 2.3 branch! > Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang > > > Key: LUCENE-1130 > URL: https://issues.apache.org/jira/browse/LUCENE-1130 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1130.patch, LUCENE-1130.take2.patch > > > More testing of RC2 ... > I found one case: if you hit disk full during init() in > DocumentsWriter.ThreadState, when we first create the term vectors & > fields writer, subsequent calls to > IndexWriter.add/updateDocument will then hang forever. > What's happening in this case is we are incrementing nextDocID even > though we never call finishDocument (because we "thought" init did not > succeed). Then, when we finish the next document, it will never > actually write because the finishDocument call never happens. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
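The hang described in this issue follows a general pattern: a shared counter is advanced before an operation that can still fail, leaving a reserved slot that is never completed. A minimal sketch of the rollback fix, with hypothetical names rather than the actual DocumentsWriter code:

```java
import java.io.IOException;

// Hypothetical model of the bug pattern: nextDocID reserves a slot before
// init can still fail (e.g. disk full). Without rollback, later documents
// wait forever on the failed slot's finishDocument; rolling the counter back
// in the catch block keeps the docID sequence gap-free.
public class ThreadStateSketch {
    int nextDocID = 0;
    boolean failNext = false;  // simulate disk full on the next init

    int init() throws IOException {
        int docID = nextDocID++;   // reserve the next slot
        try {
            allocateWriters();     // may throw, e.g. on disk full
        } catch (IOException e) {
            nextDocID--;           // roll back the reservation: no hole left behind
            throw e;
        }
        return docID;
    }

    void allocateWriters() throws IOException {
        if (failNext) {
            failNext = false;
            throw new IOException("disk full");
        }
        // placeholder for creating the term vectors & fields writers
    }
}
```

After a failed init, the next successful call reuses the rolled-back docID instead of leaving a permanent gap.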
[jira] Updated: (LUCENE-1128) Add Highlighting benchmark support to contrib/benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1128: Attachment: LUCENE-1128.patch I think this one is good. I have noticed w/ SVN that I was getting things like this from svn stat: {quote} A + src/java/org/apache/lucene/benchmark/byTask/tasks/SearchTravRetHighlightTask.java {quote} which means that SVN thinks there is history for the file. It turns out that comes from copying another file, so I had to remove the file and then re-add it. > Add Highlighting benchmark support to contrib/benchmark > --- > > Key: LUCENE-1128 > URL: https://issues.apache.org/jira/browse/LUCENE-1128 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Attachments: LUCENE-1128.patch, LUCENE-1128.patch, LUCENE-1128.patch > > > I would like to be able to test the performance (speed, initially) of the > Highlighter in a standard way. Patch to follow that adds the Highlighter as > a dependency benchmark and adds in tasks extending the ReadTask to perform > highlighting on retrieved documents. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1130) Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang
[ https://issues.apache.org/jira/browse/LUCENE-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558673#action_12558673 ] Michael Busch commented on LUCENE-1130: --- {quote} Thanks for testing Michael! {quote} I'll forward the thanks to my colleagues, they're doing a great job with testing the 2.3 RCs currently! Thank YOU for the quick fixes, Mike!! > Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang > > > Key: LUCENE-1130 > URL: https://issues.apache.org/jira/browse/LUCENE-1130 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1130.patch, LUCENE-1130.take2.patch > > > More testing of RC2 ... > I found one case: if you hit disk full during init() in > DocumentsWriter.ThreadState, when we first create the term vectors & > fields writer, subsequent calls to > IndexWriter.add/updateDocument will then hang forever. > What's happening in this case is we are incrementing nextDocID even > though we never call finishDocument (because we "thought" init did not > succeed). Then, when we finish the next document, it will never > actually write because the finishDocument call never happens. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1128) Add Highlighting benchmark support to contrib/benchmark
[ https://issues.apache.org/jira/browse/LUCENE-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558669#action_12558669 ] Mark Miller commented on LUCENE-1128: - Is it just me or does this patch seem to assume that a couple of new classes already exist? If so, any chance of getting a clean one? > Add Highlighting benchmark support to contrib/benchmark > --- > > Key: LUCENE-1128 > URL: https://issues.apache.org/jira/browse/LUCENE-1128 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/benchmark >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Attachments: LUCENE-1128.patch, LUCENE-1128.patch > > > I would like to be able to test the performance (speed, initially) of the > Highlighter in a standard way. Patch to follow that adds the Highlighter as > a dependency benchmark and adds in tasks extending the ReadTask to perform > highlighting on retrieved documents. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-550) InstantiatedIndex - faster but memory consuming index
[ https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558640#action_12558640 ] Karl Wettin commented on LUCENE-550: I was poking around in the javadocs of this and came to the conclusion that InstantiatedIndexWriter is deprecated code: it is enough that one can construct an InstantiatedIndex from an optimized IndexReader. This makes all InstantiatedIndexes immutable, which makes the no-locks caveat go away. Also, it is a hassle to make sure that InstantiatedIndexWriter works just as IndexWriter does. In the future, a segmented Directory-facade could be built on top of this, where each InstantiatedIndex is a segment created by an IndexWriter flush. It would potentially be slower to populate, but it would be compatible with everything. Adding more than one segment will require merging and optimizing indices back and forth in RAMDirectories a bit, but InstantiatedIndexes are usually quite small. It feels like much of that code is already there. On the matter of RAM consumption, using a profiler I recently noticed that a 3.2MB directory of 3-5;3-3;3-5 ngrams with term vectors consumed something like 35MB of RAM when loaded into an InstantiatedIndex. > InstantiatedIndex - faster but memory consuming index > - > > Key: LUCENE-550 > URL: https://issues.apache.org/jira/browse/LUCENE-550 > Project: Lucene - Java > Issue Type: New Feature > Components: Store >Affects Versions: 2.0.0 >Reporter: Karl Wettin >Assignee: Grant Ingersoll > Attachments: HitCollectionBench.jpg, > LUCENE-550_20071021_no_core_changes.txt, test-reports.zip > > > Represented as a coupled graph of class instances, this all-in-memory index > store implementation delivers search results up to 100 times faster than > the file-centric RAMDirectory at the cost of greater RAM consumption. > Performance seems to be a little bit better than log2n (binary search). No > real data on that, just my eyes. 
> Populated with a single document InstantiatedIndex is almost, but not quite, > as fast as MemoryIndex. > At 20,000 documents 10-50 characters long InstantiatedIndex outperforms > RAMDirectory some 30x, > 15x at 100 documents of 2000 characters length, > and is linear to RAMDirectory at 10,000 documents of 2000 characters length. > Mileage may vary depending on term saturation. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
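The construction path Karl prefers can be modeled as a one-sweep materialization. The toy below is a hypothetical illustration, not the contrib/instantiated code: an existing inverted index (simulated here as a term-to-postings map) is walked once and frozen into a plain object graph, so the result is immutable and needs no locking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy "coupled graph of class instances" index: built once from a source
// inverted index, then frozen. Immutability is what makes the no-locks
// caveat go away in the comment above.
public class ImmutableInMemoryIndex {
    private final Map<String, List<Integer>> postings;

    ImmutableInMemoryIndex(Map<String, List<Integer>> source) {
        Map<String, List<Integer>> built = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : source.entrySet()) {
            // defensive copy, then freeze each posting list
            built.put(e.getKey(), Collections.unmodifiableList(new ArrayList<>(e.getValue())));
        }
        postings = Collections.unmodifiableMap(built);
    }

    List<Integer> docs(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }
}
```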
[jira] Commented: (LUCENE-1130) Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang
[ https://issues.apache.org/jira/browse/LUCENE-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558638#action_12558638 ] Michael McCandless commented on LUCENE-1130: OK I will commit today. Thanks for testing Michael! > Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang > > > Key: LUCENE-1130 > URL: https://issues.apache.org/jira/browse/LUCENE-1130 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1130.patch, LUCENE-1130.take2.patch > > > More testing of RC2 ... > I found one case: if you hit disk full during init() in > DocumentsWriter.ThreadState, when we first create the term vectors & > fields writer, subsequent calls to > IndexWriter.add/updateDocument will then hang forever. > What's happening in this case is we are incrementing nextDocID even > though we never call finishDocument (because we "thought" init did not > succeed). Then, when we finish the next document, it will never > actually write because the finishDocument call never happens. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-705) CompoundFileWriter should pre-set its file length
[ https://issues.apache.org/jira/browse/LUCENE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558637#action_12558637 ] Michael McCandless commented on LUCENE-705: --- OK I'll test on the major platforms, and take that approach. I'll tentatively target 2.4. > CompoundFileWriter should pre-set its file length > - > > Key: LUCENE-705 > URL: https://issues.apache.org/jira/browse/LUCENE-705 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.1 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > > I've read that if you are writing a large file, it's best to pre-set > the size of the file in advance before you write all of its contents. > This in general minimizes fragmentation and improves IO performance > against the file in the future. > I think this makes sense (intuitively) but I haven't done any real > performance testing to verify. > Java has the java.io.RandomAccessFile.setLength() method (since 1.2) for this. > We can easily fix CompoundFileWriter to call setLength() on the file > it's writing (and add a setLength() method to IndexOutput). The > CompoundFileWriter knows exactly how large its file will be. > Another good thing is: if you are going to run out of disk space, then > the setLength call should fail up front instead of failing when the > compound file is actually written. This has two benefits: first, you > find out sooner that you will run out of disk space, and, second, you > don't fill up the disk down to 0 bytes left (always a frustrating > experience!). Instead you leave what space was available > and throw an IOException. > My one hesitation here is: what if there exists a filesystem out there > that can't handle this call, and it throws an IOException on that > platform? But this is balanced against a possible easy-win improvement > in performance. > Does anyone have any feedback / thoughts / experience relevant to > this? 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
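The standard-library call behind the proposal is RandomAccessFile.setLength() (the original issue text says java.io.File, but File has no such method). A minimal sketch of the pre-allocation idea, independent of the eventual CompoundFileWriter/IndexOutput change:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Pre-sets a file to its final size before any data is written, so an
// out-of-space condition surfaces up front rather than mid-write.
public class PresetLength {
    static long presetLength(File f, long finalSize) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.setLength(finalSize);  // grows (or truncates) the file to finalSize
        }
        return f.length();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("cfs", ".tmp");
        tmp.deleteOnExit();
        System.out.println(presetLength(tmp, 1 << 20));  // 1048576
    }
}
```

On most filesystems the grown region is sparse, so the fragmentation benefit (and whether the call fails early on a full disk) is platform-dependent, which matches the hesitation voiced in the issue.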
[jira] Updated: (LUCENE-705) CompoundFileWriter should pre-set its file length
[ https://issues.apache.org/jira/browse/LUCENE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-705: -- Fix Version/s: 2.4 > CompoundFileWriter should pre-set its file length > - > > Key: LUCENE-705 > URL: https://issues.apache.org/jira/browse/LUCENE-705 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 2.1 >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > > I've read that if you are writing a large file, it's best to pre-set > the size of the file in advance before you write all of its contents. > This in general minimizes fragmentation and improves IO performance > against the file in the future. > I think this makes sense (intuitively) but I haven't done any real > performance testing to verify. > Java has the java.io.RandomAccessFile.setLength() method (since 1.2) for this. > We can easily fix CompoundFileWriter to call setLength() on the file > it's writing (and add a setLength() method to IndexOutput). The > CompoundFileWriter knows exactly how large its file will be. > Another good thing is: if you are going to run out of disk space, then > the setLength call should fail up front instead of failing when the > compound file is actually written. This has two benefits: first, you > find out sooner that you will run out of disk space, and, second, you > don't fill up the disk down to 0 bytes left (always a frustrating > experience!). Instead you leave what space was available > and throw an IOException. > My one hesitation here is: what if there exists a filesystem out there > that can't handle this call, and it throws an IOException on that > platform? But this is balanced against a possible easy-win improvement > in performance. > Does anyone have any feedback / thoughts / experience relevant to > this? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-325) [PATCH] new method expungeDeleted() added to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-325: -- Fix Version/s: 2.4 > [PATCH] new method expungeDeleted() added to IndexWriter > > > Key: LUCENE-325 > URL: https://issues.apache.org/jira/browse/LUCENE-325 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: CVS Nightly - Specify date in submission > Environment: Operating System: Windows XP > Platform: All >Reporter: John Wang >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > Attachments: attachment.txt, IndexWriter.patch, IndexWriter.patch, > TestExpungeDeleted.java > > > We make use of the docIDs in Lucene. I need a way to compact the docIDs in > segments > to remove the "holes" created from doing deletes. The only way to do this is > by > calling IndexWriter.optimize(). This is a very heavy call; for the cases where > the index is large but has a very small number of deleted docs, calling > optimize > is not practical. > I need a new method: expungeDeleted(), which finds all the segments that have > deleted documents and merges only those segments. > I have implemented this method and have discussed with Otis about submitting a > patch. I don't see where I can attach the patch. I will do according to the > patch guideline and email the lucene mailing list. > Thanks > -John > I don't see a place where I can -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Assigned: (LUCENE-325) [PATCH] new method expungeDeleted() added to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless reassigned LUCENE-325: - Assignee: Michael McCandless (was: Lucene Developers) > [PATCH] new method expungeDeleted() added to IndexWriter > > > Key: LUCENE-325 > URL: https://issues.apache.org/jira/browse/LUCENE-325 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: CVS Nightly - Specify date in submission > Environment: Operating System: Windows XP > Platform: All >Reporter: John Wang >Assignee: Michael McCandless >Priority: Minor > Fix For: 2.4 > > Attachments: attachment.txt, IndexWriter.patch, IndexWriter.patch, > TestExpungeDeleted.java > > > We make use of the docIDs in Lucene. I need a way to compact the docIDs in > segments > to remove the "holes" created from doing deletes. The only way to do this is > by > calling IndexWriter.optimize(). This is a very heavy call; for the cases where > the index is large but has a very small number of deleted docs, calling > optimize > is not practical. > I need a new method: expungeDeleted(), which finds all the segments that have > deleted documents and merges only those segments. > I have implemented this method and have discussed with Otis about submitting a > patch. I don't see where I can attach the patch. I will do according to the > patch guideline and email the lucene mailing list. > Thanks > -John > I don't see a place where I can -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-325) [PATCH] new method expungeDeleted() added to IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558636#action_12558636 ] Michael McCandless commented on LUCENE-325: --- I think we should resurrect this: I agree it's useful. I'll take it & tentatively mark it 2.4 (hopefully I can make time by then!). The original patch would simply merge one segment "in place". I think we can improve this a bit by merging any adjacent series of segments that have deletions? This would still preserve docID ordering, but would also accomplish some merging as a side effect (I think a good thing). > [PATCH] new method expungeDeleted() added to IndexWriter > > > Key: LUCENE-325 > URL: https://issues.apache.org/jira/browse/LUCENE-325 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: CVS Nightly - Specify date in submission > Environment: Operating System: Windows XP > Platform: All >Reporter: John Wang >Assignee: Lucene Developers >Priority: Minor > Fix For: 2.4 > > Attachments: attachment.txt, IndexWriter.patch, IndexWriter.patch, > TestExpungeDeleted.java > > > We make use of the docIDs in Lucene. I need a way to compact the docIDs in > segments > to remove the "holes" created from doing deletes. The only way to do this is > by > calling IndexWriter.optimize(). This is a very heavy call; for the cases where > the index is large but has a very small number of deleted docs, calling > optimize > is not practical. > I need a new method: expungeDeleted(), which finds all the segments that have > deleted documents and merges only those segments. > I have implemented this method and have discussed with Otis about submitting a > patch. I don't see where I can attach the patch. I will do according to the > patch guideline and email the lucene mailing list. > Thanks > -John > I don't see a place where I can -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
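The selection policy Michael proposes (merge any adjacent series of segments with deletions, preserving docID order) can be sketched in isolation. This is a hypothetical illustration, not the eventual IndexWriter implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Groups adjacent segments that have deletions into merge candidates.
// Merging only adjacent runs keeps the remaining segments' relative order,
// and therefore the global docID ordering, intact.
public class ExpungeSelection {
    // hasDeletions[i] is true if segment i contains deleted docs.
    // Returns {start, endExclusive} ranges of adjacent segments to merge.
    static List<int[]> mergeRanges(boolean[] hasDeletions) {
        List<int[]> ranges = new ArrayList<>();
        int i = 0;
        while (i < hasDeletions.length) {
            if (hasDeletions[i]) {
                int start = i;
                while (i < hasDeletions.length && hasDeletions[i]) i++;
                ranges.add(new int[] { start, i });  // maximal run with deletions
            } else {
                i++;  // clean segment: left untouched
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        boolean[] segs = { false, true, true, false, true };
        for (int[] r : mergeRanges(segs)) {
            System.out.println(r[0] + ".." + r[1]);  // prints 1..3 then 4..5
        }
    }
}
```

Compared with the original one-segment-in-place patch, each run of length > 1 also accomplishes some ordinary merging as a side effect.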
[jira] Updated: (LUCENE-400) NGramFilter -- construct n-grams from a TokenStream
[ https://issues.apache.org/jira/browse/LUCENE-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-400: --- Lucene Fields: [Patch Available] Fix Version/s: 2.4 Thanks, Steve. I will mark this as 2.4 > NGramFilter -- construct n-grams from a TokenStream > --- > > Key: LUCENE-400 > URL: https://issues.apache.org/jira/browse/LUCENE-400 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Affects Versions: unspecified > Environment: Operating System: All > Platform: All >Reporter: Sebastian Kirsch >Priority: Minor > Fix For: 2.4 > > Attachments: LUCENE-400.patch, NGramAnalyzerWrapper.java, > NGramAnalyzerWrapperTest.java, NGramFilter.java, NGramFilterTest.java > > > This filter constructs n-grams (token combinations up to a fixed size, > sometimes > called "shingles") from a token stream. > The filter sets start offsets, end offsets and position increments, so > highlighting and phrase queries should work. > Position increments > 1 in the input stream are replaced by filler tokens > (tokens with termText "_" and endOffset - startOffset = 0) in the output > n-grams. (Position increments > 1 in the input stream are usually caused by > removing some tokens, eg. stopwords, from a stream.) > The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache > Commons-Collections. > Filter, test case and an analyzer are attached. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1130) Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang
[ https://issues.apache.org/jira/browse/LUCENE-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558536#action_12558536 ] Michael Busch commented on LUCENE-1130: --- Mike, all core & contrib tests pass for me. Also the disk full test that I mentioned passes with your take2 patch. Without the patch it fails with RC2. So +1 for committing it to trunk & 2.3 branch! I'll build RC3 once this is committed. > Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang > > > Key: LUCENE-1130 > URL: https://issues.apache.org/jira/browse/LUCENE-1130 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.3 >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: LUCENE-1130.patch, LUCENE-1130.take2.patch > > > More testing of RC2 ... > I found one case: if you hit disk full during init() in > DocumentsWriter.ThreadState, when we first create the term vectors & > fields writer, subsequent calls to > IndexWriter.add/updateDocument will then hang forever. > What's happening in this case is we are incrementing nextDocID even > though we never call finishDocument (because we "thought" init did not > succeed). Then, when we finish the next document, it will never > actually write because the finishDocument call never happens. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]