[jira] Commented: (LUCENE-1016) TermVectorAccessor, transparent vector space access

2008-01-14 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558935#action_12558935
 ] 

Karl Wettin commented on LUCENE-1016:
-

{quote}
I'm curious if the build part of this would be faster than reanalyzing a 
document.
{quote}

It is a slow process on an index with many terms: each term has to be iterated 
and its postings matched against the document number.
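
The sweep described here can be sketched with a toy in-memory index. This is an illustration only: the `POSTINGS` map stands in for Lucene's TermEnum/TermDocs traversal, and the class and method names are hypothetical, not Lucene's API.

```java
import java.util.*;

// Toy sketch of why resolving a term vector from the inverted index is
// slow: every term in the index is visited, even though the target
// document contains only a few of them.
public class InvertedIndexSweep {

    // term -> document numbers containing that term (stand-in for postings)
    static final Map<String, List<Integer>> POSTINGS = new TreeMap<>();
    static {
        POSTINGS.put("apache", Arrays.asList(0, 2));
        POSTINGS.put("lucene", Arrays.asList(0, 1, 2));
        POSTINGS.put("vector", Arrays.asList(1));
    }

    // O(#terms * postings length): the full-index sweep.
    static List<String> termsForDoc(int doc) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : POSTINGS.entrySet()) {
            if (e.getValue().contains(doc)) result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(termsForDoc(1)); // terms of document 1
    }
}
```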

{quote}
Just thinking out loud, but I have been wondering about a Highlighter that uses 
the new TermVectorMapper, but using that doesn't account for non-TermVector-based 
Documents that need to be analyzed. Was thinking this might account for both 
cases, all through the TermVectorMapper mechanism. Just doesn't seem like it 
would be very fast.
{quote}

This patch is mostly about the case where you don't have access to the source 
data. It was used together with a TermVectorMappingCachedTokenStreamFactory to 
extract re-indexable documents from any directory.

If you are thinking of this piece of code and the Highlighter together, I would 
consider something else instead, perhaps a tool that could add a term vector to 
every document missing one in a single sweep of the index. I know very little 
about the file format and the Highlighter, though.



> TermVectorAccessor, transparent vector space access 
> 
>
> Key: LUCENE-1016
> URL: https://issues.apache.org/jira/browse/LUCENE-1016
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Term Vectors
>Affects Versions: 2.2
>Reporter: Karl Wettin
>Priority: Minor
> Attachments: LUCENE-1016.txt
>
>
> This class visits TermVectorMapper and populates it with information 
> transparent by either passing it down to the default terms cache (documents 
> indexed with Field.TermVector) or by resolving the inverted index.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-205) [PATCH] Patches for RussianAnalyzer

2008-01-14 Thread Vladimir Yuryev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558922#action_12558922
 ] 

Vladimir Yuryev commented on LUCENE-205:


Hi!
I agree with you that CP1251 is a small problem if you consider the 
shortcomings of RussianAnalyzer as a whole. For example, the grammatical 
analysis of Russian words is done incorrectly, or only approximately, as if 
Russian worked like English, and so on. Correct word analysis would give 
faster word search and other advantages in the analyzer. So I also think 
your remark is right.

Vladimir Yuryev.

* "Grant Ingersoll (JIRA)" <[EMAIL PROTECTED]> [Sat, 12 Jan 2008 15:03:35]
  https://issues.apache.org/jira/browse/LUCENE-205

--
Vladimir Yuryev.



> [PATCH] Patches for RussianAnalyzer
> ---
>
> Key: LUCENE-205
> URL: https://issues.apache.org/jira/browse/LUCENE-205
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: other
> Platform: Other
>Reporter: Vladimir Yuryev
>Priority: Minor
> Attachments: RussianAnalyzer.patch.txt, 
> RussianLetterTokenizer.patch.txt, RussianLowerCaseFilter.patch.txt, 
> RussianStemFilter.patch.txt, TestRussianAnalyzer.patch.txt
>
>
>  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries

2008-01-14 Thread Grant Ingersoll


On Jan 14, 2008, at 4:49 PM, Mark Miller wrote:

While the overall framework of LUCENE-663 appears similar to the  
current contrib Highlighter, the code is actually quite different  
and I do not think it handles as many corner cases in its current  
state. LUCENE-663 supports PhraseQuerys by implementing 'special'  
search logic that inspects positional information to make sure the  
Tokens from a PhraseQuery are in order. I am not sure how exact this  
logic is compared to Lucene's PhraseQuery search logic, but a cursory  
look makes me think it's not complete. It almost looks to me like it  
only does in-order matching with simple slop (not edit distance)...  
I am too lazy to check further though, and I may have missed  
something. Also, LUCENE-663 does not support Span queries.


This patch differs in that it fits the current Highlighter framework  
without modifying it, and it uses Lucene's own internal search logic  
to identify Spans for highlighting. PhraseQueries are handled by a  
SpanQuery approximation.


As far as PhraseQuery/SpanQuery highlighting, I don't think any of  
the other Highlighter packages offer much. I think that things could  
be done a little faster, but that would require abandoning the  
current framework, and with all of the corner cases it now supports,  
I'd hate to see that.


The other Highlighter code that is worth consideration is  
LUCENE-644. It does abandon the current Highlighter framework and  
goes with an approach that is much more efficient for larger  
documents: instead of attacking the problem by spinning through all  
of the document tokens and comparing to query tokens, 644 just looks  
at the tokens from the query and grabs the original text using the  
offsets from those tokens. This is darn fast, but doesn't go well  
with positional highlighting, and I wonder how well it supports all  
of the corner cases that arise with overlapping tokens and whatnot.


Hmm, I'm beginning to think that the performance issue may be overcome  
to some extent with the new TermVectorMapper stuff.  Basic idea is  
that you construct a highlighter that does the appropriate  
highlighting as the TV is being loaded from disk through the Map  
function.  This would save having to go back through all the tokens a  
second time, but probably has other issues.  It's just a thought in my  
head at this point.  At a minimum, I think the TVM could speed up the  
TokenSources part that creates the TokenStream based on the TermVector.
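
Grant's idea of doing the highlight bookkeeping inside the mapper callback as the vector loads might look roughly like the sketch below. The `VectorMapper` type here is a simplified stand-in, not Lucene's actual TermVectorMapper signature, and all names are illustrative.

```java
import java.util.*;

// Stand-in for a TermVectorMapper-style callback: collect highlight
// offsets while the term vector is streamed from disk, instead of
// re-walking all document tokens in a second pass.
abstract class VectorMapper {
    abstract void map(String term, int freq, int[] startOffsets, int[] endOffsets);
}

public class HighlightingMapper extends VectorMapper {
    private final Set<String> queryTerms;
    final List<int[]> highlights = new ArrayList<>(); // [start, end] pairs

    HighlightingMapper(Set<String> queryTerms) { this.queryTerms = queryTerms; }

    @Override
    void map(String term, int freq, int[] startOffsets, int[] endOffsets) {
        if (!queryTerms.contains(term)) return; // skip non-query terms early
        for (int i = 0; i < freq; i++) {
            highlights.add(new int[] { startOffsets[i], endOffsets[i] });
        }
    }

    public static void main(String[] args) {
        HighlightingMapper m =
            new HighlightingMapper(new HashSet<>(Arrays.asList("lucene")));
        m.map("apache", 1, new int[]{0}, new int[]{6});
        m.map("lucene", 2, new int[]{7, 20}, new int[]{13, 26});
        System.out.println(m.highlights.size() + " highlight spans");
    }
}
```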


At any rate, I am going to think some more on it.

-Grant

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2008-01-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558856#action_12558856
 ] 

Michael Busch commented on LUCENE-584:
--

I think I understand now which problems you had when you wanted to 
change BooleanFilter and xml-query-parser to use the new Filter APIs.

BooleanFilter is optimized to utilize BitSets for performing boolean
operations fast. Now if we change BooleanFilter to use the new 
DocIdSetIterator, then it can't use the fast BitSet operations (e.g.
union for OR, intersect for AND) anymore. 

Now we can introduce BitSetFilter as you suggested and what I did in
the take4 patch. But here's the problem: Introducing subclasses of 
Filter doesn't play nicely with the caching mechanism in Lucene.
For example: if we change BooleanFilter to only work with 
BitSetFilters, then it won't work with a CachingWrapperFilter anymore,
because CachingWrapperFilter extends Filter. Then we would have to
introduce new CachingWrapper***Filter, for the different kinds of
Filter subclasses, which is a bad design as Mark pointed out in his
comment: 
https://issues.apache.org/jira/browse/LUCENE-584?focusedCommentId=12547901#action_12547901

One solution would be to add a getBitSet() method to DocIdBitSet.
DocIdBitSet is a new class that is basically just a wrapper around a
Java BitSet and provides a DocIdSetIterator to access the BitSet.

Then BooleanFilter could do something like this:
{code:java}
DocIdSet docIdSet = filter.getDocIdSet();
if (docIdSet instanceof DocIdBitSet) {
  BitSet bits = ((DocIdBitSet) docIdSet).getBitSet();
  ... // existing code
} else {
  throw new UnsupportedOperationException(
      "BooleanFilter only supports Filters that use DocIdBitSet.");
}
{code}

But then, changing the core filters to use OpenBitSets instead of
Java BitSets is technically an API change, because BooleanFilter
would not work anymore with the core filters.

So if we took this approach we would have to wait until 3.0 to move
the core from BitSet to OpenBitSet and also change BooleanFilter 
then to use OpenBitSets. BooleanFilter could then also work with
either of the two BitSet implementations, but probably not with the
two mixed.

Any feedback about this is very welcome. I'll try to further think
about how to marry the new Filter API, caching mechanism and Filter
implementations like BooleanFilter nicely.
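
The iterator-based alternative alluded to above, a boolean OR without materializing BitSets, can be sketched as a merge of sorted doc-id streams. The arrays and method names here are illustrative, not the DocIdSetIterator API.

```java
import java.util.*;

// Sketch of a boolean OR over two DocIdSetIterator-style streams:
// merge the sorted doc ids directly, without building a BitSet.
public class DocIdUnion {

    static List<Integer> union(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length || j < b.length) {
            int da = i < a.length ? a[i] : Integer.MAX_VALUE;
            int db = j < b.length ? b[j] : Integer.MAX_VALUE;
            int next = Math.min(da, db);
            out.add(next);
            if (da == next) i++; // advance every stream positioned on next
            if (db == next) j++;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(union(new int[]{1, 4, 7}, new int[]{2, 4, 9}));
    }
}
```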

> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: bench-diff.txt, bench-diff.txt, 
> ContribQueries20080111.patch, lucene-584-take2.patch, 
> lucene-584-take3-part1.patch, lucene-584-take3-part2.patch, 
> lucene-584-take4-part1.patch, lucene-584-take4-part2.patch, lucene-584.patch, 
> Matcher-20070905-2default.patch, Matcher-20070905-3core.patch, 
> Matcher-20071122-1ground.patch, Some Matchers.zip, Test20080111.patch
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-494) Analyzer for preventing overload of search service by queries with common terms in large indexes

2008-01-14 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558854#action_12558854
 ] 

Mark Harwood commented on LUCENE-494:
-

I personally don't use this but others may. It was easier to solve my 
particular problem by adding stop words to my XSL query templates (I added 
support to the XMLQueryParser for the "FuzzyLikeThisQuery" tag to take stop 
words). This was more about ease of configuration in my particular app.

I know Nutch has something similar implemented elsewhere - maybe in the query 
parser.

I also had the notion that wrapping IndexReader to auto-cache TermDocs for 
super-popular terms using a BitSet would be a good way to avoid the I/O 
overhead. This BitSet wouldn't help resolve positional queries (e.g. phrase/span 
queries, which need a TermPositions implementation) but would work for straight 
TermQueries.
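
A minimal sketch of that caching idea, assuming a map from popular term to a BitSet of matching documents; the class and method names are hypothetical, not Lucene's IndexReader API, and as noted above positional queries would still need real TermPositions.

```java
import java.util.*;

// Sketch: cache a BitSet of matching documents for super-popular terms
// so TermQuery-style lookups avoid re-reading postings from disk.
public class PopularTermCache {
    private final Map<String, BitSet> cache = new HashMap<>();

    void cacheTerm(String term, int[] docs) {
        BitSet bits = new BitSet();
        for (int d : docs) bits.set(d);
        cache.put(term, bits);
    }

    // Returns null on a cache miss; the caller would then fall back
    // to the regular TermDocs path.
    BitSet docsFor(String term) { return cache.get(term); }

    public static void main(String[] args) {
        PopularTermCache c = new PopularTermCache();
        c.cacheTerm("the", new int[]{0, 1, 2, 5});
        System.out.println(c.docsFor("the").cardinality() + " cached docs");
    }
}
```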



> Analyzer for preventing overload of search service by queries with common 
> terms in large indexes
> 
>
> Key: LUCENE-494
> URL: https://issues.apache.org/jira/browse/LUCENE-494
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: QueryAutoStopWordAnalyzer.java, 
> QueryAutoStopWordAnalyzerTest.java
>
>
> An analyzer used primarily at query time to wrap another analyzer and provide 
> a layer of protection
> which prevents very common words from being passed into queries. For very 
> large indexes the cost
> of reading TermDocs for a very common word can be  high. This analyzer was 
> created after experience with
> a 38 million doc index which had a term in around 50% of docs and was causing 
> TermQueries for 
> this term to take 2 seconds.
> Use the various "addStopWords" methods in this class to automate the 
> identification and addition of 
> stop words found in an already existing index.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Issue Comment Edited: (LUCENE-1016) TermVectorAccessor, transparent vector space access

2008-01-14 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558825#action_12558825
 ] 

gsingers edited comment on LUCENE-1016 at 1/14/08 2:57 PM:
--

I'm curious if the build part of this would be faster than reanalyzing a 
document.  Just thinking out loud, but I have been wondering about a Highlighter 
that uses the new TermVectorMapper, but using that doesn't account for 
non-TermVector-based Documents that need to be analyzed.  Was thinking this 
might account for both cases, all through the TermVectorMapper mechanism.  Just 
doesn't seem like it would be very fast.

  was (Author: gsingers):
I'm curious if the build part of this would be faster than reanalyzing a 
document.  Just thinking outloud, but I have wondering about a Highlighter that 
uses the new TermVectorMapper, but that doesn't account for non-TermVector 
based.  Was thinking this might account for both cases.
  
> TermVectorAccessor, transparent vector space access 
> 
>
> Key: LUCENE-1016
> URL: https://issues.apache.org/jira/browse/LUCENE-1016
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Term Vectors
>Affects Versions: 2.2
>Reporter: Karl Wettin
>Priority: Minor
> Attachments: LUCENE-1016.txt
>
>
> This class visits TermVectorMapper and populates it with information 
> transparent by either passing it down to the default terms cache (documents 
> indexed with Field.TermVector) or by resolving the inverted index.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1016) TermVectorAccessor, transparent vector space access

2008-01-14 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558825#action_12558825
 ] 

Grant Ingersoll commented on LUCENE-1016:
-

I'm curious if the build part of this would be faster than reanalyzing a 
document.  Just thinking out loud, but I have been wondering about a Highlighter 
that uses the new TermVectorMapper, though that doesn't account for 
non-TermVector-based Documents.  Was thinking this might account for both cases.

> TermVectorAccessor, transparent vector space access 
> 
>
> Key: LUCENE-1016
> URL: https://issues.apache.org/jira/browse/LUCENE-1016
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Term Vectors
>Affects Versions: 2.2
>Reporter: Karl Wettin
>Priority: Minor
> Attachments: LUCENE-1016.txt
>
>
> This class visits TermVectorMapper and populates it with information 
> transparent by either passing it down to the default terms cache (documents 
> indexed with Field.TermVector) or by resolving the inverted index.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries

2008-01-14 Thread Michael Goddard (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558819#action_12558819
 ] 

Michael Goddard commented on LUCENE-794:


Mark,

I've still got a little work to do on it, but would like to also include 
support for highlighting of RangeQuery within SpanNearQuery.  I have a new 
SpanQuery subclass which helps, and will post that to see if it merits 
inclusion within Lucene.  In conjunction with that, I'd have one last "else if" 
clause to add to the patch covered by this issue.  Basically, I'm trying to 
make a case for the work covered in this Jira issue being committed, since it's 
very useful to me.


> Extend contrib Highlighter to properly support phrase queries and span queries
> --
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
>Priority: Minor
> Attachments: spanhighlighter.patch, spanhighlighter10.patch, 
> spanhighlighter11.patch, spanhighlighter12.patch, spanhighlighter2.patch, 
> spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter6.patch, 
> spanhighlighter7.patch, spanhighlighter8.patch, spanhighlighter9.patch, 
> spanhighlighter_patch_4.zip
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter 
> package that scores just like QueryScorer, but scores a 0 for Terms that did 
> not cause the Query hit. This gives 'actual' hit highlighting for the range 
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts 
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries

2008-01-14 Thread Mark Miller
While the overall framework of LUCENE-663 appears similar to the current 
contrib Highlighter, the code is actually quite different and I do not 
think it handles as many corner cases in its current state. LUCENE-663 
supports PhraseQuerys by implementing 'special' search logic that 
inspects positional information to make sure the Tokens from a 
PhraseQuery are in order. I am not sure how exact this logic is compared 
to Lucene's PhraseQuery search logic, but a cursory look makes me think 
it's not complete. It almost looks to me like it only does in-order 
matching with simple slop (not edit distance)... I am too lazy to check 
further though, and I may have missed something. Also, LUCENE-663 does 
not support Span queries.


This patch differs in that it fits the current Highlighter framework 
without modifying it, and it uses Lucene's own internal search logic to 
identify Spans for highlighting. PhraseQueries are handled by a 
SpanQuery approximation.


As far as PhraseQuery/SpanQuery highlighting, I don't think any of the 
other Highlighter packages offer much. I think that things could be done 
a little faster, but that would require abandoning the current 
framework, and with all of the corner cases it now supports, I'd hate to 
see that.


The other Highlighter code that is worth consideration is LUCENE-644. It 
does abandon the current Highlighter framework and goes with an approach 
that is much more efficient for larger documents: instead of attacking 
the problem by spinning through all of the document tokens and comparing 
to query tokens, 644 just looks at the tokens from the query and grabs 
the original text using the offsets from those tokens. This is darn 
fast, but doesn't go well with positional highlighting, and I wonder how 
well it supports all of the corner cases that arise with overlapping 
tokens and whatnot.


- Mark

Grant Ingersoll (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558784#action_12558784 ] 


Grant Ingersoll commented on LUCENE-794:


How should this relate to LUCENE-663?  Seems like that one also covers other 
kinds of queries?  I'm no expert in highlighting, but it seems like there are at 
least 3 different issues in JIRA for enabling things like phrase queries, etc.  
Should we try to consolidate these?

  

Extend contrib Highlighter to properly support phrase queries and span queries
--

Key: LUCENE-794
URL: https://issues.apache.org/jira/browse/LUCENE-794
Project: Lucene - Java
 Issue Type: Improvement
 Components: Other
   Reporter: Mark Miller
   Priority: Minor
Attachments: spanhighlighter.patch, spanhighlighter10.patch, 
spanhighlighter11.patch, spanhighlighter12.patch, spanhighlighter2.patch, 
spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter6.patch, 
spanhighlighter7.patch, spanhighlighter8.patch, spanhighlighter9.patch, 
spanhighlighter_patch_4.zip


This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter package 
that scores just like QueryScorer, but scores a 0 for Terms that did not cause 
the Query hit. This gives 'actual' hit highlighting for the range of SpanQuerys 
and PhraseQuery. There is also a new Fragmenter that attempts to fragment 
without breaking up Spans.
See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
There is a dependency on MemoryIndex.



  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries

2008-01-14 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558803#action_12558803
 ] 

Grant Ingersoll commented on LUCENE-794:


Never mind, I went back and read the thread at 
http://lucene.markmail.org/message/p4gfxewk6jcqfxxj?q=highlighter+list:org%2Eapache%2Elucene%2Ejava-user
which I think accounts for this approach and makes sense to me.

> Extend contrib Highlighter to properly support phrase queries and span queries
> --
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
>Priority: Minor
> Attachments: spanhighlighter.patch, spanhighlighter10.patch, 
> spanhighlighter11.patch, spanhighlighter12.patch, spanhighlighter2.patch, 
> spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter6.patch, 
> spanhighlighter7.patch, spanhighlighter8.patch, spanhighlighter9.patch, 
> spanhighlighter_patch_4.zip
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter 
> package that scores just like QueryScorer, but scores a 0 for Terms that did 
> not cause the Query hit. This gives 'actual' hit highlighting for the range 
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts 
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries

2008-01-14 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558784#action_12558784
 ] 

Grant Ingersoll commented on LUCENE-794:


How should this relate to LUCENE-663?  Seems like that one also covers other 
kinds of queries?  I'm no expert in highlighting, but it seems like there are at 
least 3 different issues in JIRA for enabling things like phrase queries, etc.  
Should we try to consolidate these?

> Extend contrib Highlighter to properly support phrase queries and span queries
> --
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
>Priority: Minor
> Attachments: spanhighlighter.patch, spanhighlighter10.patch, 
> spanhighlighter11.patch, spanhighlighter12.patch, spanhighlighter2.patch, 
> spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter6.patch, 
> spanhighlighter7.patch, spanhighlighter8.patch, spanhighlighter9.patch, 
> spanhighlighter_patch_4.zip
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter 
> package that scores just like QueryScorer, but scores a 0 for Terms that did 
> not cause the Query hit. This gives 'actual' hit highlighting for the range 
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts 
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Lucene 2.3 RC3 available for testing

2008-01-14 Thread Michael Busch
Hi all,

I just uploaded Lucene 2.3 RC3 to:
http://people.apache.org/~buschmi/staging_area/lucene_2_3/

RC3 fixes a problem in the indexer that could cause it to hang after a
disk-full exception occurred (see
https://issues.apache.org/jira/browse/LUCENE-1130 for details).

Please switch to RC3 and keep testing!
-Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-1131) Add numDeletedDocs to IndexReader

2008-01-14 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic reassigned LUCENE-1131:


Assignee: Otis Gospodnetic

> Add numDeletedDocs to IndexReader
> -
>
> Key: LUCENE-1131
> URL: https://issues.apache.org/jira/browse/LUCENE-1131
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Shai Erera
>Assignee: Otis Gospodnetic
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1131.patch
>
>
> Add numDeletedDocs to IndexReader. Basically, the implementation is as simple 
> as doing:
> public int numDeletedDocs() {
>   return deletedDocs == null ? 0 : deletedDocs.count();
> }
> in SegmentReader.
> Patch to follow to include in all IndexReader extensions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-400) NGramFilter -- construct n-grams from a TokenStream

2008-01-14 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-400:


Assignee: Otis Gospodnetic

Thanks for bringing this up to date.  I'll commit it after 2.3 is out.


> NGramFilter -- construct n-grams from a TokenStream
> ---
>
> Key: LUCENE-400
> URL: https://issues.apache.org/jira/browse/LUCENE-400
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: unspecified
> Environment: Operating System: All
> Platform: All
>Reporter: Sebastian Kirsch
>Assignee: Otis Gospodnetic
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-400.patch, NGramAnalyzerWrapper.java, 
> NGramAnalyzerWrapperTest.java, NGramFilter.java, NGramFilterTest.java
>
>
> This filter constructs n-grams (token combinations up to a fixed size, 
> sometimes
> called "shingles") from a token stream.
> The filter sets start offsets, end offsets and position increments, so
> highlighting and phrase queries should work.
> Position increments > 1 in the input stream are replaced by filler tokens
> (tokens with termText "_" and endOffset - startOffset = 0) in the output
> n-grams. (Position increments > 1 in the input stream are usually caused by
> removing some tokens, eg. stopwords, from a stream.)
> The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache
> Commons-Collections.
> Filter, test case and an analyzer are attached.
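
The filler-token behavior described above can be sketched in a few lines. This is a simplified stand-in, not the attached NGramFilter: it works on plain string arrays rather than a TokenStream, and the names are illustrative.

```java
import java.util.*;

// Sketch of word-level n-gram ("shingle") construction with filler
// tokens: a position increment > 1 in the input is replaced by "_"
// fillers, so output bigrams record the gap left by removed stopwords.
public class ShingleSketch {

    // terms paired with position increments (1 = adjacent to the
    // previous token, 2 = one removed token in between, ...)
    static List<String> bigrams(String[] terms, int[] posIncr) {
        List<String> expanded = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            for (int gap = 1; gap < posIncr[i]; gap++) expanded.add("_"); // filler
            expanded.add(terms[i]);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < expanded.size(); i++) {
            out.add(expanded.get(i) + " " + expanded.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // "please divide [the] sand", with "the" removed by a stop filter
        System.out.println(bigrams(new String[]{"please", "divide", "sand"},
                                   new int[]{1, 1, 2}));
    }
}
```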

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1131) Add numDeletedDocs to IndexReader

2008-01-14 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558727#action_12558727
 ] 

Otis Gospodnetic commented on LUCENE-1131:
--

I think maxDoc() is a cheap call, so calling it twice won't be a performance 
killer, esp. since this is not something you'd call frequently, I imagine.

However, I do agree about numDeletedDocs() being nice for hiding implementation 
details.
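
The equivalence being discussed reduces to simple arithmetic: without an accessor, callers derive the deleted count from two existing calls. The reader class below is a stand-in for illustration, not Lucene's IndexReader.

```java
// Sketch: numDeletedDocs as the difference of two existing accessors.
public class DeletedDocCount {
    final int maxDoc;   // one greater than the largest doc number used
    final int numDocs;  // live (non-deleted) documents

    DeletedDocCount(int maxDoc, int numDocs) {
        this.maxDoc = maxDoc;
        this.numDocs = numDocs;
    }

    // What callers write today: two calls, exposing the implementation
    // detail that deleted docs still occupy doc-number slots.
    int deletedViaMaxDoc() { return maxDoc - numDocs; }

    public static void main(String[] args) {
        DeletedDocCount r = new DeletedDocCount(100, 97);
        System.out.println(r.deletedViaMaxDoc() + " deleted documents");
    }
}
```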

> Add numDeletedDocs to IndexReader
> -
>
> Key: LUCENE-1131
> URL: https://issues.apache.org/jira/browse/LUCENE-1131
> Project: Lucene - Java
>  Issue Type: New Feature
>Reporter: Shai Erera
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1131.patch
>
>
> Add numDeletedDocs to IndexReader. Basically, the implementation is as simple 
> as doing:
> public int numDeletedDocs() {
>   return deletedDocs == null ? 0 : deletedDocs.count();
> }
> in SegmentReader.
> Patch to follow to include in all IndexReader extensions.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-400) NGramFilter -- construct n-grams from a TokenStream

2008-01-14 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558717#action_12558717
 ] 

Steven Rowe commented on LUCENE-400:


Removed the duplicate link (to LUCENE-759), since that issue is about 
character-level n-grams, and this issue is about word-level n-grams.

> NGramFilter -- construct n-grams from a TokenStream
> ---
>
> Key: LUCENE-400
> URL: https://issues.apache.org/jira/browse/LUCENE-400
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: unspecified
> Environment: Operating System: All
> Platform: All
>Reporter: Sebastian Kirsch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-400.patch, NGramAnalyzerWrapper.java, 
> NGramAnalyzerWrapperTest.java, NGramFilter.java, NGramFilterTest.java
>
>
> This filter constructs n-grams (token combinations up to a fixed size, 
> sometimes
> called "shingles") from a token stream.
> The filter sets start offsets, end offsets and position increments, so
> highlighting and phrase queries should work.
> Position increments > 1 in the input stream are replaced by filler tokens
> (tokens with termText "_" and endOffset - startOffset = 0) in the output
> n-grams. (Position increments > 1 in the input stream are usually caused by
> removing some tokens, eg. stopwords, from a stream.)
> The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache
> Commons-Collections.
> Filter, test case and an analyzer are attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1127) TokenSources.getTokenStream(Document...)

2008-01-14 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1127:


Attachment: LUCENE-1127.patch

> TokenSources.getTokenStream(Document...) 
> -
>
> Key: LUCENE-1127
> URL: https://issues.apache.org/jira/browse/LUCENE-1127
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-1127.patch, LUCENE-1127.patch
>
>
> Sometimes, one already has the Document, and just needs to generate a 
> TokenStream from it, so I am going to add a convenience method to 
> TokenSources.  Sometimes, you also already have just the string, so I will 
> add a convenience method for that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-1130) Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang

2008-01-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1130.


Resolution: Fixed

OK fixed & ported to 2.3 branch!

> Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang
> 
>
> Key: LUCENE-1130
> URL: https://issues.apache.org/jira/browse/LUCENE-1130
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1130.patch, LUCENE-1130.take2.patch
>
>
> More testing of RC2 ...
> I found one case: if you hit disk full during init() in
> DocumentsWriter.ThreadState, when we first create the term vectors &
> fields writer, subsequent calls to
> IndexWriter.add/updateDocument will then hang forever.
> What's happening in this case is we are incrementing nextDocID even
> though we never call finishDocument (because we "thought" init did not
> succeed).  Then, when we finish the next document, it will never
> actually be written because the missing finishDocument call never happens.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1128) Add Highlighting benchmark support to contrib/benchmark

2008-01-14 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1128:


Attachment: LUCENE-1128.patch

I think this one is good.  I have noticed w/ SVN that I was getting things like 
this from svn stat:
{quote}
A  +   
src/java/org/apache/lucene/benchmark/byTask/tasks/SearchTravRetHighlightTask.java
{quote}

Which means that SVN thinks there is a history for the file.  Turns out, it is 
from doing a copy of another file.  Thus, I had to remove the file and then 
re-add it.



> Add Highlighting benchmark support to contrib/benchmark
> ---
>
> Key: LUCENE-1128
> URL: https://issues.apache.org/jira/browse/LUCENE-1128
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-1128.patch, LUCENE-1128.patch, LUCENE-1128.patch
>
>
> I would like to be able to test the performance (speed, initially) of the 
> Highlighter in a standard way.  Patch to follow that adds the Highlighter as 
> a dependency benchmark and adds in tasks extending the ReadTask to perform 
> highlighting on retrieved documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1130) Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang

2008-01-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558673#action_12558673
 ] 

Michael Busch commented on LUCENE-1130:
---

{quote}
Thanks for testing Michael!
{quote}

I'll forward the thanks to my colleagues, they're doing a great job with 
testing the 2.3 RCs currently!

Thank YOU for the quick fixes, Mike!!

> Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang
> 
>
> Key: LUCENE-1130
> URL: https://issues.apache.org/jira/browse/LUCENE-1130
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1130.patch, LUCENE-1130.take2.patch
>
>
> More testing of RC2 ...
> I found one case: if you hit disk full during init() in
> DocumentsWriter.ThreadState, when we first create the term vectors &
> fields writer, subsequent calls to
> IndexWriter.add/updateDocument will then hang forever.
> What's happening in this case is we are incrementing nextDocID even
> though we never call finishDocument (because we "thought" init did not
> succeed).  Then, when we finish the next document, it will never
> actually be written because the missing finishDocument call never happens.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1128) Add Highlighting benchmark support to contrib/benchmark

2008-01-14 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558669#action_12558669
 ] 

Mark Miller commented on LUCENE-1128:
-

Is it just me or does this patch seem to assume that a couple of new classes 
already exist?

If so, any chance of getting a clean one?

> Add Highlighting benchmark support to contrib/benchmark
> ---
>
> Key: LUCENE-1128
> URL: https://issues.apache.org/jira/browse/LUCENE-1128
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-1128.patch, LUCENE-1128.patch
>
>
> I would like to be able to test the performance (speed, initially) of the 
> Highlighter in a standard way.  Patch to follow that adds the Highlighter as 
> a dependency benchmark and adds in tasks extending the ReadTask to perform 
> highlighting on retrieved documents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-550) InstantiatedIndex - faster but memory consuming index

2008-01-14 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558640#action_12558640
 ] 

Karl Wettin commented on LUCENE-550:


I was poking around in the javadocs of this and came to the conclusion that 
InstantiatedIndexWriter is deprecated code: it is enough that one can construct 
an InstantiatedIndex using an optimized IndexReader. This makes all 
InstantiatedIndexes immutable, so the no-locks caveat goes away.

Also, it is a hassle to make sure that InstantiatedIndexWriter works just as 
IndexWriter does.

In the future, a segmented Directory facade could be built on top of this, 
where each InstantiatedIndex is a segment created by an IndexWriter flush. It 
would potentially be slower to populate, but it would be compatible with 
everything. Adding more than one segment will require merging and optimizing 
indices back and forth in RAMDirectories a bit, but InstantiatedIndexes are 
usually quite small.

It feels like much of that code is already there.

On the matter of RAM consumption: using a profiler, I recently noticed that a 
3.2MB directory of 3-5;3-3;3-5 ngrams with term vectors consumed something like 
35MB of RAM when loaded into an InstantiatedIndex.




> InstantiatedIndex - faster but memory consuming index
> -
>
> Key: LUCENE-550
> URL: https://issues.apache.org/jira/browse/LUCENE-550
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.0.0
>Reporter: Karl Wettin
>Assignee: Grant Ingersoll
> Attachments: HitCollectionBench.jpg, 
> LUCENE-550_20071021_no_core_changes.txt, test-reports.zip
>
>
> Represented as a coupled graph of class instances, this all-in-memory index 
> store implementation delivers search results up to a 100 times faster than 
> the file-centric RAMDirectory at the cost of greater RAM consumption.
> Performance seems to be a little bit better than log2(n) (binary search). No 
> real data on that, just my eyes.
> Populated with a single document, InstantiatedIndex is almost, but not quite, 
> as fast as MemoryIndex.
> At 20,000 documents 10-50 characters long, InstantiatedIndex outperforms 
> RAMDirectory some 30x,
> 15x at 100 documents of 2000 characters length,
> and is linear to RAMDirectory at 10,000 documents of 2000 characters length.
> Mileage may vary depending on term saturation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1130) Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang

2008-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558638#action_12558638
 ] 

Michael McCandless commented on LUCENE-1130:


OK I will commit today.  Thanks for testing Michael!

> Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang
> 
>
> Key: LUCENE-1130
> URL: https://issues.apache.org/jira/browse/LUCENE-1130
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1130.patch, LUCENE-1130.take2.patch
>
>
> More testing of RC2 ...
> I found one case: if you hit disk full during init() in
> DocumentsWriter.ThreadState, when we first create the term vectors &
> fields writer, subsequent calls to
> IndexWriter.add/updateDocument will then hang forever.
> What's happening in this case is we are incrementing nextDocID even
> though we never call finishDocument (because we "thought" init did not
> succeed).  Then, when we finish the next document, it will never
> actually be written because the missing finishDocument call never happens.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-705) CompoundFileWriter should pre-set its file length

2008-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558637#action_12558637
 ] 

Michael McCandless commented on LUCENE-705:
---

OK I'll test on the major platforms, and take that approach.  I'll tentatively 
target 2.4.
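The pre-sizing idea under discussion can be sketched with the standard library. Note the relevant call is RandomAccessFile.setLength() (java.io.File has no setLength method). This is an illustrative sketch of the behavior, not the proposed IndexOutput change; the file name and size are made up.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class PresetLength {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("cfs-sketch", ".bin");
        long expectedSize = 1 << 20; // 1 MB; a CompoundFileWriter knows the exact size
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "rw")) {
            // Reserve the final size up front; a disk-full condition can then
            // surface here instead of midway through writing the contents.
            raf.setLength(expectedSize);
            System.out.println(raf.length() == expectedSize); // true
        } finally {
            Files.delete(tmp);
        }
    }
}
```

One caveat worth testing per platform: on filesystems that create sparse files, extending the length may not actually allocate blocks, so the early disk-full failure is not guaranteed everywhere.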

> CompoundFileWriter should pre-set its file length
> -
>
> Key: LUCENE-705
> URL: https://issues.apache.org/jira/browse/LUCENE-705
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
>
> I've read that if you are writing a large file, it's best to pre-set
> the size of the file in advance before you write all of its contents.
> This in general minimizes fragmentation and improves IO performance
> against the file in the future.
> I think this makes sense (intuitively) but I haven't done any real
> performance testing to verify.
> Java has the java.io.RandomAccessFile.setLength() method (since 1.2) for this.
> We can easily fix CompoundFileWriter to call setLength() on the file
> it's writing (and add setLength() method to IndexOutput).  The
> CompoundFileWriter knows exactly how large its file will be.
> Another good thing is: if you are going to run out of disk space, then
> the setLength call should fail up front instead of failing when the
> compound file is actually written.  This has two benefits: first, you
> find out sooner that you will run out of disk space, and, second, you
> don't fill up the disk down to 0 bytes left (always a frustrating
> experience!).  Instead you leave what space was available
> and throw an IOException.
> My one hesitation here is: what if out there there exists a filesystem
> that can't handle this call, and it throws an IOException on that
> platform?  But this is balanced against possible easy-win improvement
> in performance.
> Does anyone have any feedback / thoughts / experience relevant to
> this?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-705) CompoundFileWriter should pre-set its file length

2008-01-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-705:
--

Fix Version/s: 2.4

> CompoundFileWriter should pre-set its file length
> -
>
> Key: LUCENE-705
> URL: https://issues.apache.org/jira/browse/LUCENE-705
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
>
> I've read that if you are writing a large file, it's best to pre-set
> the size of the file in advance before you write all of its contents.
> This in general minimizes fragmentation and improves IO performance
> against the file in the future.
> I think this makes sense (intuitively) but I haven't done any real
> performance testing to verify.
> Java has the java.io.RandomAccessFile.setLength() method (since 1.2) for this.
> We can easily fix CompoundFileWriter to call setLength() on the file
> it's writing (and add setLength() method to IndexOutput).  The
> CompoundFileWriter knows exactly how large its file will be.
> Another good thing is: if you are going to run out of disk space, then
> the setLength call should fail up front instead of failing when the
> compound file is actually written.  This has two benefits: first, you
> find out sooner that you will run out of disk space, and, second, you
> don't fill up the disk down to 0 bytes left (always a frustrating
> experience!).  Instead you leave what space was available
> and throw an IOException.
> My one hesitation here is: what if out there there exists a filesystem
> that can't handle this call, and it throws an IOException on that
> platform?  But this is balanced against possible easy-win improvement
> in performance.
> Does anyone have any feedback / thoughts / experience relevant to
> this?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-325) [PATCH] new method expungeDeleted() added to IndexWriter

2008-01-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-325:
--

Fix Version/s: 2.4

> [PATCH] new method expungeDeleted() added to IndexWriter
> 
>
> Key: LUCENE-325
> URL: https://issues.apache.org/jira/browse/LUCENE-325
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: Windows XP
> Platform: All
>Reporter: John Wang
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: attachment.txt, IndexWriter.patch, IndexWriter.patch, 
> TestExpungeDeleted.java
>
>
> We make use of the docIDs in Lucene. I need a way to compact the docIDs in 
> segments to remove the "holes" created from doing deletes. The only way to do 
> this is by calling IndexWriter.optimize(). This is a very heavy call; for the 
> cases where the index is large but has a very small number of deleted docs, 
> calling optimize is not practical.
> I need a new method: expungeDeleted(), which finds all the segments that have 
> deleted documents and merges only those segments.
> I have implemented this method and have discussed with Otis about submitting a 
> patch. I don't see where I can attach the patch. I will follow the patch 
> guideline and email the lucene mailing list.
> Thanks
> -John
> I don't see a place where I can

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-325) [PATCH] new method expungeDeleted() added to IndexWriter

2008-01-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-325:
-

Assignee: Michael McCandless  (was: Lucene Developers)

> [PATCH] new method expungeDeleted() added to IndexWriter
> 
>
> Key: LUCENE-325
> URL: https://issues.apache.org/jira/browse/LUCENE-325
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: Windows XP
> Platform: All
>Reporter: John Wang
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: attachment.txt, IndexWriter.patch, IndexWriter.patch, 
> TestExpungeDeleted.java
>
>
> We make use of the docIDs in Lucene. I need a way to compact the docIDs in 
> segments to remove the "holes" created from doing deletes. The only way to do 
> this is by calling IndexWriter.optimize(). This is a very heavy call; for the 
> cases where the index is large but has a very small number of deleted docs, 
> calling optimize is not practical.
> I need a new method: expungeDeleted(), which finds all the segments that have 
> deleted documents and merges only those segments.
> I have implemented this method and have discussed with Otis about submitting a 
> patch. I don't see where I can attach the patch. I will follow the patch 
> guideline and email the lucene mailing list.
> Thanks
> -John
> I don't see a place where I can

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-325) [PATCH] new method expungeDeleted() added to IndexWriter

2008-01-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-325:
-

Assignee: Michael McCandless  (was: Lucene Developers)

> [PATCH] new method expungeDeleted() added to IndexWriter
> 
>
> Key: LUCENE-325
> URL: https://issues.apache.org/jira/browse/LUCENE-325
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: Windows XP
> Platform: All
>Reporter: John Wang
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.4
>
> Attachments: attachment.txt, IndexWriter.patch, IndexWriter.patch, 
> TestExpungeDeleted.java
>
>
> We make use of the docIDs in Lucene. I need a way to compact the docIDs in 
> segments to remove the "holes" created from doing deletes. The only way to do 
> this is by calling IndexWriter.optimize(). This is a very heavy call; for the 
> cases where the index is large but has a very small number of deleted docs, 
> calling optimize is not practical.
> I need a new method: expungeDeleted(), which finds all the segments that have 
> deleted documents and merges only those segments.
> I have implemented this method and have discussed with Otis about submitting a 
> patch. I don't see where I can attach the patch. I will follow the patch 
> guideline and email the lucene mailing list.
> Thanks
> -John
> I don't see a place where I can

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-325) [PATCH] new method expungeDeleted() added to IndexWriter

2008-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558636#action_12558636
 ] 

Michael McCandless commented on LUCENE-325:
---

I think we should resurrect this: I agree it's useful.  I'll take it & 
tentatively mark it 2.4 (hopefully I can make time by then!).

The original patch would simply merge one segment "in place".  I think we can 
improve this a bit by merging any adjacent series of segments that have 
deletions?  This would still preserve docID ordering, but would also accomplish 
some merging as a side effect (I think a good thing).
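The merge-selection idea above can be sketched as follows: group each adjacent run of segments that have deletions into one merge, which preserves docID ordering across the index. This is an illustrative sketch only; the segment representation is made up and is not Lucene's merge-policy API.

```java
import java.util.ArrayList;
import java.util.List;

public class ExpungeDeletedSketch {
    // hasDeletes[i] is true if segment i contains deleted docs.
    // Returns one {startIdx, endIdxExclusive} pair per adjacent run.
    public static List<int[]> adjacentRunsWithDeletes(boolean[] hasDeletes) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < hasDeletes.length) {
            if (!hasDeletes[i]) { i++; continue; }
            int start = i;
            while (i < hasDeletes.length && hasDeletes[i]) i++;
            runs.add(new int[]{start, i});
        }
        return runs;
    }

    public static void main(String[] args) {
        // Segments 1-2 and 4 have deletions: two merges, docID order preserved.
        boolean[] segs = {false, true, true, false, true};
        for (int[] run : adjacentRunsWithDeletes(segs)) {
            System.out.println("merge segments " + run[0] + ".." + (run[1] - 1));
        }
        // merge segments 1..2
        // merge segments 4..4
    }
}
```

Because only adjacent segments are ever combined, surviving documents keep their relative order, unlike a full optimize() which rewrites everything into one segment.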

> [PATCH] new method expungeDeleted() added to IndexWriter
> 
>
> Key: LUCENE-325
> URL: https://issues.apache.org/jira/browse/LUCENE-325
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: Windows XP
> Platform: All
>Reporter: John Wang
>Assignee: Lucene Developers
>Priority: Minor
> Fix For: 2.4
>
> Attachments: attachment.txt, IndexWriter.patch, IndexWriter.patch, 
> TestExpungeDeleted.java
>
>
> We make use of the docIDs in Lucene. I need a way to compact the docIDs in 
> segments to remove the "holes" created from doing deletes. The only way to do 
> this is by calling IndexWriter.optimize(). This is a very heavy call; for the 
> cases where the index is large but has a very small number of deleted docs, 
> calling optimize is not practical.
> I need a new method: expungeDeleted(), which finds all the segments that have 
> deleted documents and merges only those segments.
> I have implemented this method and have discussed with Otis about submitting a 
> patch. I don't see where I can attach the patch. I will follow the patch 
> guideline and email the lucene mailing list.
> Thanks
> -John
> I don't see a place where I can

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-325) [PATCH] new method expungeDeleted() added to IndexWriter

2008-01-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558636#action_12558636
 ] 

Michael McCandless commented on LUCENE-325:
---

I think we should resurrect this: I agree it's useful.  I'll take it & 
tentatively mark it 2.4 (hopefully I can make time by then!).

The original patch would simply merge one segment "in place".  I think we can 
improve this a bit by merging any adjacent series of segments that have 
deletions?  This would still preserve docID ordering, but would also accomplish 
some merging as a side effect (I think a good thing).

> [PATCH] new method expungeDeleted() added to IndexWriter
> 
>
> Key: LUCENE-325
> URL: https://issues.apache.org/jira/browse/LUCENE-325
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: Windows XP
> Platform: All
>Reporter: John Wang
>Assignee: Lucene Developers
>Priority: Minor
> Fix For: 2.4
>
> Attachments: attachment.txt, IndexWriter.patch, IndexWriter.patch, 
> TestExpungeDeleted.java
>
>
> We make use of the docIDs in Lucene. I need a way to compact the docIDs in 
> segments to remove the "holes" created from doing deletes. The only way to do 
> this is by calling IndexWriter.optimize(). This is a very heavy call; for the 
> cases where the index is large but has a very small number of deleted docs, 
> calling optimize is not practical.
> I need a new method: expungeDeleted(), which finds all the segments that have 
> deleted documents and merges only those segments.
> I have implemented this method and have discussed with Otis about submitting a 
> patch. I don't see where I can attach the patch. I will follow the patch 
> guideline and email the lucene mailing list.
> Thanks
> -John
> I don't see a place where I can

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-400) NGramFilter -- construct n-grams from a TokenStream

2008-01-14 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-400:
---

Lucene Fields: [Patch Available]
Fix Version/s: 2.4

Thanks, Steve.  I will mark this as 2.4

> NGramFilter -- construct n-grams from a TokenStream
> ---
>
> Key: LUCENE-400
> URL: https://issues.apache.org/jira/browse/LUCENE-400
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: unspecified
> Environment: Operating System: All
> Platform: All
>Reporter: Sebastian Kirsch
>Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-400.patch, NGramAnalyzerWrapper.java, 
> NGramAnalyzerWrapperTest.java, NGramFilter.java, NGramFilterTest.java
>
>
> This filter constructs n-grams (token combinations up to a fixed size, 
> sometimes
> called "shingles") from a token stream.
> The filter sets start offsets, end offsets and position increments, so
> highlighting and phrase queries should work.
> Position increments > 1 in the input stream are replaced by filler tokens
> (tokens with termText "_" and endOffset - startOffset = 0) in the output
> n-grams. (Position increments > 1 in the input stream are usually caused by
> removing some tokens, eg. stopwords, from a stream.)
> The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache
> Commons-Collections.
> Filter, test case and an analyzer are attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1130) Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang

2008-01-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558536#action_12558536
 ] 

Michael Busch commented on LUCENE-1130:
---

Mike,

all core & contrib tests pass for me. 
Also the disk full test that I mentioned passes with your take2 patch. 
Without the patch it fails with RC2.

So +1 for committing it to trunk & 2.3 branch!
I'll build RC3 once this is committed.

> Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang
> 
>
> Key: LUCENE-1130
> URL: https://issues.apache.org/jira/browse/LUCENE-1130
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1130.patch, LUCENE-1130.take2.patch
>
>
> More testing of RC2 ...
> I found one case: if you hit disk full during init() in
> DocumentsWriter.ThreadState, when we first create the term vectors &
> fields writer, subsequent calls to
> IndexWriter.add/updateDocument will then hang forever.
> What's happening in this case is we are incrementing nextDocID even
> though we never call finishDocument (because we "thought" init did not
> succeed).  Then, when we finish the next document, it will never
> actually be written because the missing finishDocument call never happens.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]