[jira] Updated: (LUCENE-400) NGramFilter -- construct n-grams from a TokenStream

2008-01-13 Thread Steven Rowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rowe updated LUCENE-400:
---

Attachment: LUCENE-400.patch

Repackaged these four files as a patch, with the following modifications to the 
code:

* Renamed files and variables to refer to "n-grams" as "shingles", to avoid 
confusion with the character-level n-gram code already in Lucene's sandbox
* Placed code in the o.a.l.analysis.shingle package
* Converted commons-collections FIFO usages to LinkedLists
* Removed @author from javadocs
* Changed deprecated Lucene API usages to alternate forms; addressed all 
compilation warnings
* Changed code style to conform to Lucene conventions
* Changed field setters to return void instead of a reference to the class 
instance, then changed instantiations to use individual setter calls instead of 
the chained calling style (see the sketch after this list)
* Added ASF license to each file
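
For illustration, a minimal sketch of that setter-style change (hypothetical 
ShingleFilterParams class and fields, not code from the attached patch):

// Hypothetical example only -- not code from the attached patch.
class ShingleFilterParams {
  private int maxShingleSize;
  private String tokenSeparator;

  // Setters now return void rather than 'this', so calls can no longer be chained.
  void setMaxShingleSize(int maxShingleSize) { this.maxShingleSize = maxShingleSize; }
  void setTokenSeparator(String tokenSeparator) { this.tokenSeparator = tokenSeparator; }

  void configureExample() {
    // Old style (before the change): chained calls on setters that returned 'this':
    //   new ShingleFilterParams().setMaxShingleSize(3).setTokenSeparator(" ");
    // New style (after the change): individual setter calls.
    ShingleFilterParams params = new ShingleFilterParams();
    params.setMaxShingleSize(3);
    params.setTokenSeparator(" ");
  }
}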

All tests pass.

Although I left the ShingleAnalyzerWrapper and its test in the patch, no other 
Lucene filter (AFAICT) has such an analyzer-wrapping facility.  My vote is to 
remove these two files.

> NGramFilter -- construct n-grams from a TokenStream
> ---
>
> Key: LUCENE-400
> URL: https://issues.apache.org/jira/browse/LUCENE-400
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: unspecified
> Environment: Operating System: All
> Platform: All
>Reporter: Sebastian Kirsch
>Priority: Minor
> Attachments: LUCENE-400.patch, NGramAnalyzerWrapper.java, 
> NGramAnalyzerWrapperTest.java, NGramFilter.java, NGramFilterTest.java
>
>
> This filter constructs n-grams (token combinations up to a fixed size, 
> sometimes
> called "shingles") from a token stream.
> The filter sets start offsets, end offsets and position increments, so
> highlighting and phrase queries should work.
> Position increments > 1 in the input stream are replaced by filler tokens
> (tokens with termText "_" and endOffset - startOffset = 0) in the output
> n-grams. (Position increments > 1 in the input stream are usually caused by
> removing some tokens, eg. stopwords, from a stream.)
> The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache
> Commons-Collections.
> Filter, test case and an analyzer are attached.
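
As a rough standalone illustration of the shingling idea described above 
(assumed token list and separator; this is not the attached NGramFilter code, 
does not use the Lucene TokenStream API, and ignores offsets, position 
increments and filler tokens):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone sketch of word-level n-gram ("shingle") construction.
public class ShingleSketch {
  static List<String> shingles(List<String> tokens, int maxShingleSize) {
    List<String> out = new ArrayList<String>();
    for (int start = 0; start < tokens.size(); start++) {
      StringBuilder sb = new StringBuilder(tokens.get(start));
      out.add(sb.toString());                                   // the unigram itself
      for (int len = 2; len <= maxShingleSize && start + len <= tokens.size(); len++) {
        sb.append(' ').append(tokens.get(start + len - 1));
        out.add(sb.toString());                                 // 2-gram, 3-gram, ...
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList("please", "divide", "this", "sentence");
    System.out.println(shingles(tokens, 2));
    // [please, please divide, divide, divide this, this, this sentence, sentence]
  }
}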

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: A bit of planning

2008-01-13 Thread DM Smith


On Jan 12, 2008, at 6:35 PM, Chris Hostetter wrote:



: Hmm, actually this is probably too restrictive. But maybe we could say
: that Lucene 3.0 doesn't have to be able to read indexes built with
: versions older than 2.0?

that is in fact the position that lucene has had since as long as i've been
involved with it...

http://wiki.apache.org/lucene-java/BackwardsCompatibility


File formats are back-compatible between major versions. Version X.N
should be able to read indexes generated by any version after and
including version X-1.0, but may-or-may-not be able to read indexes
generated by version X-2.N.


3.X must be able to read files created by 2.Y (where X and Y can be any
number)



If I remember right, the file format changed in 2.1, such that 2.0  
could not read a 2.1 index.


I seem to recall that 2.0 was 1.9 with the deprecations removed and  
perhaps some minor changes.


I think we are going to need to take a similar approach to change the file 
formats, but especially to go to Java 5.


Release 3.0 will be the first to require Java 5. But we can't do it as we did 
before, where the new API was introduced in the x.9 release with the old marked 
as deprecated. To do so would make 2.9 not be a drop-in replacement for 2.4.


I'd like to recommend that 3.0 contain the new Java 5 API changes, with what 
they replace marked as deprecated. 3.0 would also remove what was deprecated in 
2.9. Then in 3.1 we remove the new deprecations.


While I was very vocal against going to Java 5, I'm now very agreeable to the 
change. But I'd like to see it done in a consistent, deliberate and well 
thought out manner. And I'd like to help out where I can.


I still hope to "back port" any changes to Java 1.4.2, but with how  
well the 2.x series performs, I am much less inclined to do so. Lucene  
2.x is most excellent!!


-- DM Smith



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

2008-01-13 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558432#action_12558432
 ] 

Michael Busch commented on LUCENE-532:
--

I think LUCENE-783 (move all file headers to segments file) would solve this 
issue nicely. Then there would not be the need to call seek() in CFSWriter and 
TermInfosWriter anymore. I'd love to work on 783, but not sure if time permits 
in the near future.

> [PATCH] Indexing on Hadoop distributed file system
> --
>
> Key: LUCENE-532
> URL: https://issues.apache.org/jira/browse/LUCENE-532
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 1.9
>Reporter: Igor Bolotin
>Priority: Minor
> Attachments: cfs-patch.txt, indexOnDFS.patch, SegmentTermEnum.patch, 
> TermInfosWriter.patch
>
>
> In my current project we needed a way to create very large Lucene indexes on 
> Hadoop distributed file system. When we tried to do it directly on DFS using 
> Nutch FsDirectory class - we immediately found that indexing fails because 
> DfsIndexOutput.seek() method throws UnsupportedOperationException. The reason 
> for this behavior is clear - DFS does not support random updates and so 
> seek() method can't be supported (at least not easily).
>  
> Well, if we can't support random updates - the question is: do we really need 
> them? Search in the Lucene code revealed 2 places which call 
> IndexOutput.seek() method: one is in TermInfosWriter and another one in 
> CompoundFileWriter. As we weren't planning to use CompoundFileWriter - the 
> only place that concerned us was in TermInfosWriter.
>  
> TermInfosWriter uses IndexOutput.seek() in its close() method to write total 
> number of terms in the file back into the beginning of the file. It was very 
> simple to change file format a little bit and write number of terms into last 
> 8 bytes of the file instead of writing them into beginning of file. The only 
> other place that should be fixed in order for this to work is in 
> SegmentTermEnum constructor - to read this piece of information at position = 
> file length - 8.
>  
> With this format hack - we were able to use FsDirectory to write index 
> directly to DFS without any problems. Well - we still don't index directly to 
> DFS for performance reasons, but at least we can build small local indexes 
> and merge them into the main index on DFS without copying big main index back 
> and forth. 
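
A small self-contained sketch of the trailer trick described above (hypothetical 
file name and stand-in data; not the actual TermInfosWriter/SegmentTermEnum 
changes): the count is appended as the last 8 bytes of an append-only stream and 
read back from position file length - 8.

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: store a count in the last 8 bytes of an append-only file
// instead of seeking back to a header, so no seek() is needed while writing.
public class TrailerCountSketch {
  static void write(String path, long termCount) throws IOException {
    DataOutputStream out = new DataOutputStream(new FileOutputStream(path));
    try {
      out.writeBytes("term data ...");  // stand-in for the real term dictionary
      out.writeLong(termCount);         // trailer: written last, never seeks
    } finally {
      out.close();
    }
  }

  static long readCount(String path) throws IOException {
    RandomAccessFile in = new RandomAccessFile(path, "r");
    try {
      in.seek(in.length() - 8);         // reading may still seek; only writing cannot
      return in.readLong();
    } finally {
      in.close();
    }
  }

  public static void main(String[] args) throws IOException {
    write("terms.sketch", 42L);
    System.out.println(readCount("terms.sketch"));  // 42
  }
}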

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-753) Use NIO positional read to avoid synchronization in FSIndexInput

2008-01-13 Thread Nat (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558425#action_12558425
 ] 

Nat commented on LUCENE-753:


I think bufsize has a much bigger impact than the implementation. I found 
that a 64KB buffer size is at least 5-6 times faster than 1KB. Should we tune 
this parameter instead for maximum performance?

> Use NIO positional read to avoid synchronization in FSIndexInput
> 
>
> Key: LUCENE-753
> URL: https://issues.apache.org/jira/browse/LUCENE-753
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Reporter: Yonik Seeley
> Attachments: FileReadTest.java, FileReadTest.java, FileReadTest.java, 
> FSIndexInput.patch, FSIndexInput.patch
>
>
> As suggested by Doug, we could use NIO pread to avoid synchronization on the 
> underlying file.
> This could mitigate any MT performance drop caused by reducing the number of 
> files in the index format.
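
A sketch of the pread idea (assumed wrapper class, not the attached FSIndexInput 
patch): java.nio.channels.FileChannel.read(ByteBuffer, long) takes an explicit 
file position, so concurrent readers on one shared channel do not need to 
synchronize on a seek pointer.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: positional ("pread"-style) reads on one shared FileChannel.
// Each call passes its own file position, so no shared seek pointer
// has to be protected with synchronization.
public class PositionalReadSketch {
  private final FileChannel channel;

  PositionalReadSketch(String path) throws IOException {
    this.channel = new RandomAccessFile(path, "r").getChannel();
  }

  // Safe to call from multiple threads concurrently.
  int readAt(byte[] dst, long position) throws IOException {
    ByteBuffer buffer = ByteBuffer.wrap(dst);
    int total = 0;
    while (buffer.hasRemaining()) {
      int n = channel.read(buffer, position + total);  // positional read, no seek
      if (n < 0) {
        break;  // end of file
      }
      total += n;
    }
    return total;
  }
}

The buffer-size question raised in the comment above is orthogonal: whichever 
implementation is used, the size of dst (1KB vs 64KB) decides how many of these 
calls are made per file.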

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-325) [PATCH] new method expungeDeleted() added to IndexWriter

2008-01-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558413#action_12558413
 ] 

Grant Ingersoll commented on LUCENE-325:


This seems generally useful.  I imagine, though, that the patch is way out of 
date.  I wonder if the new ability to merge some segments might have an option 
to do this kind of thing.

Any thoughts on resurrecting this? 

> [PATCH] new method expungeDeleted() added to IndexWriter
> 
>
> Key: LUCENE-325
> URL: https://issues.apache.org/jira/browse/LUCENE-325
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: CVS Nightly - Specify date in submission
> Environment: Operating System: Windows XP
> Platform: All
>Reporter: John Wang
>Assignee: Lucene Developers
>Priority: Minor
> Attachments: attachment.txt, IndexWriter.patch, IndexWriter.patch, 
> TestExpungeDeleted.java
>
>
> We make use of the docIDs in Lucene. I need a way to compact the docIDs in 
> segments to remove the "holes" created from doing deletes. The only way to do 
> this is by calling IndexWriter.optimize(). This is a very heavy call; for the 
> cases where the index is large but has a very small number of deleted docs, 
> calling optimize is not practical.
> I need a new method: expungeDeleted(), which finds all the segments that have
> deleted documents and merges only those segments.
> I have implemented this method and have discussed with Otis about submitting a
> patch. I don't see where I can attach the patch. I will do according to the
> patch guideline and email the lucene mailing list.
> Thanks
> -John

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-532) [PATCH] Indexing on Hadoop distributed file system

2008-01-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558412#action_12558412
 ] 

Grant Ingersoll commented on LUCENE-532:


Anyone have a follow-up on this?  Seems like Hadoop-based indexing would be a 
nice feature.  It sounds like there was a lot of support for this, but it was 
never committed.  Is this still an issue?

> [PATCH] Indexing on Hadoop distributed file system
> --
>
> Key: LUCENE-532
> URL: https://issues.apache.org/jira/browse/LUCENE-532
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 1.9
>Reporter: Igor Bolotin
>Priority: Minor
> Attachments: cfs-patch.txt, indexOnDFS.patch, SegmentTermEnum.patch, 
> TermInfosWriter.patch
>
>
> In my current project we needed a way to create very large Lucene indexes on 
> Hadoop distributed file system. When we tried to do it directly on DFS using 
> Nutch FsDirectory class - we immediately found that indexing fails because 
> DfsIndexOutput.seek() method throws UnsupportedOperationException. The reason 
> for this behavior is clear - DFS does not support random updates and so 
> seek() method can't be supported (at least not easily).
>  
> Well, if we can't support random updates - the question is: do we really need 
> them? Search in the Lucene code revealed 2 places which call 
> IndexOutput.seek() method: one is in TermInfosWriter and another one in 
> CompoundFileWriter. As we weren't planning to use CompoundFileWriter - the 
> only place that concerned us was in TermInfosWriter.
>  
> TermInfosWriter uses IndexOutput.seek() in its close() method to write total 
> number of terms in the file back into the beginning of the file. It was very 
> simple to change file format a little bit and write number of terms into last 
> 8 bytes of the file instead of writing them into beginning of file. The only 
> other place that should be fixed in order for this to work is in 
> SegmentTermEnum constructor - to read this piece of information at position = 
> file length - 8.
>  
> With this format hack - we were able to use FsDirectory to write index 
> directly to DFS without any problems. Well - we still don't index directly to 
> DFS for performance reasons, but at least we can build small local indexes 
> and merge them into the main index on DFS without copying big main index back 
> and forth. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Resolved: (LUCENE-639) [PATCH] Slight performance improvement for readVInt() of IndexInput

2008-01-13 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved LUCENE-639.


Resolution: Won't Fix

The testing and discussion seem inconclusive on this and it hasn't been 
followed up on in quite some time, so I am going to mark it as won't fix.

> [PATCH] Slight performance improvement for readVInt() of IndexInput
> ---
>
> Key: LUCENE-639
> URL: https://issues.apache.org/jira/browse/LUCENE-639
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.0.0
>Reporter: Johan Stuyts
>Priority: Minor
> Attachments: Lucene2ReadVIntPerformance.patch, readVInt performance 
> results.pdf, ReadVIntPerformanceMain.java
>
>
> By unrolling the loop in readVInt() I was able to get a slight, about 1.8 %, 
> performance improvement for this method. The test program invoked the method 
> over 17 million times on each run.
> I ran the performance tests on:
> - Windows XP Pro SP2
> - Sun JDK 1.5.0_07
> - YourKit 5.5.4
> - Lucene trunk
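
For reference, a standalone sketch of what unrolling means here (decoding 
Lucene's VInt encoding -- 7 data bits per byte, high bit set means another byte 
follows -- from a byte array; this is an illustration, not the attached patch):

// Sketch of VInt decoding (7 data bits per byte, high bit = continuation).
public class VIntSketch {
  // Loop form, analogous to IndexInput.readVInt().
  static int readVIntLoop(byte[] buf, int pos) {
    byte b = buf[pos++];
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = buf[pos++];
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

  // Partially unrolled form: early-exit checks replace the loop bookkeeping.
  static int readVIntUnrolled(byte[] buf, int pos) {
    byte b = buf[pos++];
    int i = b & 0x7F;
    if ((b & 0x80) == 0) return i;
    b = buf[pos++];
    i |= (b & 0x7F) << 7;
    if ((b & 0x80) == 0) return i;
    b = buf[pos++];
    i |= (b & 0x7F) << 14;
    if ((b & 0x80) == 0) return i;
    b = buf[pos++];
    i |= (b & 0x7F) << 21;
    if ((b & 0x80) == 0) return i;
    b = buf[pos++];
    i |= (b & 0x7F) << 28;
    return i;
  }

  public static void main(String[] args) {
    byte[] encoded = { (byte) 0x96, 0x01 };            // 150 as a two-byte VInt
    System.out.println(readVIntLoop(encoded, 0));      // 150
    System.out.println(readVIntUnrolled(encoded, 0));  // 150
  }
}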

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1106) Clean up old JIRA issues in component "Index"

2008-01-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558410#action_12558410
 ] 

Grant Ingersoll commented on LUCENE-1106:
-

I've gone through LUCENE-602 above.

I think 602 can be marked "won't fix", but will wait to hear from Chuck on it.  


> Clean up old JIRA issues in component "Index"
> -
>
> Key: LUCENE-1106
> URL: https://issues.apache.org/jira/browse/LUCENE-1106
> Project: Lucene - Java
>  Issue Type: Task
>  Components: Index
>Reporter: Michael Busch
>Priority: Trivial
> Fix For: 2.3
>
>
> A list of all JIRA issues in component "Index" that haven't been updated in 
> 2007:
>*  LUCENE-737   Provision of encryption/decryption services API to 
> support Field.Store.Encrypted   
>*  LUCENE-705  CompoundFileWriter should pre-set its file length 
>*  LUCENE-685  Extract interface from IndexWriter 
>*  LUCENE-671  Hashtable based Document 
>*  LUCENE-652  Compressed fields should be "externalized" (from Fields 
> into Document) 
>*  LUCENE-639  [PATCH] Slight performance improvement for readVInt() 
> of IndexInput 
>*  LUCENE-606  Change behavior of ParallelReader.document(int) 
>*  LUCENE-602  [PATCH] Filtering tokens for position and term vector 
> storage 
>*  LUCENE-600  ParallelWriter companion to ParallelReader 
>*  LUCENE-570  Expose directory on IndexReader 
>*  LUCENE-552  NPE during mergeSegments 
>*  LUCENE-532  [PATCH] Indexing on Hadoop distributed file system 
>*  LUCENE-518  document field lengths count analyzer synonym overlays 
>*  LUCENE-517  norm compression breaks ranking for small fields 
>*  LUCENE-508  SegmentTermEnum.next() doesn't maintain prevBuffer at 
> end 
>*  LUCENE-506  Optimize Memory Use for Short-Lived Indexes (Do not 
> load TermInfoIndex if you know the queries ahead of time) 
>*  LUCENE-505  MultiReader.norm() takes up too much memory: norms 
> byte[] should be made into an Object 
>*  LUCENE-402  addition of a previous() method to TermEnum 
>*  LUCENE-401  [PATCH] fixes for gcj target. 
>*  LUCENE-382  [PATCH] Document update contrib (Play with term 
> postings or .. to a easy way to update) 
>*  LUCENE-362  [PATCH] Extension to binary Fields that allows fixed 
> byte buffer 
>*  LUCENE-336  [PATCH] Add ability to specify the segment name when 
> optimizing an index 
>*  LUCENE-325  [PATCH] new method expungeDeleted() added to 
> IndexWriter 
>*  LUCENE-211  [Patch] replace DocumentWriter with InvertedDocument 
> for performance 
>*  LUCENE-112  [PATCH] Add an IndexReader implementation that frees 
> resources when idle and refreshes itself when stale 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-602) [PATCH] Filtering tokens for position and term vector storage

2008-01-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558409#action_12558409
 ] 

Grant Ingersoll commented on LUCENE-602:


I think, if I understand the problem correctly, that the new TeeTokenFilter and 
SinkTokenizer could also solve this problem, right Chuck?

> [PATCH] Filtering tokens for position and term vector storage
> -
>
> Key: LUCENE-602
> URL: https://issues.apache.org/jira/browse/LUCENE-602
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.1
>Reporter: Chuck Williams
>Priority: Minor
> Attachments: TokenSelectorAllWithParallelWriter.patch, 
> TokenSelectorSoloAll.patch
>
>
> This patch provides a new TokenSelector mechanism to select tokens of 
> interest and creates two new IndexWriter configuration parameters:  
> termVectorTokenSelector and positionsTokenSelector.
> termVectorTokenSelector, if non-null, selects which index tokens will be 
> stored in term vectors.  If positionsTokenSelector is non-null, then any 
> tokens it rejects will have only their first position stored in each document 
> (it is necessary to store one position to keep the doc freq properly to avoid 
> the token being garbage collected in merges).
> This mechanism provides a simple solution to the problem of minimizing index 
> size overhead caused by storing extra tokens that facilitate queries, in those 
> cases where the mere existence of the extra tokens is sufficient.  For 
> example, in my test data using reverse tokens to speed prefix wildcard 
> matching, I obtained the following index overheads:
>   1.  With no TokenSelectors:  60% larger with reverse tokens than without
>   2.  With termVectorTokenSelector rejecting reverse tokens:  36% larger
>   3.  With both positionsTokenSelector and termVectorTokenSelector rejecting 
> reverse tokens:  25% larger
> It is possible to obtain the same effect by using a separate field that has 
> one occurrence of each reverse token and no term vectors, but this can be 
> hard or impossible to do and a performance problem as it requires either 
> rereading the content or storing all the tokens for subsequent processing.
> The solution with TokenSelectors is very easy to use and fast.
> Otis, thanks for leaving a comment in QueryParser.jj with the correct 
> production to enable prefix wildcards!  With this, it is a straightforward 
> matter to override the wildcard query factory method and use reverse tokens 
> effectively.
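
To make the mechanism concrete, a hedged sketch of what a selector might look 
like (hypothetical interface, class and marker character; the real TokenSelector 
API is defined by the attached patches): a predicate over terms that rejects, 
for example, the auxiliary reversed tokens added to speed prefix-wildcard 
matching.

// Hypothetical sketch only; the actual TokenSelector API is in the attached patches.
interface TokenSelectorSketch {
  boolean accept(String fieldName, String term);
}

class RejectReversedTokens implements TokenSelectorSketch {
  private static final char REVERSE_MARKER = '\u0001';  // assumed prefix marking reversed tokens

  public boolean accept(String fieldName, String term) {
    // Keep ordinary tokens; reject the reversed forms so they are not stored
    // in term vectors (or keep only their first position per document).
    return term.length() == 0 || term.charAt(0) != REVERSE_MARKER;
  }
}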

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1130) Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang

2008-01-13 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1130:
---

Attachment: LUCENE-1130.take2.patch

Attached take2 patch.

I created a few more disk-full threaded stress tests, whereby multiple
threads are indexing, at some point start hitting disk full, but keep
on trying to add docs for a while after the disk fills up.

This uncovered a number of sneaky thread safety issues with how
DocumentsWriter was handling exceptions, aborting, etc..

I've fixed them, and all tests pass.  I'll wait another day before
committing.


> Hitting disk full during DocumentWriter.ThreadState.init(...) can cause hang
> 
>
> Key: LUCENE-1130
> URL: https://issues.apache.org/jira/browse/LUCENE-1130
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 2.3
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 2.3
>
> Attachments: LUCENE-1130.patch, LUCENE-1130.take2.patch
>
>
> More testing of RC2 ...
> I found one case, if you hit disk full during init() in
> DocumentsWriter.ThreadState, when we first create the term vectors &
> fields writer, such that subsequent calls to
> IndexWriter.add/updateDocument will then hang forever.
> What's happening in this case is we are incrementing nextDocID even
> though we never call finishDocument (because we "thought" init did not
> succeed).  Then, when we finish the next document, it will never
> actually be written, because the finishDocument call for the skipped
> docID never happens.
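
A generic sketch of the pairing invariant at stake (hypothetical class, not the 
actual DocumentsWriter code): a docID should only be claimed once the 
per-document init has succeeded, because documents are flushed in docID order 
and a claimed-but-never-finished id leaves every later document waiting forever.

// Hypothetical illustration of the invariant; not DocumentsWriter code.
class InOrderWriterSketch {
  private int nextDocID;           // next id to hand out
  private int nextDocIDToFlush;    // documents must be flushed in id order

  synchronized int startDocument(Runnable init) {
    // Claim an id only after init has succeeded: if init throws, no id is
    // allocated, so later documents are not left waiting for a finishDocument
    // call that will never come.
    init.run();
    return nextDocID++;
  }

  synchronized void finishDocument(int docID) throws InterruptedException {
    while (docID != nextDocIDToFlush) {
      wait();                      // wait until all earlier docs have been flushed
    }
    // ... flush this document's buffered state here ...
    nextDocIDToFlush++;
    notifyAll();
  }
}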

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-652) Compressed fields should be "externalized" (from Fields into Document)

2008-01-13 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-652:
---

Fix Version/s: 2.9
  Description: 
Right now, as of 2.0 release, Lucene supports compressed stored fields.  
However, after discussion on java-dev, the suggestion arose, from Robert 
Engels, that it would be better if this logic were moved into the Document 
level.  This way the indexing level just stores opaque binary fields, and then 
Document handles compress/uncompressing as needed.

This approach would have prevented issues like LUCENE-629 because merging of 
segments would never need to decompress.

See this thread for the recent discussion:

http://www.gossamer-threads.com/lists/lucene/java-dev/38836

When we do this we should also work on related issue LUCENE-648.

  was:

Right now, as of 2.0 release, Lucene supports compressed stored fields.  
However, after discussion on java-dev, the suggestion arose, from Robert 
Engels, that it would be better if this logic were moved into the Document 
level.  This way the indexing level just stores opaque binary fields, and then 
Document handles compress/uncompressing as needed.

This approach would have prevented issues like LUCENE-629 because merging of 
segments would never need to decompress.

See this thread for the recent discussion:

http://www.gossamer-threads.com/lists/lucene/java-dev/38836

When we do this we should also work on related issue LUCENE-648.


> Compressed fields should be "externalized" (from Fields into Document)
> --
>
> Key: LUCENE-652
> URL: https://issues.apache.org/jira/browse/LUCENE-652
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 1.9, 2.0.0, 2.0.1, 2.1
>Reporter: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
>
> Right now, as of 2.0 release, Lucene supports compressed stored fields.  
> However, after discussion on java-dev, the suggestion arose, from Robert 
> Engels, that it would be better if this logic were moved into the Document 
> level.  This way the indexing level just stores opaque binary fields, and 
> then Document handles compress/uncompressing as needed.
> This approach would have prevented issues like LUCENE-629 because merging of 
> segments would never need to decompress.
> See this thread for the recent discussion:
> http://www.gossamer-threads.com/lists/lucene/java-dev/38836
> When we do this we should also work on related issue LUCENE-648.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-652) Compressed fields should be "externalized" (from Fields into Document)

2008-01-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558407#action_12558407
 ] 

Grant Ingersoll commented on LUCENE-652:


Implementing this would mean deprecating Field.Store.COMPRESS and the various 
other places that use/set bits marking a field as compressed.

Seems like a reasonable thing to do.  I will mark this as a 2.9 issue, so that 
we make sure we deprecate it at or before that time.

> Compressed fields should be "externalized" (from Fields into Document)
> --
>
> Key: LUCENE-652
> URL: https://issues.apache.org/jira/browse/LUCENE-652
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 1.9, 2.0.0, 2.0.1, 2.1
>Reporter: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
>
> Right now, as of 2.0 release, Lucene supports compressed stored fields.  
> However, after discussion on java-dev, the suggestion arose, from Robert 
> Engels, that it would be better if this logic were moved into the Document 
> level.  This way the indexing level just stores opaque binary fields, and 
> then Document handles compress/uncompressing as needed.
> This approach would have prevented issues like LUCENE-629 because merging of 
> segments would never need to decompress.
> See this thread for the recent discussion:
> http://www.gossamer-threads.com/lists/lucene/java-dev/38836
> When we do this we should also work on related issue LUCENE-648.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Closed: (LUCENE-648) Allow changing of ZIP compression level for compressed fields

2008-01-13 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll closed LUCENE-648.
--


> Allow changing of ZIP compression level for compressed fields
> -
>
> Key: LUCENE-648
> URL: https://issues.apache.org/jira/browse/LUCENE-648
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 1.9, 2.0.0, 2.0.1, 2.1
>Reporter: Michael McCandless
>Priority: Minor
>
> In response to this thread:
>   http://www.gossamer-threads.com/lists/lucene/java-user/38810
> I think we should allow changing the compression level used in the call to 
> java.util.zip.Deflator in FieldsWriter.java.  Right now it's hardwired to 
> "best":
>   compressor.setLevel(Deflater.BEST_COMPRESSION);
> Unfortunately, this can apparently cause the zip library to take a very long 
> time (10 minutes for 4.5 MB in the above thread) and so people may want to 
> change this setting.
> One approach would be to read the default from a Java system property, but, 
> it seems recently (pre 2.0 I think) there was an effort to not rely on Java 
> System properties (many were removed).
> A second approach would be to add static methods (and static class attr) to 
> globally set the compression level?
> A third method would be in document.Field class, eg a 
> setCompressLevel/getCompressLevel?  But then every time a document is created 
> with this field you'd have to call setCompressLevel since Lucene doesn't have 
> a global Field schema (like Solr).
> Any other ideas / preferences for any of these approaches?
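
For reference, the knob in question outside of Lucene (a standalone 
java.util.zip sketch, not a patch to FieldsWriter): the Deflater level trades 
compression ratio against speed, which is why a hardwired BEST_COMPRESSION can 
be very slow on large fields.

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Standalone sketch: compress a byte[] at a caller-chosen level instead of
// the hardwired Deflater.BEST_COMPRESSION.
public class CompressionLevelSketch {
  static byte[] compress(byte[] input, int level) {
    Deflater compressor = new Deflater(level);   // e.g. Deflater.BEST_SPEED .. Deflater.BEST_COMPRESSION
    compressor.setInput(input);
    compressor.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream(input.length);
    byte[] buf = new byte[4096];
    while (!compressor.finished()) {
      out.write(buf, 0, compressor.deflate(buf));
    }
    compressor.end();
    return out.toByteArray();
  }
}

Any of the proposed approaches (system property, static setter, or per-Field 
setting) ultimately just needs to feed a level such as Deflater.BEST_SPEED into 
this constructor instead of BEST_COMPRESSION.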

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-705) CompoundFileWriter should pre-set its file length

2008-01-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558406#action_12558406
 ] 

Grant Ingersoll commented on LUCENE-705:


This seems reasonable, although I am not an expert in low-level file system 
calls like this.  I guess for me the thing would be to find out whether the major 
filesystems support it (Windows, OSX, Linux), and then perhaps we can deal with 
others (i.e. those that don't support it) from there as they arise.

> CompoundFileWriter should pre-set its file length
> -
>
> Key: LUCENE-705
> URL: https://issues.apache.org/jira/browse/LUCENE-705
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.1
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
>
> I've read that if you are writing a large file, it's best to pre-set
> the size of the file in advance before you write all of its contents.
> This in general minimizes fragmentation and improves IO performance
> against the file in the future.
> I think this makes sense (intuitively) but I haven't done any real
> performance testing to verify.
> Java has the java.io.File.setLength() method (since 1.2) for this.
> We can easily fix CompoundFileWriter to call setLength() on the file
> it's writing (and add setLength() method to IndexOutput).  The
> CompoundFileWriter knows exactly how large its file will be.
> Another good thing is: if you are going to run out of disk space, then,
> the setLength call should fail up front instead of failing when the
> compound file is actually written.  This has two benefits: first, you
> find out sooner that you will run out of disk space, and, second, you
> don't fill up the disk down to 0 bytes left (always a frustrating
> experience!).  Instead you leave what space was available
> and throw an IOException.
> My one hesitation here is: what if out there there exists a filesystem
> that can't handle this call, and it throws an IOException on that
> platform?  But this is balanced against possible easy-win improvement
> in performance.
> Does anyone have any feedback / thoughts / experience relevant to
> this?
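
A minimal sketch of pre-setting a file's length before writing it (standalone; 
the standard-library call for this is java.io.RandomAccessFile.setLength(), and 
the hypothetical helper below is not the proposed IndexOutput change):

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: pre-set the final file length before writing its contents, so the
// filesystem can allocate the space up front (and fail early if the disk is full).
public class PresetLengthSketch {
  static void writeKnownSize(String path, byte[] contents) throws IOException {
    RandomAccessFile file = new RandomAccessFile(path, "rw");
    try {
      file.setLength(contents.length);  // may throw IOException immediately if space is short
      file.write(contents);
    } finally {
      file.close();
    }
  }
}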

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-685) Extract interface from IndexWriter

2008-01-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558402#action_12558402
 ] 

Grant Ingersoll commented on LUCENE-685:


Well, they are hard to maintain when you want to add a new method and end up 
breaking 500 users, who then flood the list because their implementations are 
broken by the interface change.  So, in general, we favor abstract classes.  If 
Lucene had a different policy on back-compatibility, then this could change.

At any rate, this issue is not for that discussion.  I'd be happy to visit it 
on java-dev, as I actually think it is an area Lucene could be better about.

> Extract interface from IndexWriter
> --
>
> Key: LUCENE-685
> URL: https://issues.apache.org/jira/browse/LUCENE-685
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: 2.0.0
>Reporter: Kenny MacLeod
>Priority: Minor
> Attachments: InterfaceIndexWriter.java
>
>
> org.apache.lucene.index.IndexWriter should probably implement an interface to 
> allow us to more easily write unit tests that use it.  As it stands, it's a 
> complex class that's hard to stub/mock.
> For example, an interface which had methods such as addDocument(), close() 
> and optimize().
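
A hedged sketch of the kind of interface the description asks for (hypothetical 
name and method set, not the attached InterfaceIndexWriter.java):

import java.io.IOException;
import org.apache.lucene.document.Document;

// Hypothetical narrow interface over IndexWriter so tests can stub or mock it;
// not the attached InterfaceIndexWriter.java.
interface SimpleIndexWriter {
  void addDocument(Document doc) throws IOException;
  void optimize() throws IOException;
  void close() throws IOException;
}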

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Reopened: (LUCENE-400) NGramFilter -- construct n-grams from a TokenStream

2008-01-13 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll reopened LUCENE-400:



Good catch, Steve.  I will reopen, as a word-based n-gram filter is useful.

> NGramFilter -- construct n-grams from a TokenStream
> ---
>
> Key: LUCENE-400
> URL: https://issues.apache.org/jira/browse/LUCENE-400
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: unspecified
> Environment: Operating System: All
> Platform: All
>Reporter: Sebastian Kirsch
>Priority: Minor
> Attachments: NGramAnalyzerWrapper.java, 
> NGramAnalyzerWrapperTest.java, NGramFilter.java, NGramFilterTest.java
>
>
> This filter constructs n-grams (token combinations up to a fixed size, 
> sometimes
> called "shingles") from a token stream.
> The filter sets start offsets, end offsets and position increments, so
> highlighting and phrase queries should work.
> Position increments > 1 in the input stream are replaced by filler tokens
> (tokens with termText "_" and endOffset - startOffset = 0) in the output
> n-grams. (Position increments > 1 in the input stream are usually caused by
> removing some tokens, eg. stopwords, from a stream.)
> The filter uses CircularFifoBuffer and UnboundedFifoBuffer from Apache
> Commons-Collections.
> Filter, test case and an analyzer are attached.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-893) Increase buffer sizes used during searching

2008-01-13 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558387#action_12558387
 ] 

Paul Elschot commented on LUCENE-893:
-

The last case is also the one in which allowing docs to be scored out of order 
(using BooleanScorer) is faster than using DisjunctionSumScorer. This option is 
already available, but it could have a bigger impact when term buffer sizes are 
chosen closer to optimal.

> Increase buffer sizes used during searching
> ---
>
> Key: LUCENE-893
> URL: https://issues.apache.org/jira/browse/LUCENE-893
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Michael McCandless
>
> Spinoff of LUCENE-888.
> In LUCENE-888 we increased buffer sizes that impact indexing and found
> substantial (10-18%) overall performance gains.
> It's very likely that we can also gain some performance for searching
> by increasing the read buffers in BufferedIndexInput used by
> searching.
> We need to test performance impact to verify and then pick a good
> overall default buffer size, also being careful not to add too much
> overall HEAP RAM usage because a potentially very large number of
> BufferedIndexInput instances are created during searching
> (# segments X # index files per segment).
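
A quick back-of-the-envelope sketch of that heap concern (illustrative numbers 
only, not measurements): the aggregate buffer RAM grows as segments x files per 
segment x buffer size.

// Illustrative arithmetic only: aggregate read-buffer RAM during searching.
public class BufferRamSketch {
  public static void main(String[] args) {
    int segments = 50;            // assumed number of segments in the index
    int filesPerSegment = 10;     // assumed index files per segment (non-compound)
    int bufferSize = 64 * 1024;   // candidate larger BufferedIndexInput buffer, in bytes

    long totalBytes = (long) segments * filesPerSegment * bufferSize;
    System.out.println(totalBytes / (1024 * 1024) + " MB of read buffers");  // ~31 MB
  }
}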

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-893) Increase buffer sizes used during searching

2008-01-13 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558386#action_12558386
 ] 

Paul Elschot commented on LUCENE-893:
-

I think the different results of 26 May 2007 for conjunction queries and 
disjunction queries may be caused by the use of TermScorer.skipTo() in 
conjunctions and TermScorer.next() in disjunctions.

That points to different optimal buffer sizes for conjunctions (smaller because 
of the skipping) and for disjunctions (larger because all postings are going to 
be needed).

LUCENE-430 is about reducing term buffer size for the case when the buffer is 
not going to be used completely because of the small number of documents 
containing the term.

In all, I think it makes sense to allow the  (conjunction/disjunction)Scorer to 
choose the maximum buffer size for the term, and let the term itself choose a 
lower value when it needs less than that.

Another way to promote sequential reading for disjunction queries is to process 
all their terms sequentially, i.e. one term at a time. In Lucene this is 
currently done by Filters for prefix queries and ranges. Unfortunately this 
cannot be done when the combined frequency of the terms in each document is 
needed. In that case DisjunctionSumScorer could be used, with larger buffers on 
the terms that are contained in many documents.

> Increase buffer sizes used during searching
> ---
>
> Key: LUCENE-893
> URL: https://issues.apache.org/jira/browse/LUCENE-893
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Michael McCandless
>
> Spinoff of LUCENE-888.
> In LUCENE-888 we increased buffer sizes that impact indexing and found
> substantial (10-18%) overall performance gains.
> It's very likely that we can also gain some performance for searching
> by increasing the read buffers in BufferedIndexInput used by
> searching.
> We need to test performance impact to verify and then pick a good
> overall default buffer size, also being careful not to add too much
> overall HEAP RAM usage because a potentially very large number of
> BufferedIndexInput instances are created during searching
> (# segments X # index files per segment).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]