[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
[ https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034640#comment-13034640 ] Earwin Burrfoot commented on LUCENE-3105: - Hmm.. Ok, it *is* still used, but that's gonna be fixed, mm? > String.intern() calls slow down IndexWriter.close() and IndexReader.open() > for index with large number of unique field names > > > Key: LUCENE-3105 > URL: https://issues.apache.org/jira/browse/LUCENE-3105 > Project: Lucene - Java > Issue Type: Bug > Components: core/index >Affects Versions: 3.1 >Reporter: Mark Kristensson > Attachments: LUCENE-3105.patch > > > We have one index with several hundred thousand unique field names (we're > optimistic that Lucene 4.0 is flexible enough to allow us to change our index > design...) and found that opening an index writer and closing an index reader > results in horribly slow performance on that one index. I have isolated the > problem down to the calls to String.intern() that are used to allow for quick > string comparisons of field names throughout Lucene. These String.intern() > calls are unnecessary and can be replaced with a hashmap lookup. In fact, > StringHelper.java has its own hashmap implementation that it uses in > conjunction with String.intern(). Rather than using a one-off hashmap, I've > elected to use a ConcurrentHashMap in this patch. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3105) String.intern() calls slow down IndexWriter.close() and IndexReader.open() for index with large number of unique field names
[ https://issues.apache.org/jira/browse/LUCENE-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034639#comment-13034639 ] Earwin Burrfoot commented on LUCENE-3105: - StringInterner is in fact faster than CHM. And is compatible with String.intern(), ie - it returns the same String instances. It also won't eat up memory if spammed with numerous unique strings (which is a strange feature, but people requested that). In Lucene 4.0 all of this is moot anyway, fields there are strongly separated and intern() is not used.
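For readers following along, the ConcurrentHashMap approach the patch describes can be sketched roughly as follows. This is an illustrative sketch, not the actual patch code or Lucene's StringInterner; the class name is made up. The key point is that `putIfAbsent` preserves the canonical-instance guarantee, so field names can still be compared with `==`:

```java
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch (not the LUCENE-3105 patch itself) of replacing String.intern()
// with a ConcurrentHashMap lookup while keeping the canonical-instance guarantee.
public class SimpleStringInterner {
    private final ConcurrentHashMap<String, String> map = new ConcurrentHashMap<String, String>();

    /** Returns a canonical instance of s, so callers may compare field names with ==. */
    public String intern(String s) {
        // putIfAbsent returns the previously mapped value, or null if s won the race;
        // either way every caller gets the same String instance back.
        String prev = map.putIfAbsent(s, s);
        return prev == null ? s : prev;
    }
}
```

Unlike `String.intern()`, this map grows without bound when fed unique strings, which is exactly the trade-off debated in the comments above.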
[jira] [Commented] (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033279#comment-13033279 ] Earwin Burrfoot commented on LUCENE-2793: - As mentioned @LUCENE-3092, it would be nice not to include the OneMerge, but some meaningful value like 'expectedSize', 'expectedSegmentSize' or whatnot, that would work both for merges *and* flushes, and also won't introduce a needless dependency on MergePolicy. > Directory createOutput and openInput should take an IOContext > - > > Key: LUCENE-2793 > URL: https://issues.apache.org/jira/browse/LUCENE-2793 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Michael McCandless >Assignee: Simon Willnauer > Labels: gsoc2011, lucene-gsoc-11, mentor > Attachments: LUCENE-2793.patch > > > Today for merging we pass down a larger readBufferSize than for searching > because we get better performance. > I think we should generalize this to a class (IOContext), which would hold > the buffer size, but then could hold other flags like DIRECT (bypass OS's > buffer cache), SEQUENTIAL, etc. > Then, we can make the DirectIOLinuxDirectory fully usable because we would > only use DIRECT/SEQUENTIAL during merging. > This will require fixing how IW pools readers, so that a reader opened for > merging is not then used for searching, and vice/versa. Really, it's only > all the open file handles that need to be different -- we could in theory > share del docs, norms, etc, if that were somehow possible.
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032997#comment-13032997 ] Earwin Burrfoot commented on LUCENE-3092: - bq. The IOCtx should reference the OneMerge (if in fact this file is being opened because of a merge)? IOCtx should have a value 'expectedSize', or 'priority', or something similar. This does not introduce a transitive dependency of Directory on MergePolicy (to please you once more - a true WTF), and it allows applying the same logic to flushes. Eg - all small flushes/merges go to cache, all big flushes/merges go straight to disk. > NRTCachingDirectory, to buffer small segments in a RAMDir > - > > Key: LUCENE-3092 > URL: https://issues.apache.org/jira/browse/LUCENE-3092 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Michael McCandless >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3092-listener.patch, LUCENE-3092.patch > > > I created this simple Directory impl, whose goal is to reduce IO > contention in a frequent-reopen NRT use case. > The idea is, when reopening quickly, but not indexing that much > content, you wind up with many small files created over time, that can > possibly stress the IO system eg if merges and searching are also > fighting for IO. > So, NRTCachingDirectory puts these newly created files into a RAMDir, > and only when they are merged into a too-large segment, does it then > write-through to the real (delegate) directory. > This lets you spend some RAM to reduce IO.
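The 'expectedSize instead of OneMerge' idea might look something like this. This is a hypothetical API sketch, not the IOContext that was eventually committed; all names here are illustrative:

```java
// Hypothetical sketch of an IO context carrying an expected size instead of a
// MergePolicy.OneMerge reference. Class, enum, and field names are made up.
public class SizeHintContext {
    public enum Source { FLUSH, MERGE, READ }

    public final Source source;
    public final long expectedSize; // bytes the caller expects to write

    public SizeHintContext(Source source, long expectedSize) {
        this.source = source;
        this.expectedSize = expectedSize;
    }

    // A caching directory can then route small flushes AND small merges to RAM
    // with a single rule, without ever knowing about MergePolicy.
    public boolean fitsInCache(long maxCachedBytes) {
        return expectedSize <= maxCachedBytes;
    }
}
```

With this shape, the "small goes to cache, big goes to disk" decision in the comment above is one comparison, identical for flushes and merges.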
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032989#comment-13032989 ] Earwin Burrfoot commented on LUCENE-3092: - bq. but I couldn't disagree more that this is an issue with an Event model There are no issues with the event model itself. It's just that this model is badly suited to this issue's use case. Event listeners are good. Using them to emulate what is essentially a mutex is ugly and fragile as hell. bq. We have a series of components in Lucene; Directories, IndexWriter, MergeScheduler etc, and we have some crosscutting concerns such as merges themselves. My point is that many concerns shouldn't necessarily be crosscutting. Eg - Directory can support IO priorities/throttling, so it doesn't have to know about merges or flushes. Many OSes have special APIs that allow IO prioritization; do they know about merges, or Lucene at all? No.
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032936#comment-13032936 ] Earwin Burrfoot commented on LUCENE-3092: - Chris, I don't like the idea of expanding IOContext again and again, but this case seems in line with its intended purpose - give the Directory implementation hints as to what we're going to do with it. I don't like events either. They look fragile, and binding them to threads is a WTF. With all our pausing/unpausing magic there's no guarantee a merge will end on the same thread it started on. bq. Stuff like FlushPolicy could take information about concurrent merges and hold off flushes for a little while if memory allows it etc. Coordinating access to a shared resource (the IO subsystem) with events is very awkward. Ok, your FlushPolicy receives events from MergePolicy and holds flushes during a merge. _Now, when a flush is in progress, should FlushPolicy notify MergePolicy so it can hold its merges?_ It goes downhill from there. What if FP and MP fire events simultaneously? :) What should other listeners do? Try looking at the bigger picture. Merges are not your problem. Neither are flushes. Your problem is that several threads try to take their dump on disk simultaneously (for whatever reason, you don't really care). So what we need is an arbitration mechanism for Directory writes. A mechanism located presumably @ Directory level (eg, we don't need to throttle anything when writing to a RAMDir). One possible implementation is that we add a constructor parameter to FSDirectory specifying the desired level of IO parallelism, and then it keeps track of its IndexOutputs and stalls writes selectively. We can also add 'expectedWriteSize' to IOContext, so the Directory may favor shorter writes over bigger ones. Instead of 'expectedWriteSize' we can use 'priority'.
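The directory-level write arbitration suggested above can be sketched with a plain permit pool. This is a hypothetical helper, not an existing Lucene class; in the sketch, the directory would own the throttle and each IndexOutput would bracket its large writes with acquire/release:

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of directory-level write arbitration: a fixed permit
// pool caps how many threads may dump to disk at once, regardless of whether
// the writer is a flush or a merge. Names are illustrative.
public class WriteThrottle {
    private final Semaphore permits;

    public WriteThrottle(int ioParallelism) {
        // fair ordering, so long-queued writers (e.g. merges) aren't starved
        this.permits = new Semaphore(ioParallelism, true);
    }

    public void beginWrite() throws InterruptedException {
        permits.acquire(); // blocks when ioParallelism writers are already active
    }

    public void endWrite() {
        permits.release();
    }

    public int available() {
        return permits.availablePermits();
    }
}
```

An 'expectedWriteSize' or 'priority' hint could then be layered on top, e.g. by letting small writes bypass the throttle entirely.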
[jira] [Commented] (LUCENE-3092) NRTCachingDirectory, to buffer small segments in a RAMDir
[ https://issues.apache.org/jira/browse/LUCENE-3092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032841#comment-13032841 ] Earwin Burrfoot commented on LUCENE-3092: - *highfive Uwe* was going to suggest the very same thing. IOContext can include expected size. NRTCD will do its magic, other dirs may prealloc.
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032099#comment-13032099 ] Earwin Burrfoot commented on LUCENE-3084: - bq. Merges are ordered Hmm.. Why should they be? bq. SegmentInfos itself must be list It may contain a list as a field instead. And have a much cleaner API as a consequence. On another note, I wonder, is the fact that Vector is internally synchronized used somewhere within SegmentInfos client code? > MergePolicy.OneMerge.segments should be List not SegmentInfos > -- > > Key: LUCENE-3084 > URL: https://issues.apache.org/jira/browse/LUCENE-3084 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3084-trunk-only.patch, > LUCENE-3084-trunk-only.patch, LUCENE-3084.patch > > > SegmentInfos carries a bunch of fields beyond the list of SI, but for merging > purposes these fields are unused. > We should cutover to List instead.
[jira] [Commented] (LUCENE-3084) MergePolicy.OneMerge.segments should be List not SegmentInfos
[ https://issues.apache.org/jira/browse/LUCENE-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032046#comment-13032046 ] Earwin Burrfoot commented on LUCENE-3084: - * Speaking logically, merges operate on Sets of SIs, not List? * Let's stop subclassing random things? : ) SIS can contain a List of SIs (and maybe a Set, or whatever we need in the future), and only expose operations its clients really need.
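The composition-over-inheritance shape proposed here (a container holding a List as a field instead of extending Vector) is simple to sketch. Illustrative only; `String` stands in for SegmentInfo, and the class name is made up:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of composition over inheritance: the container exposes only the
// operations its clients need, rather than inheriting all of Vector's API
// (and its per-call synchronization).
public class SegmentList {
    private final List<String> segments = new ArrayList<String>(); // String stands in for SegmentInfo

    public void add(String si) {
        segments.add(si);
    }

    /** Read-only view, so clients can iterate but not mutate the internal list. */
    public List<String> asList() {
        return Collections.unmodifiableList(segments);
    }
}
```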
[jira] [Commented] (LUCENE-3077) DWPT doesn't see changes to DW#infoStream
[ https://issues.apache.org/jira/browse/LUCENE-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029881#comment-13029881 ] Earwin Burrfoot commented on LUCENE-3077: - We should just make it final everywhere ... > DWPT doesn't see changes to DW#infoStream > - > > Key: LUCENE-3077 > URL: https://issues.apache.org/jira/browse/LUCENE-3077 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 4.0 >Reporter: Simon Willnauer >Priority: Minor > Fix For: 4.0 > > > DW does not push infostream changes to DWPT since DWPT#infoStream is final > and initialized on DWPTPool initialization (at least for initial DWPT) we > should push changes to infostream to DWPT too
[jira] [Commented] (LUCENE-3065) NumericField should be stored in binary format in index (matching Solr's format)
[ https://issues.apache.org/jira/browse/LUCENE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029421#comment-13029421 ] Earwin Burrfoot commented on LUCENE-3065: - It's sad NumericFields are hardbaked into the index format. Eg - I have some fields that are similar to Numeric in that they are 'stringified' binary structures, and they can't become first-class in the same manner as Numeric. > NumericField should be stored in binary format in index (matching Solr's > format) > > > Key: LUCENE-3065 > URL: https://issues.apache.org/jira/browse/LUCENE-3065 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Uwe Schindler >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, > LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch, LUCENE-3065.patch > > > (Spinoff of LUCENE-3001) > Today when writing stored fields we don't record that the field was a > NumericField, and so at IndexReader time you get back an "ordinary" Field and > your number has turned into a string. See > https://issues.apache.org/jira/browse/LUCENE-1701?focusedCommentId=12721972&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12721972 > We have spare bits already in stored fields, so, we should use one to record > that the field is numeric, and then encode the numeric field in Solr's > more-compact binary format. > A nice side-effect is we fix the long standing issue that you don't get a > NumericField back when loading your document.
[jira] [Commented] (LUCENE-2904) non-contiguous LogMergePolicy should be careful to not select merges already running
[ https://issues.apache.org/jira/browse/LUCENE-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029408#comment-13029408 ] Earwin Burrfoot commented on LUCENE-2904: - Ok, I'm wrong. We need both a list of all SIs and the eligible SIs for calculations. But that should be handled through an API change, not a new public method on IW. > non-contiguous LogMergePolicy should be careful to not select merges already > running > > > Key: LUCENE-2904 > URL: https://issues.apache.org/jira/browse/LUCENE-2904 > Project: Lucene - Java > Issue Type: Bug >Reporter: Michael McCandless >Assignee: Michael McCandless >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-2904.patch > > > Now that LogMP can do non-contiguous merges, the fact that it disregards > which segments are already being merged is more problematic since it could > result in it returning conflicting merges and thus failing to run multiple > merges concurrently.
[jira] [Commented] (LUCENE-2904) non-contiguous LogMergePolicy should be careful to not select merges already running
[ https://issues.apache.org/jira/browse/LUCENE-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029403#comment-13029403 ] Earwin Burrfoot commented on LUCENE-2904: - I think we should simply change the API for MergePolicy. Instead of SegmentInfos it should accept a Set with SIs eligible for merging (eg, completely written & not elected for another merge). IW.getMergingSegments() is a damn cheat, and the "Expert" notice is not an excuse! :) Why should each and every MP do the set subtraction when IW can do it for them once and for all?
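The subtraction the writer would do once for every policy is one line of set arithmetic. A generic sketch, not IndexWriter code; names are illustrative:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of the proposed contract: the writer computes the eligible set once
// (all segments minus those already elected for a merge) and hands only that
// set to the MergePolicy, so no policy has to do the subtraction itself.
public final class MergeEligibility {
    private MergeEligibility() {}

    public static <T> Set<T> eligible(Set<T> allSegments, Set<T> merging) {
        // LinkedHashSet preserves segment order, which some policies care about
        Set<T> result = new LinkedHashSet<T>(allSegments);
        result.removeAll(merging);
        return result;
    }
}
```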
[jira] [Commented] (LUCENE-3061) Open IndexWriter API to allow custom MergeScheduler implementation
[ https://issues.apache.org/jira/browse/LUCENE-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027626#comment-13027626 ] Earwin Burrfoot commented on LUCENE-3061: - Mark these as @experimental? > Open IndexWriter API to allow custom MergeScheduler implementation > -- > > Key: LUCENE-3061 > URL: https://issues.apache.org/jira/browse/LUCENE-3061 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3061.patch, LUCENE-3061.patch > > > IndexWriter's getNextMerge() and merge(OneMerge) are package-private, which > makes it impossible for someone to implement his own MergeScheduler. We > should open up these API, as well as any other that can be useful for custom > MS implementations.
[jira] [Issue Comment Edited] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612 ] Earwin Burrfoot edited comment on LUCENE-3041 at 5/2/11 10:30 AM: -- The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM? Same can be said for tests. What about throwing the original invocation exception instead of the wrapper? Since we're emulating a language feature, a simple method call, it's logical to only throw custom exceptions in .. well .. exceptional cases, like ambiguity/no matching method. If client code throws Errors/RuntimeExceptions, they should be transparently rethrown. was (Author: earwin): The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM? > Support Query Visting / Walking > --- > > Key: LUCENE-3041 > URL: https://issues.apache.org/jira/browse/LUCENE-3041 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 4.0 >Reporter: Chris Male >Assignee: Simon Willnauer >Priority: Minor > Fix For: 4.0 > > Attachments: LUCENE-3041.patch, LUCENE-3041.patch, LUCENE-3041.patch, > LUCENE-3041.patch, LUCENE-3041.patch > > > Out of the discussion in LUCENE-2868, it could be useful to add a generic > Query Visitor / Walker that could be used for more advanced rewriting, > optimizations or anything that requires state to be stored as each Query is > visited. > We could keep the interface very simple: > {code} > public interface QueryVisitor { > Query visit(Query query); > } > {code} > and then use a reflection based visitor like Earwin suggested, which would > allow implementators to provide visit methods for just Querys that they are > interested in.
[jira] [Commented] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027612#comment-13027612 ] Earwin Burrfoot commented on LUCENE-3041: - The static cache is now not threadsafe. And the original had nice diagnostics for ambiguous dispatches. Why not just take it and cut over to JDK reflection and CHM?
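A minimal version of "JDK reflection plus CHM" could look like the sketch below. This is an illustrative sketch under those assumptions, not the patch or the linked gist: the per-node-class method lookup is cached in a ConcurrentHashMap so concurrent visits never corrupt the cache, and the original invocation exception is unwrapped rather than surfaced as a reflection wrapper:

```java
import java.lang.reflect.Method;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a reflection-based single-argument dispatch with a thread-safe
// method cache. Class and method names are illustrative, not the patch's API.
public class ReflectiveVisitor {
    // keyed by the node's runtime class; safe under concurrent dispatch
    private final ConcurrentHashMap<Class<?>, Method> cache = new ConcurrentHashMap<>();

    public Object dispatch(Object node) {
        Method m = cache.computeIfAbsent(node.getClass(), this::findVisit);
        try {
            return m.invoke(this, node);
        } catch (ReflectiveOperationException e) {
            // rethrow the client code's original exception transparently
            Throwable cause = e.getCause();
            throw new RuntimeException(cause != null ? cause : e);
        }
    }

    private Method findVisit(Class<?> nodeClass) {
        // walk up the node's class hierarchy to the most specific visit() overload
        for (Class<?> c = nodeClass; c != null; c = c.getSuperclass()) {
            try {
                return getClass().getMethod("visit", c);
            } catch (NoSuchMethodException ignored) {
                // no overload for this exact type; try the superclass
            }
        }
        throw new IllegalArgumentException("no visit() method for " + nodeClass);
    }

    // Example visitor: handles one node type specially, everything else generically.
    public static class TermCounter extends ReflectiveVisitor {
        public String visit(String term) { return "term:" + term; }
        public Object visit(Object other) { return "other"; }
    }
}
```

A production version would also detect ambiguous overloads (e.g. two applicable interface matches) and report them with the kind of diagnostics the comment asks for; this sketch only walks superclasses.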
[jira] [Commented] (LUCENE-3055) LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers
[ https://issues.apache.org/jira/browse/LUCENE-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027361#comment-13027361 ] Earwin Burrfoot commented on LUCENE-3055: - Could anyone remind me, why the hell do we still have Analyzer.tokenStream AND reusableTokenStream rampaging around and confusing minds? We always recommend to use the latter, Robert just fixed some of the core classes to use the latter. Also, if reusableTokenStream is the only method left standing, isn't it wise to hide actual reuse somewhere in Lucene internals and turn Analyzer into a plain and dumb factory interface? > LUCENE-2372, LUCENE-2389 made it impossible to subclass core analyzers > -- > > Key: LUCENE-3055 > URL: https://issues.apache.org/jira/browse/LUCENE-3055 > Project: Lucene - Java > Issue Type: Bug > Components: Analysis >Affects Versions: 3.1 >Reporter: Ian Soboroff > > LUCENE-2372 and LUCENE-2389 marked all analyzers as final. This makes > ReusableAnalyzerBase useless, and makes it impossible to subclass e.g. > StandardAnalyzer to make a small modification e.g. to tokenStream(). These > issues don't indicate a new method of doing this. The issues don't give a > reason except for design considerations, which seems a poor reason to make a > backward-incompatible change
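Hiding the reuse in the internals, as the comment above suggests, could be as simple as a per-thread cache wrapped around a plain factory. A hypothetical sketch, not Lucene code; `StringBuilder` merely stands in for a reusable TokenStream:

```java
import java.util.function.Supplier;

// Hypothetical sketch of "hide the reuse in the internals": the framework keeps
// one cached instance per thread, so the user-facing type shrinks to a plain,
// dumb factory (the Supplier). Names are illustrative.
public class PerThreadReuse<T> {
    private final ThreadLocal<T> cached;

    public PerThreadReuse(Supplier<T> factory) {
        this.cached = ThreadLocal.withInitial(factory);
    }

    public T get() {
        return cached.get(); // same instance on every call from the same thread
    }
}
```

The user then only writes the factory; whether and how instances are reused becomes the framework's business.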
[jira] [Commented] (LUCENE-3041) Support Query Visting / Walking
[ https://issues.apache.org/jira/browse/LUCENE-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027080#comment-13027080 ] Earwin Burrfoot commented on LUCENE-3041: - I vehemently oppose introducing the "visitor design pattern" (classic double-dispatch version) into the Query API. It is a badly broken replacement (ie, cannot be easily extended) for multiple dispatch. Also, from the looks of it (short IRC discussion), user-written visitors and the rewrite() API have totally different aims. - rewrite() is very specific (it is a pre-search preparation that produces a runnable query, eg expands multi-term queries into OR sequences or wrapped filters), but should work over any kinds of user-written Queries with possibly exotic behaviours (eg, take the rewrite from a cache). Consequently, the logic is tightly coupled to each Query-impl's innards. - user-written visitors, on the other hand, may have a multitude of purposes (wildly varying logic for node handling + navigation - eg, some may want to see MTQs expanded, and some may not) over a relatively fixed number of possible node types. So the best possible solution so far is to keep rewrite() as-is - it serves its purpose quite well. And introduce a generic reflection-based multiple-dispatch visitor that can walk any kind of hierarchies (eg, in my project I rewrite ASTs to ASTs, ASTs to Queries, and Queries to bags of Terms) so people can transform their query trees. The current patch contains a derivative of [my original version|https://gist.github.com/dfebaf79f5524e6ea8b4]. And here's a [test/example|https://gist.github.com/e5eb67d762be0bce8d28] This visitor keeps all logic on itself and thus cannot replace rewrite().
[jira] [Commented] (LUCENE-2571) Indexing performance tests with realtime branch
[ https://issues.apache.org/jira/browse/LUCENE-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020217#comment-13020217 ] Earwin Burrfoot commented on LUCENE-2571: - bq. Merges are NOT blocking indexing on trunk no matter which MP you use. Well.. merges tie up IO (especially if you're not on fancy SSDs/RAIDs), which in turn slows flushes -> bigger delays for stop-the-world flushes / a lower bandwidth cap (after which they are forced to stop the world) for parallel flushes. So Lance's point is partially valid. > Indexing performance tests with realtime branch > --- > > Key: LUCENE-2571 > URL: https://issues.apache.org/jira/browse/LUCENE-2571 > Project: Lucene - Java > Issue Type: Task > Components: Index >Reporter: Michael Busch >Priority: Minor > Fix For: Realtime Branch > > Attachments: wikimedium.realtime.Standard.nd10M_dps.png, > wikimedium.realtime.Standard.nd10M_dps_addDocuments.png, > wikimedium.realtime.Standard.nd10M_dps_addDocuments_flush.png, > wikimedium.trunk.Standard.nd10M_dps.png, > wikimedium.trunk.Standard.nd10M_dps_addDocuments.png > > > We should run indexing performance tests with the DWPT changes and compare to > trunk. > We need to test both single-threaded and multi-threaded performance. > NOTE: flush by RAM isn't implemented just yet, so either we wait with the > tests or flush by doc count. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2981) Review and potentially remove unused/unsupported Contribs
[ https://issues.apache.org/jira/browse/LUCENE-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014108#comment-13014108 ] Earwin Burrfoot commented on LUCENE-2981: - Bye-bye, DB. Few things can compete with it in pointlessness. > Review and potentially remove unused/unsupported Contribs > - > > Key: LUCENE-2981 > URL: https://issues.apache.org/jira/browse/LUCENE-2981 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Grant Ingersoll > Fix For: 3.2, 4.0 > > Attachments: LUCENE-2981.patch > > > Some of our contribs appear to be lacking for development/support or are > missing tests. We should review whether they are even pertinent these days > and potentially deprecate and remove them. > One of the things we did in Mahout when bringing in Colt code was to mark all > code that didn't have tests as @deprecated and then we removed the > deprecation once tests were added. Those that didn't get tests added over > about a 6 mos. period of time were removed. > I would suggest taking a hard look at: > ant > db > lucli > swing > (spatial should be gutted to some extent and moved to modules) -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007136#comment-13007136 ] Earwin Burrfoot commented on LUCENE-2960: - You avoid deprecation/undeprecation and binary incompatibility, while incompatibly changing semantics. What do you win? > Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter > -- > > Key: LUCENE-2960 > URL: https://issues.apache.org/jira/browse/LUCENE-2960 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon >Priority: Blocker > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2960.patch > > > In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. > It would be great to be able to control that on a live IndexWriter. Other > possible two methods that would be great to bring back are > setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other > setters can actually be set on the MergePolicy itself, so no need for setters > for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007048#comment-13007048 ] Earwin Burrfoot commented on LUCENE-2960: - bq. Oh yeah. But then we'd clone the full IWC on every set... this seems like overkill in the name of "purity". So what? What exactly is overkill? Few wasted bytes and CPU ns for an object that's created a couple of times during application lifetime? There are also builders, which are very similar to what Steven is proposing. bq. Another thought is to offer all settings on the IWC for init convenience and exposure and then add javadoc about updaters on IW for those settings that can be changed on the fly That's exactly how I'd like to see it. > Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter > -- > > Key: LUCENE-2960 > URL: https://issues.apache.org/jira/browse/LUCENE-2960 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon >Priority: Blocker > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2960.patch > > > In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. > It would be great to be able to control that on a live IndexWriter. Other > possible two methods that would be great to bring back are > setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other > setters can actually be set on the MergePolicy itself, so no need for setters > for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006759#comment-13006759 ] Earwin Burrfoot commented on LUCENE-2960: - bq. infoStream is a PrintStream, which synchronizes anyway, so it should be safe to omit the volatile You're absolutely right here. bq. Yet, no "real" Java impl out there will ever do this since doing so will simply make that Java impl appear buggy. Sorry, but "real" Java impls do this. The case with endless get() happened on a map that was never modified after being created and set. Just one of the many JVM instances on many machines got unlucky after restart. bq. Well, and, it'd be bad for perf. – obviously the Java impl, CPU cache levels, should cache only frequently used things Java impls don't cache things. They do reorderings, they also keep final fields on registers, omitting reloads that happen for non-final ones, but no caching in JMM-related cases. Caching here is done by CPU, and it caches all data read from memory. bq. IWC cannot be made immutable – you build it up incrementally (new IWC(...).setThis(...).setThat(...)). Its fields cannot be final. Setters can return modified immutable copy of 'this'. So you get both incremental building and immutability. bq. How about this as a compromise: IW continues cloning the incoming IWC on init, as it does today. This means any changes to the IWC instance you passed to IW will have no effect on IW. What about earlier compromise mentioned by Shay, Mark, me? Keep setters for 'live' properties on IW. This clearly draws the line, and you don't have to consult Javadocs for each and every setting to know if you can change it live or not. 
> Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter > -- > > Key: LUCENE-2960 > URL: https://issues.apache.org/jira/browse/LUCENE-2960 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon >Priority: Blocker > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2960.patch > > > In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. > It would be great to be able to control that on a live IndexWriter. Other > possible two methods that would be great to bring back are > setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other > setters can actually be set on the MergePolicy itself, so no need for setters > for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
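The "setters can return a modified immutable copy of 'this'" idea from the comment above can be sketched like this. The Config class and its fields are hypothetical, not Lucene's IndexWriterConfig:

```java
// Hypothetical immutable config: every field is final, and each "setter"
// returns a modified copy instead of mutating this. You keep the fluent,
// incremental style new Config().withRamBufferMB(64).withMaxThreads(4),
// but a published instance can never change under a reader's feet, and
// final fields get safe-publication guarantees from the JMM.
public final class Config {
    private final double ramBufferMB;
    private final int maxThreads;

    public Config() { this(16.0, 1); }
    private Config(double ramBufferMB, int maxThreads) {
        this.ramBufferMB = ramBufferMB;
        this.maxThreads = maxThreads;
    }

    // Each with-method allocates a fresh Config; the original is untouched.
    public Config withRamBufferMB(double mb) { return new Config(mb, maxThreads); }
    public Config withMaxThreads(int n) { return new Config(ramBufferMB, n); }

    public double ramBufferMB() { return ramBufferMB; }
    public int maxThreads() { return maxThreads; }
}
```

The cost is exactly the "few wasted bytes and CPU ns" mentioned above: one small allocation per setter call on an object created a handful of times per application lifetime.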
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006227#comment-13006227 ] Earwin Burrfoot commented on LUCENE-2960: - {quote} Why such purity? What do we gain? I'm all for purity, but only if it doesn't interfere w/ functionality. Here, it's taking away freedom... {quote} We gain consistency and predictability. And there are a lot of freedoms that are dangerous for developers. {quote} In fact it should be fine to share an IWC across multiple writers; you can change the RAM buffer for all of them at once. {quote} You've brought up a purrfect example of how NOT to do things. This is called 'action at a distance' and is a damn bug. A very annoying one. I've thoroughly experienced it with a previous major version of Apache HTTPClient - they had an API that suggested you could set per-request timeouts, while these were actually global for a single Client instance. I fried my brain trying to understand why the hell random user requests timed out at a hundred times their intended duration. Oh! It was an occasional admin request changing the global. You know, you can actually instantiate some DateRangeFilter with a couple of Dates, and then change these Dates (they are writeable) before each request. Isn't it an exciting kind of programming freedom? Or, back to our current discussion - we can pass RAMBufferSizeMB as an AtomicDouble, instead of the current double, then we can use .set() on an instance we passed, and have our live reconfigurability. What's more - AtomicDouble protects us from word tearing! bq. I doubt there's any JVM out there where our lack-of-volatile infoStream causes any problems. Er.. While I have never personally witnessed unsynchronized long/double tearing, I've seen the consequence of unsafely publishing a HashMap - an endless loop on get(). It happened on your run-of-the-mill Sun 1.6 JVM. So the bug is there, lying in wait. 
Maybe nobody ever actually used the freedom to change infoStream in-flight, or the guy was lucky, or in his particular situation the field was guarded by some unrelated sync. While I see banishing live reconfiguration from IW as a lost cause, I ask to make IWC immutable at the very least. As Shay said - this will provide a clear barrier between mutable and immutable properties. > Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter > -- > > Key: LUCENE-2960 > URL: https://issues.apache.org/jira/browse/LUCENE-2960 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon >Priority: Blocker > Fix For: 3.1, 4.0 > > > In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. > It would be great to be able to control that on a live IndexWriter. Other > possible two methods that would be great to bring back are > setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other > setters can actually be set on the MergePolicy itself, so no need for setters > for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
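The unsafe-publication hazard discussed above can be illustrated with a small sketch. The Holder class and its field names are hypothetical (this is not Lucene code); the point is the volatile write/read pair:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder illustrating the hazard. If 'settings' were a plain
// (non-volatile) field, a reader thread could observe the reference before
// the HashMap's internal tables were fully written - on a real JVM this can
// show up as missing entries or even an endless loop inside get(). The
// volatile write/read pair establishes a happens-before edge, so a reader
// can only ever see a fully built map.
public class Holder {
    private volatile Map<String, String> settings = new HashMap<>();

    public void reconfigure(Map<String, String> newSettings) {
        // Build a fresh map, then publish it in a single volatile write;
        // never mutate a map that has already been published.
        settings = new HashMap<>(newSettings);
    }

    public String get(String key) {
        return settings.get(key);
    }
}
```

The race itself cannot be demonstrated deterministically in a unit test, which is precisely why such bugs "lie in wait" - the code above shows only the safe pattern.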
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005891#comment-13005891 ] Earwin Burrfoot commented on LUCENE-2960: - bq. Furthermore, closing the IW also forces you to commit, and I don't like tying changing of configuration to forcing a commit. Like I said, one isn't going to change his configuration five times a second. It's ok to commit from time to time? bq. So why should we force it to be unchangeable? That can only remove freedom, freedom that is perhaps valuable to an app somewhere. Each and every live reconfigurable setting adds to complexity. At the very least it requires proper synchronization. Take your SegmentWarmer example - you should make the field volatile. While it's possible to chicken out on primitive fields ([except long/double|http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.7]), as Yonik mentioned earlier, making nonvolatile mutable references introduces you to a world of hard-to-catch unsafe publication bugs (yes, infoStream is currently broken!). For more complex cases, certain on-change logic is required. And then you have to support this logic across all possible code rewrites and refactorings. > Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter > -- > > Key: LUCENE-2960 > URL: https://issues.apache.org/jira/browse/LUCENE-2960 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon >Priority: Blocker > Fix For: 3.1, 4.0 > > > In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. > It would be great to be able to control that on a live IndexWriter. Other > possible two methods that would be great to bring back are > setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other > setters can actually be set on the MergePolicy itself, so no need for setters > for those (I think). -- This message is automatically generated by JIRA. 
For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2960) Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005617#comment-13005617 ] Earwin Burrfoot commented on LUCENE-2960: - As I said on the list - if one needs to change IW config, he can always recreate IW with new settings. Such changes cannot happen often enough for recreation to affect indexing performance. The fact that you can change IW's behaviour post-construction by modifying unrelated IWC instance is frightening. IW should either make a private copy of IWC when constructing, or IWC should be made immutable. > Allow (or bring back) the ability to setRAMBufferSizeMB on an open IndexWriter > -- > > Key: LUCENE-2960 > URL: https://issues.apache.org/jira/browse/LUCENE-2960 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shay Banon >Priority: Blocker > Fix For: 3.1, 4.0 > > > In 3.1 the ability to setRAMBufferSizeMB is deprecated, and removed in trunk. > It would be great to be able to control that on a live IndexWriter. Other > possible two methods that would be great to bring back are > setTermIndexInterval and setReaderTermsIndexDivisor. Most of the other > setters can actually be set on the MergePolicy itself, so no need for setters > for those (I think). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2908) clean up serialization in the codebase
[ https://issues.apache.org/jira/browse/LUCENE-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994769#comment-12994769 ] Earwin Burrfoot commented on LUCENE-2908: - Oh, damn :) On my project, we specifically use java-serialization to pass configured Queries/Filters between cluster nodes, as it saves us HEAPS of wrapping/unwrapping them into some parallel serializable classes. > clean up serialization in the codebase > -- > > Key: LUCENE-2908 > URL: https://issues.apache.org/jira/browse/LUCENE-2908 > Project: Lucene - Java > Issue Type: Task >Reporter: Robert Muir >Assignee: Robert Muir > Fix For: 4.0 > > Attachments: LUCENE-2908.patch > > > We removed contrib/remote, but forgot to cleanup serialization hell > everywhere. > this is no longer needed, never really worked (e.g. across versions), and > slows > development (e.g. i wasted a long time debugging stupid serialization of > Similarity.idfExplain when trying to make a patch for the scoring system). -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
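The pattern described above - shipping a configured, Serializable object graph between cluster nodes instead of mapping it onto parallel serializable DTO classes - looks roughly like this. QuerySpec is a hypothetical stand-in for a real Query/Filter tree; note Robert's caveat in the issue that Java serialization never really worked across Lucene versions, so this only holds when all nodes run the same version:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class Wire {
    // Hypothetical query description; a real use would serialize the
    // configured query objects themselves.
    static class QuerySpec implements Serializable {
        private static final long serialVersionUID = 1L;
        final String field, text;
        QuerySpec(String field, String text) { this.field = field; this.text = text; }
    }

    // Serialize any object graph to bytes for the network.
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return bos.toByteArray();
    }

    // Reconstruct the graph on the receiving node.
    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }
}
```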
[jira] Commented: (LUCENE-2871) Use FileChannel in FSDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984222#action_12984222 ] Earwin Burrfoot commented on LUCENE-2871: - Before arguing where to put this new IndexOutput, I think it's wise to have a benchmark proving we need it at all. I have serious doubts FileChannel is going to outperform RAF.write(). Why should it? And for the purposes of a benchmark, it can live anywhere. > Use FileChannel in FSDirectory > -- > > Key: LUCENE-2871 > URL: https://issues.apache.org/jira/browse/LUCENE-2871 > Project: Lucene - Java > Issue Type: New Feature > Components: Store >Reporter: Shay Banon > Attachments: LUCENE-2871.patch, LUCENE-2871.patch > > > Explore using FileChannel in FSDirectory to see if it improves write > operations performance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
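A throwaway micro-benchmark of the kind the comment above asks for could look like this. All names here are hypothetical, and numbers from such a toy loop are only indicative (no JIT warmup discipline, OS page-cache effects, etc.), but it is enough to sanity-check whether FileChannel.write() beats RandomAccessFile.write() for sequential output:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class WriteBench {
    static final int BUF = 8 * 1024;   // write in 8 KB chunks
    static final int ITERS = 2000;     // ~16 MB per run

    // Sequential writes through RandomAccessFile.write(byte[]).
    static long timeRaf(File f, byte[] buf) throws IOException {
        long t0 = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            for (int i = 0; i < ITERS; i++) raf.write(buf);
        }
        return System.nanoTime() - t0;
    }

    // The same writes through the file's FileChannel.
    static long timeChannel(File f, byte[] buf) throws IOException {
        long t0 = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
             FileChannel ch = raf.getChannel()) {
            ByteBuffer bb = ByteBuffer.wrap(buf);
            for (int i = 0; i < ITERS; i++) { bb.rewind(); ch.write(bb); }
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[BUF];
        File a = File.createTempFile("raf", ".bin");
        File b = File.createTempFile("chan", ".bin");
        a.deleteOnExit(); b.deleteOnExit();
        System.out.printf("RAF: %d ms, FileChannel: %d ms%n",
                timeRaf(a, buf) / 1_000_000, timeChannel(b, buf) / 1_000_000);
    }
}
```

As the comment notes, the benchmark can live anywhere; no decision about where the new IndexOutput belongs is needed to run it.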
[jira] Commented: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly
[ https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983162#action_12983162 ] Earwin Burrfoot commented on LUCENE-2657: - Thanks, but I'm not the one confused here. : ) > Replace Maven POM templates with full POMs, and change documentation > accordingly > > > Key: LUCENE-2657 > URL: https://issues.apache.org/jira/browse/LUCENE-2657 > Project: Lucene - Java > Issue Type: Improvement > Components: Build >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe >Assignee: Steven Rowe > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch > > > The current Maven POM templates only contain dependency information, the bare > bones necessary for uploading artifacts to the Maven repository. > The full Maven POMs in the attached patch include the information necessary > to run a multi-module Maven build, in addition to serving the same purpose as > the current POM templates. > Several dependencies are not available through public maven repositories. A > profile in the top-level POM can be activated to install these dependencies > from the various {{lib/}} directories into your local repository. 
From the > top-level directory: > {code} > mvn -N -Pbootstrap install > {code} > Once these non-Maven dependencies have been installed, to run all Lucene/Solr > tests via Maven's surefire plugin, and populate your local repository with > all artifacts, from the top level directory, run: > {code} > mvn install > {code} > When one Lucene/Solr module depends on another, the dependency is declared on > the *artifact(s)* produced by the other module and deposited in your local > repository, rather than on the other module's un-jarred compiler output in > the {{build/}} directory, so you must run {{mvn install}} on the other module > before its changes are visible to the module that depends on it. > To create all the artifacts without running tests: > {code} > mvn -DskipTests install > {code} > I almost always include the {{clean}} phase when I do a build, e.g.: > {code} > mvn -DskipTests clean install > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly
[ https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983160#action_12983160 ] Earwin Burrfoot commented on LUCENE-2657: - bq. we need to be very clear and it has no effect on artifacts I feel something was missed in the heat of debate. Eg: bq. The latest patch on this release uses the Ant artifacts directly. bq. This patch uses the Ant-produced artifacts to prepare for Maven artifact publishing. bq. Maven itself is not invoked in the process. An Ant plugin handles the artifact deployment. I will now try to decipher these quotes. It seems the patch takes the artifacts produced by Ant, as a part of our usual (and only) build process, and shoves them down the Maven repository's throat along with a bunch of pom-descriptors. Nothing else is happening. Also, after everything that has been said, I think nobody in his right mind will *force* anyone to actually use the Ant target in question as a part of a release. But it's nice to have it around, in case some user-friendly committer would like to push (I'd like to reiterate - Ant-generated) artifacts into Maven. > Replace Maven POM templates with full POMs, and change documentation > accordingly > > > Key: LUCENE-2657 > URL: https://issues.apache.org/jira/browse/LUCENE-2657 > Project: Lucene - Java > Issue Type: Improvement > Components: Build >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe >Assignee: Steven Rowe > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch > > > The current Maven POM templates only contain dependency information, the bare > bones necessary for uploading artifacts to the Maven repository. 
> The full Maven POMs in the attached patch include the information necessary > to run a multi-module Maven build, in addition to serving the same purpose as > the current POM templates. > Several dependencies are not available through public maven repositories. A > profile in the top-level POM can be activated to install these dependencies > from the various {{lib/}} directories into your local repository. From the > top-level directory: > {code} > mvn -N -Pbootstrap install > {code} > Once these non-Maven dependencies have been installed, to run all Lucene/Solr > tests via Maven's surefire plugin, and populate your local repository with > all artifacts, from the top level directory, run: > {code} > mvn install > {code} > When one Lucene/Solr module depends on another, the dependency is declared on > the *artifact(s)* produced by the other module and deposited in your local > repository, rather than on the other module's un-jarred compiler output in > the {{build/}} directory, so you must run {{mvn install}} on the other module > before its changes are visible to the module that depends on it. > To create all the artifacts without running tests: > {code} > mvn -DskipTests install > {code} > I almost always include the {{clean}} phase when I do a build, e.g.: > {code} > mvn -DskipTests clean install > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2657) Replace Maven POM templates with full POMs, and change documentation accordingly
[ https://issues.apache.org/jira/browse/LUCENE-2657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983152#action_12983152 ] Earwin Burrfoot commented on LUCENE-2657: - I am *amazed* at how idea/eclipse ant tasks managed to get into trunk. No, really, emacs is just enough for Lucene development, as proven by one of our most seasoned developers. So why the hell are those extras polluting main tree? > Replace Maven POM templates with full POMs, and change documentation > accordingly > > > Key: LUCENE-2657 > URL: https://issues.apache.org/jira/browse/LUCENE-2657 > Project: Lucene - Java > Issue Type: Improvement > Components: Build >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe >Assignee: Steven Rowe > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch, > LUCENE-2657.patch, LUCENE-2657.patch, LUCENE-2657.patch > > > The current Maven POM templates only contain dependency information, the bare > bones necessary for uploading artifacts to the Maven repository. > The full Maven POMs in the attached patch include the information necessary > to run a multi-module Maven build, in addition to serving the same purpose as > the current POM templates. > Several dependencies are not available through public maven repositories. A > profile in the top-level POM can be activated to install these dependencies > from the various {{lib/}} directories into your local repository. 
From the > top-level directory: > {code} > mvn -N -Pbootstrap install > {code} > Once these non-Maven dependencies have been installed, to run all Lucene/Solr > tests via Maven's surefire plugin, and populate your local repository with > all artifacts, from the top level directory, run: > {code} > mvn install > {code} > When one Lucene/Solr module depends on another, the dependency is declared on > the *artifact(s)* produced by the other module and deposited in your local > repository, rather than on the other module's un-jarred compiler output in > the {{build/}} directory, so you must run {{mvn install}} on the other module > before its changes are visible to the module that depends on it. > To create all the artifacts without running tests: > {code} > mvn -DskipTests install > {code} > I almost always include the {{clean}} phase when I do a build, e.g.: > {code} > mvn -DskipTests clean install > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2755) Some improvements to CMS
[ https://issues.apache.org/jira/browse/LUCENE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982564#action_12982564 ] Earwin Burrfoot commented on LUCENE-2755: - bq. if you still want to work on it, the I can keep the issue open and mark it 3.2 (unless you want to give it a try in 3.1). I'll start another later, so please, go on. > Some improvements to CMS > > > Key: LUCENE-2755 > URL: https://issues.apache.org/jira/browse/LUCENE-2755 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2755.patch > > > While running optimize on a large index, I've noticed several things that got > me to read CMS code more carefully, and find these issues: > * CMS may hold onto a merge if maxMergeCount is hit. That results in the > MergeThreads taking merges from the IndexWriter until they are exhausted, and > only then that blocked merge will run. I think it's unnecessary that that > merge will be blocked. > * CMS sorts merges by segments size, doc-based and not bytes-based. Since the > default MP is LogByteSizeMP, and I hardly believe people care about doc-based > size segments anymore, I think we should switch the default impl. There are > two ways to make it extensible, if we want: > ** Have an overridable member/method in CMS that you can extend and override > - easy. > ** Have OneMerge be comparable and let the MP determine the order (e.g. by > bytes, docs, calibrate deletes etc.). Better, but will need to tap into > several places in the code, so more risky and complicated. > On the go, I'd like to add some documentation to CMS - it's not very easy to > read and follow. > I'll work on a patch. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2374) Add introspection API to AttributeSource/AttributeImpl
[ https://issues.apache.org/jira/browse/LUCENE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982440#action_12982440 ] Earwin Burrfoot commented on LUCENE-2374: - Another step in the same direction then. Instead of bq. The only problem with this approach would be that the attribute keys must be unique - an idea would be to prefix them with the attribute name. Let us define the interface as - void reflect(Class attributeClass, String key, Object value) ? If the client code then wants to call toString() on attributeClass and concat with key - it's free to do so. If it wants to be more creative - it can. > Add introspection API to AttributeSource/AttributeImpl > -- > > Key: LUCENE-2374 > URL: https://issues.apache.org/jira/browse/LUCENE-2374 > Project: Lucene - Java > Issue Type: Improvement > Components: contrib/analyzers >Reporter: Uwe Schindler >Assignee: Uwe Schindler > Fix For: 3.1, 4.0 > > > AttributeSource/TokenStream inspection in Solr needs to have some insight > into the contents of AttributeImpls. As LUCENE-2302 has some problems with > toString() [which is not structured and conflicts with CharSequence's > definition for CharTermAttribute], I propose a simple API that gets a default > implementation in AttributeImpl (just like toString() currently): > - Iterator> AttributeImpl.contentsIterator() returns an > iterator (for most attributes it's a singleton) of a key-value pair, e.g. > "term"->"foobar","startOffset"->Integer.valueOf(0),... > - AttributeSource gets the same method, it just concats the iterators of each > getAttributeImplsIterator() AttributeImpl > No backwards problems occur, as the default toString() method will work like > before (it just gets the iterator and lists), but we simply remove the > documentation for the format. (Char)TermAttribute gets a special impl of > toString() according to CharSequence and a corresponding iterator. 
> I also want to remove the abstract hashCode() and equals() methods from > AttributeImpl, as they are not needed and just create work for the > implementor. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
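The reflect(Class, String, Object) variant proposed in the comment above can be sketched in a few lines. Everything below is an illustrative stand-in, not the actual Lucene API: AttributeReflector, ToyTermAttribute, and ReflectDemo are hypothetical names invented for this sketch.

```java
// Hypothetical sketch of the callback-style introspection API discussed above.
interface AttributeReflector {
    void reflect(Class<?> attributeClass, String key, Object value);
}

class ToyTermAttribute {
    private final String term = "foobar";

    // The attribute pushes its contents into the reflector instead of
    // exposing a fake Map or a structured toString().
    void reflectWith(AttributeReflector reflector) {
        reflector.reflect(CharSequence.class, "term", term);
    }
}

class ReflectDemo {
    // Client code is free to build unique keys by prefixing the class name,
    // or to do something more creative - the choice is no longer baked in.
    static String describe(ToyTermAttribute att) {
        StringBuilder sb = new StringBuilder();
        att.reflectWith((cls, key, value) ->
            sb.append(cls.getSimpleName()).append('#').append(key).append('=').append(value));
        return sb.toString();
    }
}
```

The point of passing the Class separately is exactly what the comment argues: key uniqueness becomes the caller's concern, handled however the caller likes.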
[jira] Commented: (LUCENE-2374) Add introspection API to AttributeSource/AttributeImpl
[ https://issues.apache.org/jira/browse/LUCENE-2374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982437#action_12982437 ] Earwin Burrfoot commented on LUCENE-2374: - Nice. Except maybe introduce a simple interface instead of the Map? {code} interface AttributeReflector { // Name is crap, should be changed void reflect(String key, Object value); } void reflectWith(AttributeReflector reflector); {code} You have no need for fake maps then, both in toString() and in user code.
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982166#action_12982166 ] Earwin Burrfoot commented on LUCENE-2858: - APIs have to be there still. All that commity, segment-deletery, mutabley stuff (that spans both atomic and composite readers). So, while your plan is viable, it won't remove that much cruft. > Separate SegmentReaders (and other atomic readers) from composite IndexReaders > -- > > Key: LUCENE-2858 > URL: https://issues.apache.org/jira/browse/LUCENE-2858 > Project: Lucene - Java > Issue Type: Task >Reporter: Uwe Schindler > Fix For: 4.0 > > > With current trunk, whenever you open an IndexReader on a directory you get > back a DirectoryReader, which is a composite reader. The interface of > IndexReader now has lots of methods that simply throw UOE (in fact more than > 50% of the commonly used methods are unusable now). This > confuses users and makes the API hard to understand. > This issue should split "atomic readers" from "reader collections" with a > separate API. After that, you are no longer able to get a TermsEnum without > wrapping from those composite readers. We currently have helper classes for > wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or > Multi*), those should be retrofitted to implement the correct classes > (SlowMultiReaderWrapper would be an atomic reader but takes a composite > reader as ctor param, maybe it could also simply take a List). > In my opinion, maybe composite readers could implement some collection APIs > and also have the ReaderUtil method directly built in (possibly as a "view" > in the util.Collection sense). In general composite readers do not really > need to look like the previous IndexReaders, they could simply be a > "collection" of SegmentReaders with some functionality like reopen. > On the other side, atomic readers do not need reopen logic anymore? When a > segment changes, you need a new atomic reader? - maybe because of deletions > that's not the best idea, but we should investigate. Maybe make the whole > reopen logic simpler to use (at least on the collection reader level). > We should decide about good names, I have no preference at the moment.
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982132#action_12982132 ] Earwin Burrfoot commented on LUCENE-2858: - bq. Still, i think we would need this method (somewhere) even with CSF, so that people can change the norms and they instantly take effect for searches. This still puzzles me. I can strain my imagination, and get people who just need to change norms without reindexing. But doing this and *requiring* instant turnaround? Kid me not :)
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982126#action_12982126 ] Earwin Burrfoot commented on LUCENE-2858: - bq. Any comments about removing write access from IndexReaders? I think setNorms() will be removed soon, but how about the others like deleteDocument()? I would propose to also make all IndexReaders simply readers, not writers? Voting with all my extremities - yes!!
[jira] Commented: (LUCENE-2868) It should be easy to make use of TermState; rewritten queries should be shared automatically
[ https://issues.apache.org/jira/browse/LUCENE-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981774#action_12981774 ] Earwin Burrfoot commented on LUCENE-2868: - We here use an intermediate query AST, with a number of walkers that do synonym substitution, optimization, caching, rewriting for multiple fields, and finally - generating a tree of Lucene Queries. I can share a generic reflection-based visitor that's somewhat more handy than the default visitor pattern in Java. Usage looks roughly like: {code} class ToStringWalker extends DispatchingVisitor<String> { // String here stands for the type of walk result String visit(TermQuery q) { return "{term: " + q.getTerm() + "}"; } String visit(BooleanQuery q) { StringBuffer buf = new StringBuffer(); buf.append("{boolean: "); for (BooleanQuery.Clause clause: q.clauses()) { buf.append(dispatch(clause.getQuery())).append(", "); } buf.append("}"); return buf.toString(); } String visit(SpanQuery q) { // Runs for all SpanQueries ... } String visit(Query q) { // Runs for all Queries not covered by a more exact visit() method ... } } Query query = ...; String stringRepresentation = new ToStringWalker().dispatch(query); {code} dispatch() checks its parameter's runtime type, picks the closest visit() overload (following Java's rules for compile-time overloaded-method resolution), and invokes it. > It should be easy to make use of TermState; rewritten queries should be > shared automatically > > > Key: LUCENE-2868 > URL: https://issues.apache.org/jira/browse/LUCENE-2868 > Project: Lucene - Java > Issue Type: Improvement > Components: Query/Scoring >Reporter: Karl Wright > Attachments: query-rewriter.patch > > > When you have the same query in a query hierarchy multiple times, tremendous > savings can now be had if the user knows enough to share the rewritten > queries in the hierarchy, due to the TermState addition. 
But this is clumsy > and requires a lot of coding by the user to take advantage of. Lucene should > be smart enough to share the rewritten queries automatically. > This can be most readily (and powerfully) done by introducing a new method to > Query.java: > Query rewriteUsingCache(IndexReader indexReader) > ... and including a caching implementation right in Query.java which would > then work for all. Of course, all callers would want to use this new method > rather than the current rewrite().
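The dispatch mechanism the comment describes can be sketched with plain reflection. This is a simplified, hypothetical reconstruction (it only climbs superclasses, ignores interfaces, and does no Method caching); Query and TermQuery below are toy stand-ins for the real classes:

```java
import java.lang.reflect.Method;

// Toy query hierarchy standing in for the real AST node types.
class Query {}
class TermQuery extends Query {}
class PhraseQuery extends Query {}

class DispatchingVisitor {
    // Walk up the argument's runtime type hierarchy until a matching
    // public visit(...) overload exists, then invoke it.
    String dispatch(Query q) {
        for (Class<?> c = q.getClass(); Query.class.isAssignableFrom(c); c = c.getSuperclass()) {
            try {
                Method m = getClass().getMethod("visit", c);
                return (String) m.invoke(this, q);
            } catch (NoSuchMethodException e) {
                // No overload for this exact type; try the superclass next.
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        }
        throw new IllegalArgumentException("no visit() overload for " + q.getClass());
    }
}

class ToStringWalker extends DispatchingVisitor {
    public String visit(TermQuery q) { return "{term}"; }
    public String visit(Query q)     { return "{query}"; } // fallback for everything else
}
```

A PhraseQuery has no dedicated overload here, so dispatch() falls through to visit(Query) - the same "closest overload" behavior the comment describes.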
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981388#action_12981388 ] Earwin Burrfoot commented on LUCENE-2324: - Maan, this comment list is infinite. How do I currently get the ..er.. current version? Latest branch + latest Jason's patch? Regardless of everything else, I'd ask you not to extend random things :) at least if you can't say is-a about them. DocumentsWriterPerThreadPool.ThreadState IS A ReentrantLock? No. So you're better off encapsulating it rather than extending. Same can be applied to SegmentInfos that extends Vector :/ > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: Realtime Branch > > Attachments: LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, > LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, LUCENE-2324-SMALL.patch, > lucene-2324.patch, lucene-2324.patch, LUCENE-2324.patch, test.out, test.out > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU.
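The composition-over-inheritance point above ("ThreadState IS A ReentrantLock? No.") can be shown in miniature. The fields and methods below are illustrative only, not the actual API from the patch:

```java
import java.util.concurrent.locks.ReentrantLock;

// A ThreadState that *has* a lock rather than *is* a lock: only the
// lock operations it actually needs are exposed, so callers cannot
// misuse it as a general-purpose ReentrantLock.
class ThreadState {
    private final ReentrantLock lock = new ReentrantLock();
    // ... per-thread indexing state would live here ...

    boolean tryLock() { return lock.tryLock(); }
    void unlock()     { lock.unlock(); }
    boolean isHeldByCurrentThread() { return lock.isHeldByCurrentThread(); }
}
```

The same refactoring applies to SegmentInfos extending Vector: hold the list as a private field and expose only the operations that make sense for a segment list.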
[jira] Commented: (LUCENE-2863) Updating a document loses its fields that are only indexed, also NumericField tries are completely lost
[ https://issues.apache.org/jira/browse/LUCENE-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980965#action_12980965 ] Earwin Burrfoot commented on LUCENE-2863: - updateDocument() is an atomic version of deleteDocument() + addDocument(), nothing more, and there's nothing surprising about losing your fields if you delete the doc and don't add them back later. > Updating a document loses its fields that are only indexed, also NumericField > tries are completely lost > --- > > Key: LUCENE-2863 > URL: https://issues.apache.org/jira/browse/LUCENE-2863 > Project: Lucene - Java > Issue Type: Bug > Components: Store >Affects Versions: 3.0.2, 3.0.3 > Environment: WindowsXP, Java1.6.20 using a RamDirectory >Reporter: Tamas Sandor >Priority: Blocker > > I have a code snippet (see below) which creates a new document with standard > (stored, indexed), *not-stored, indexed-only* and some *NumericFields*. Then > it updates the document via adding a new string field. The result is that all > the fields that are not stored but indexed-only, and especially the > NumericField trie tokens, are completely lost from the index after update or delete/add. 
> {code:java} > Directory ramDir = new RAMDirectory(); > IndexWriter writer = new IndexWriter(ramDir, new WhitespaceAnalyzer(), > MaxFieldLength.UNLIMITED); > Document doc = new Document(); > doc.add(new Field("ID", "HO1234", Store.YES, Index.NOT_ANALYZED_NO_NORMS)); > doc.add(new Field("PATTERN", "HELLO", Store.NO, Index.NOT_ANALYZED_NO_NORMS)); > doc.add(new NumericField("LAT", Store.YES, > true).setDoubleValue(51.48826603066d)); > doc.add(new NumericField("LNG", Store.YES, > true).setDoubleValue(-0.08913399651646614d)); > writer.addDocument(doc); > doc = new Document(); > doc.add(new Field("ID", "HO", Store.YES, Index.NOT_ANALYZED_NO_NORMS)); > doc.add(new Field("PATTERN", "BELLO", Store.NO, Index.NOT_ANALYZED_NO_NORMS)); > doc.add(new NumericField("LAT", Store.YES, > true).setDoubleValue(101.48826603066d)); > doc.add(new NumericField("LNG", Store.YES, > true).setDoubleValue(-100.08913399651646614d)); > writer.addDocument(doc); > Term t = new Term("ID", "HO1234"); > Query q = new TermQuery(t); > IndexSearcher searcher = new IndexSearcher(writer.getReader()); > TopDocs hits = searcher.search(q, 1); > if (hits.scoreDocs.length > 0) { > Document ndoc = searcher.doc(hits.scoreDocs[0].doc); > ndoc.add(new Field("FINAL", "FINAL", Store.YES, > Index.NOT_ANALYZED_NO_NORMS)); > writer.updateDocument(t, ndoc); > // writer.deleteDocuments(q); > // writer.addDocument(ndoc); > } else { > LOG.info("Couldn't find the document via the query"); > } > searcher = new IndexSearcher(writer.getReader()); > hits = searcher.search(new TermQuery(new Term("PATTERN", "HELLO")), 1); > LOG.info("_hits HELLO:" + hits.totalHits); // should be 1 but it's 0 > writer.close(); > {code} > And I have a boundingbox query based on *NumericRangeQuery*. After the > document update it doesn't return any hit.
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980736#action_12980736 ] Earwin Burrfoot commented on LUCENE-2793: - {quote} As I said before though, i wouldn't mind if we had something more like a 'modules/native' and FSDirectory checked, if this was available and automagically used it... but I can't see myself thinking that we should put this logic into fsdir itself, sorry. {quote} I'm perfectly OK with that approach (having some module FSDir checks). I also feel uneasy having JNI in core. What I don't want to see, is Directory impls that you can't use on their own. If you can only use it for merging, then it's not a Directory, it breaks the contract! - move the code elsewhere. > Directory createOutput and openInput should take an IOContext > - > > Key: LUCENE-2793 > URL: https://issues.apache.org/jira/browse/LUCENE-2793 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Michael McCandless > Attachments: LUCENE-2793.patch > > > Today for merging we pass down a larger readBufferSize than for searching > because we get better performance. > I think we should generalize this to a class (IOContext), which would hold > the buffer size, but then could hold other flags like DIRECT (bypass OS's > buffer cache), SEQUENTIAL, etc. > Then, we can make the DirectIOLinuxDirectory fully usable because we would > only use DIRECT/SEQUENTIAL during merging. > This will require fixing how IW pools readers, so that a reader opened for > merging is not then used for searching, and vice/versa. Really, it's only > all the open file handles that need to be different -- we could in theory > share del docs, norms, etc, if that were somehow possible.
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980732#action_12980732 ] Earwin Burrfoot commented on LUCENE-2793: - bq. Because in your example code above, it looks like it's added to Directory itself. bq. My problem with your sample code is that it appears that the .setBufferSize method is on Directory itself. Ohoho. My fault, sorry. It should look like: {code} RAMDirectory ramDir = new RAMDirectory(); ramDir.setBufferSize(whatever) // Compilation error! ramDir.createIndexInput(name, context); NIOFSDirectory fsDir = new NIOFSDirectory(); fsDir.setBufferSize(IOContext.NORMAL_READ, 1024); fsDir.setBufferSize(IOContext.MERGE, 4096); fsDir.createIndexInput(name, context) {code}
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980649#action_12980649 ] Earwin Burrfoot commented on LUCENE-2793: - What's with the ongoing craziness? :) bq. DirectIOLinuxDirectory First you introduce a kind of directory that is utterly useless except in certain special situations. Then, instead of fixing the directory/folding its code somewhere normal, you try to work around it by switching between directories. What's the point of using abstract classes or interfaces, if you leak their implementation's logic all over the place? Or making DIOLD wrap something. Yeah! Wrap my RAMDir! bq. bufferSize This value is only meaningful to a certain subset of Directory implementations. So the only logical place we want to see this value set - is these very impls. Sample code: {code} Directory ramDir = new RAMDirectory(); ramDir.createIndexInput(name, context); // See, ma? No bufferSizes, they are pointless for RAMDir Directory fsDir = new NIOFSDirectory(); fsDir.setBufferSize(IOContext.NORMAL_READ, 1024); fsDir.setBufferSize(IOContext.MERGE, 4096); fsDir.createIndexInput(name, context) // See, ma? The only one who's really concerned with 'actual' buffer size is this concrete Directory impl // All client code is only concerned with the context. // It's NIOFSDirectory's business to give meaningful interpretation to IOContext and assign the buffer sizes. {code} You don't need custom Directory impls to make DIOLD work, you should freakin' fix it. The proper way is to test out the things, and then move the DirectIO code to the only place it makes sense in - FSDir? Probably make it switch on/off-able, maybe not. You don't need custom Directory impls to set buffer sizes (nor to cast to BufferedIndexInput!), you should add the setting to those Directories which make sense of it. 
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980458#action_12980458 ] Earwin Burrfoot commented on LUCENE-2793: - In fact, I suggest dropping bufferSize altogether. As far as I can recall, it was introduced as a precursor to IOContext and can now be safely replaced. Even if we want to give the user control over buffer size for all streams, or only those opened in a specific IOContext, he can pass these numbers as config parameters to his Directory impl. That makes total sense, as: 1. IndexWriter/IndexReader couldn't care less about buffer sizes, they just pass them to the Directory. It's not their concern. 2. A bunch of Directories don't use said bufferSize at all, making this parameter not only private Directory affairs, but even further - implementation-specific. So my bet is - introduce IOContext as a simple Enum, change the bufferSize parameter on createInput/Output to IOContext, done.
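The "IOContext as a simple enum" proposal could look like the sketch below. Everything here is hypothetical - the Directory base class, BufferedFSDirectory, and bufferSizeFor() are stand-ins loosely mirroring the setBufferSize(IOContext, int) sample given earlier in the thread, not real Lucene code:

```java
import java.util.EnumMap;
import java.util.Map;

// IOContext as a plain enum: callers only say *why* they open a stream.
enum IOContext { NORMAL_READ, MERGE }

abstract class Directory {
    // In the real proposal, createInput/createOutput would take an IOContext
    // instead of a bufferSize; here we only model the size lookup.
    abstract int bufferSizeFor(IOContext context);
}

class BufferedFSDirectory extends Directory {
    // Buffer sizes are an implementation detail of the one Directory
    // flavor that actually buffers; RAM-based impls never see them.
    private final Map<IOContext, Integer> bufferSizes = new EnumMap<>(IOContext.class);

    BufferedFSDirectory() {
        bufferSizes.put(IOContext.NORMAL_READ, 1024);
        bufferSizes.put(IOContext.MERGE, 4096);
    }

    void setBufferSize(IOContext context, int size) {
        bufferSizes.put(context, size);
    }

    @Override
    int bufferSizeFor(IOContext context) {
        return bufferSizes.get(context);
    }
}
```

Client code (IndexWriter/IndexReader) only ever passes the enum; mapping it to an actual buffer size stays inside the concrete Directory, which is the separation both comments argue for.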
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980454#action_12980454 ] Earwin Burrfoot commented on LUCENE-2793: - {quote} bq. You get IOFactory from Directory That's for the default, the main use is the static IOFactory class. {quote} You lost me here. If you got A from B, you don't have to pass B again to invoke A; if you do - that's 99% a design mistake. But still, my point was that you don't need IOFactory at all. bq. Right, however we're basically trying to intermix Directory's, which doesn't work when pointed at the same underlying File. I thought about a meta-Directory that routes based on the IOContext, however we'd still need a way to create an IndexInput and IndexOutput, from different Directory implementations. What Directories are you trying to intermix? What for? I thought the only thing done in this issue is an attempt to give Directory hints as to why we're going to open its streams. A simple enum IOContext and an extra parameter on createOutput/Input would suffice. But with Lucene's micromanagement attitude, an enum turns into a slightly more complex thing, with bufferSizes and whatnot. Still - no need for mixing Directories.
[jira] Commented: (LUCENE-2856) Create IndexWriter event listener, specifically for merges
[ https://issues.apache.org/jira/browse/LUCENE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980448#action_12980448 ] Earwin Burrfoot commented on LUCENE-2856: - A SegmentListener that has a number of children SLs and delegates eventHappened() calls to them. > Create IndexWriter event listener, specifically for merges > -- > > Key: LUCENE-2856 > URL: https://issues.apache.org/jira/browse/LUCENE-2856 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 4.0 >Reporter: Jason Rutherglen > Attachments: LUCENE-2856.patch > > > The issue will allow users to monitor merges occurring within IndexWriter > using a callback notifier event listener. This can be used by external > applications such as Solr to monitor large segment merges.
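The composite listener suggested in the comment above might look like this sketch. The SegmentListener interface and its single eventHappened(String) method are hypothetical, not the API from the attached patch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface for IndexWriter/merge events.
interface SegmentListener {
    void eventHappened(String event);
}

// One SegmentListener that fans each event out to its children,
// so IndexWriter only ever has to know about a single listener.
class CompositeSegmentListener implements SegmentListener {
    private final List<SegmentListener> children = new ArrayList<>();

    CompositeSegmentListener add(SegmentListener child) {
        children.add(child);
        return this; // allow chaining
    }

    @Override
    public void eventHappened(String event) {
        for (SegmentListener child : children) {
            child.eventHappened(event);
        }
    }
}
```

An external application such as Solr would register its own child listener on the composite instead of replacing whatever listener is already installed.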
[jira] Commented: (LUCENE-2793) Directory createOutput and openInput should take an IOContext
[ https://issues.apache.org/jira/browse/LUCENE-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980400#action_12980400 ] Earwin Burrfoot commented on LUCENE-2793: - Looks crazy. In a -bad- tangled way. You get IOFactory from Directory, put into IOContext, and then invoke it, passing it (wow!) an IOContext and a Directory. What if you pass totally different Directory? Different IOContext? It blows up eerily. And there's no justification for this - we already have an IOFactory, it's called Directory! It just needs an extra parameter on its factory methods (createInput/Output), that's all. > Directory createOutput and openInput should take an IOContext > - > > Key: LUCENE-2793 > URL: https://issues.apache.org/jira/browse/LUCENE-2793 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Michael McCandless > Attachments: LUCENE-2793.patch > > > Today for merging we pass down a larger readBufferSize than for searching > because we get better performance. > I think we should generalize this to a class (IOContext), which would hold > the buffer size, but then could hold other flags like DIRECT (bypass OS's > buffer cache), SEQUENTIAL, etc. > Then, we can make the DirectIOLinuxDirectory fully usable because we would > only use DIRECT/SEQUENTIAL during merging. > This will require fixing how IW pools readers, so that a reader opened for > merging is not then used for searching, and vice/versa. Really, it's only > all the open file handles that need to be different -- we could in theory > share del docs, norms, etc, if that were somehow possible. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2856) Create IndexWriter event listener, specifically for merges
[ https://issues.apache.org/jira/browse/LUCENE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980390#action_12980390 ] Earwin Burrfoot commented on LUCENE-2856: - A CompositeSegmentListener niftily removes the need for collection. > Create IndexWriter event listener, specifically for merges > -- > > Key: LUCENE-2856 > URL: https://issues.apache.org/jira/browse/LUCENE-2856 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 4.0 >Reporter: Jason Rutherglen > Attachments: LUCENE-2856.patch > > > The issue will allow users to monitor merges occurring within IndexWriter > using a callback notifier event listener. This can be used by external > applications such as Solr to monitor large segment merges. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2858) Separate SegmentReaders (and other atomic readers) from composite IndexReaders
[ https://issues.apache.org/jira/browse/LUCENE-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980388#action_12980388 ] Earwin Burrfoot commented on LUCENE-2858: - bq. On the other side, atomic readers do not need reopen logic anymore? When a segment changes, you need a new atomic reader? There is a freakload of places that "upgrade" SegmentReader in various ways, with deletions responsible for only part of the cases. I'll try getting back to LUCENE-2355 at the end of the week. > Separate SegmentReaders (and other atomic readers) from composite IndexReaders > -- > > Key: LUCENE-2858 > URL: https://issues.apache.org/jira/browse/LUCENE-2858 > Project: Lucene - Java > Issue Type: Task >Reporter: Uwe Schindler > Fix For: 4.0 > > > With current trunk, whenever you open an IndexReader on a directory you get > back a DirectoryReader which is a composite reader. The interface of > IndexReader has now lots of methods that simply throw UOE (in fact more than > 50% of all methods that are commonly used ones are unuseable now). This > confuses users and makes the API hard to understand. > This issue should split "atomic readers" from "reader collections" with a > separate API. After that, you are no longer able, to get TermsEnum without > wrapping from those composite readers. We currently have helper classes for > wrapping (SlowMultiReaderWrapper - please rename, the name is really ugly; or > Multi*), those should be retrofitted to implement the correct classes > (SlowMultiReaderWrapper would be an atomic reader but takes a composite > reader as ctor param, maybe it could also simply take a List). > In my opinion, maybe composite readers could implement some collection APIs > and also have the ReaderUtil method directly built in (possibly as a "view" > in the util.Collection sense). 
In general composite readers do not really > need to look like the previous IndexReaders, they could simply be a > "collection" of SegmentReaders with some functionality like reopen. > On the other side, atomic readers do not need reopen logic anymore? When a > segment changes, you need a new atomic reader? - maybe because of deletions > thats not the best idea, but we should investigate. Maybe make the whole > reopen logic simplier to use (ast least on the collection reader level). > We should decide about good names, i have no preference at the moment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2474) Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean custom caches that use the IndexReader (getFieldCacheKey)
[ https://issues.apache.org/jira/browse/LUCENE-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979888#action_12979888 ] Earwin Burrfoot commented on LUCENE-2474: - bq. Earwin's working on improving this, I think, under LUCENE-2355 I stalled, and then there were just so many changes under trunk, so I have to restart now :) Thanks for another kick. > Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean > custom caches that use the IndexReader (getFieldCacheKey) > > > Key: LUCENE-2474 > URL: https://issues.apache.org/jira/browse/LUCENE-2474 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Shay Banon > Attachments: LUCENE-2474.patch, LUCENE-2474.patch > > > Allow to plug in a Cache Eviction Listener to IndexReader to eagerly clean > custom caches that use the IndexReader (getFieldCacheKey). > A spin of: https://issues.apache.org/jira/browse/LUCENE-2468. Basically, its > make a lot of sense to cache things based on IndexReader#getFieldCacheKey, > even Lucene itself uses it, for example, with the CachingWrapperFilter. > FieldCache enjoys being called explicitly to purge its cache when possible > (which is tricky to know from the "outside", especially when using NRT - > reader attack of the clones). > The provided patch allows to plug a CacheEvictionListener which will be > called when the cache should be purged for an IndexReader. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2312) Search on IndexWriter's RAM Buffer
[ https://issues.apache.org/jira/browse/LUCENE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979522#action_12979522 ] Earwin Burrfoot commented on LUCENE-2312: - Some questions to align myself with impending reality. Is it right that future RT readers are no longer immutable snapshots (in the sense that they have a variable maxDoc)? If so, are you keeping the current NRT mode, with fast turnaround, yet immutable readers? > Search on IndexWriter's RAM Buffer > -- > > Key: LUCENE-2312 > URL: https://issues.apache.org/jira/browse/LUCENE-2312 > Project: Lucene - Java > Issue Type: New Feature > Components: Search >Affects Versions: Realtime Branch >Reporter: Jason Rutherglen >Assignee: Michael Busch > Fix For: Realtime Branch > > Attachments: LUCENE-2312-FC.patch, LUCENE-2312.patch > > > In order to offer user's near realtime search, without incurring > an indexing performance penalty, we can implement search on > IndexWriter's RAM buffer. This is the buffer that is filled in > RAM as documents are indexed. Currently the RAM buffer is > flushed to the underlying directory (usually disk) before being > made searchable. > Todays Lucene based NRT systems must incur the cost of merging > segments, which can slow indexing. > Michael Busch has good suggestions regarding how to handle deletes using max > doc ids. > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841923&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841923 > The area that isn't fully fleshed out is the terms dictionary, > which needs to be sorted prior to queries executing. Currently > IW implements a specialized hash table. Michael B has a > suggestion here: > https://issues.apache.org/jira/browse/LUCENE-2293?focusedCommentId=12841915&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12841915 -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979366#action_12979366 ] Earwin Burrfoot commented on LUCENE-2843: - bq. Nope, havent looked at their code... i think i stopped at the documentation when i saw how they analyzed text! All my points are contained within their documentation. No need to look at the code (it's as shady as Lucene's). In the same manner, Lucene had crappy analysis for years, until you took hold of the (unicode) police baton. So let's not allow color differences between our analyzers to affect our judgement on other parts of ours : ) bq. In other words, Test2BTerms in src/test should pass on my 32-bit windows machine with whatever we default to. I'm questioning whether there is any legitimate, adequate reason to have that many terms. I'm agreeing on the mmap+32bit/mmap+windows point for a reasonable amount of terms though :/ A hybrid solution, with the term-dict being loaded completely into memory (either via mmap, or into arrays) on a per-field basis, is probably best in the end, however sad it may be. > Add variable-gap terms index impl. > -- > > Key: LUCENE-2843 > URL: https://issues.apache.org/jira/browse/LUCENE-2843 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-2843.patch, LUCENE-2843.patch > > > PrefixCodedTermsReader/Writer (used by all "real" core codecs) already > supports pluggable terms index impls. > The only impl we have now is FixedGapTermsIndexReader/Writer, which > picks every Nth (default 32) term and holds it in efficient packed > int/byte arrays in RAM. This is already an enormous improvement (RAM > reduction, init time) over 3.x. > This patch adds another impl, VariableGapTermsIndexReader/Writer, > which lets you specify an arbitrary IndexTermSelector to pick which > terms are indexed, and then uses an FST to hold the indexed terms. 
> This is typically even more memory efficient than packed int/byte > arrays, though, it does not support ord() so it's not quite a fair > comparison. > I had to relax the terms index plugin api for > PrefixCodedTermsReader/Writer to not assume that the terms index impl > supports ord. > I also did some cleanup of the FST/FSTEnum APIs and impls, and broke > out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor > when the FST is used as a terms index but seekCeil when it's holding > all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979346#action_12979346 ] Earwin Burrfoot commented on LUCENE-2843: - bq. I don't like the reasoning that, just because sphinx does it and their 'users manage', that makes it ok. I'm in no way advocating it as an all-round better solution. It has its wrinkles, just as anything else does. My reasoning is merely that an alternative exists, and it is viable. As proven by pretty high-profile users. They have a memory-resident term dictionary, and it works; I have heard no complaints regarding this, ever. bq. sphinx also requires mysql Have you read anything at all? It has an integration ready, for the layman user who just wants to stick a fulltext search into their little app, but it is in no way reliant on it. Sphinx is a direct alternative to Solr. {quote} But, I'm not a fan of pure disk-based terms dict. Expecting the OS to make good decisions on what gets swapped out is risky - Lucene is better informed than the OS on which data structures are worth spending RAM on (norms, terms index, field cache, del docs). If indeed the terms dict (thanks to FSTs) becomes small enough to "fit" in RAM, then we should load it into RAM (and do away w/ the terms index). {quote} That's a bit delusional. If a system is forced to swap out, it'll swap your explicitly managed RAM just as readily as memory-mapped files. I've seen this countless times. But then, you have a number of benefits - like sharing filesystem cache when opening the same file multiple times, offloading things from the Java heap (which is almost always a good thing), and the fastest load-into-memory times possible. Sorry if I sound offensive at times, but, damn, there's a whole world of simple and efficient code lying ahead in that direction :) > Add variable-gap terms index impl. 
> -- > > Key: LUCENE-2843 > URL: https://issues.apache.org/jira/browse/LUCENE-2843 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-2843.patch, LUCENE-2843.patch > > > PrefixCodedTermsReader/Writer (used by all "real" core codecs) already > supports pluggable terms index impls. > The only impl we have now is FixedGapTermsIndexReader/Writer, which > picks every Nth (default 32) term and holds it in efficient packed > int/byte arrays in RAM. This is already an enormous improvement (RAM > reduction, init time) over 3.x. > This patch adds another impl, VariableGapTermsIndexReader/Writer, > which lets you specify an arbitrary IndexTermSelector to pick which > terms are indexed, and then uses an FST to hold the indexed terms. > This is typically even more memory efficient than packed int/byte > arrays, though, it does not support ord() so it's not quite a fair > comparison. > I had to relax the terms index plugin api for > PrefixCodedTermsReader/Writer to not assume that the terms index impl > supports ord. > I also did some cleanup of the FST/FSTEnum APIs and impls, and broke > out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor > when the FST is used as a terms index but seekCeil when it's holding > all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)
[ https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979306#action_12979306 ] Earwin Burrfoot commented on LUCENE-2840: - A lot of fork-join type frameworks don't even care. Even though scheduling threads is something people supposedly use them for. Why? I guess that's due to low yield/cost ratio. You frequently quote "progress, not perfection" in relation to the code, but why don't we apply this same principle to our threading guarantees? I don't want to use allowed concurrency fully. That's not realistic. I want 85% of it. That's already a huge leap ahead of single-threaded searches. > Multi-Threading in IndexSearcher (after removal of MultiSearcher and > ParallelMultiSearcher) > --- > > Key: LUCENE-2840 > URL: https://issues.apache.org/jira/browse/LUCENE-2840 > Project: Lucene - Java > Issue Type: Sub-task > Components: Search >Reporter: Uwe Schindler >Priority: Minor > Fix For: 4.0 > > > Spin-off from parent issue: > {quote} > We should discuss about how many threads should be spawned. If you have an > index with many segments, even small ones, I think only the larger segments > should be separate threads, all others should be handled sequentially. So > maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then > only spawn maxThreads-1 threads for the bigger readers and then one > additional thread for the rest? > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979305#action_12979305 ] Earwin Burrfoot commented on LUCENE-2843: - As I said, there's already a search server with strictly in-memory (in mmap sense. it can theoretically be paged out) terms dict AND widespread adoption. Their users somehow manage. My guess is that's because people with "insane number of terms" store various crap like unique timestamps as terms. With CSF ("attributes" in Sphinx lingo), and some nice filters that can work over CSF, there's no longer any need to stuff your timestamps in the same place you stuff your texts. That can be reflected in documentation, and then, suddenly, we can drop "on-disk only" support. > Add variable-gap terms index impl. > -- > > Key: LUCENE-2843 > URL: https://issues.apache.org/jira/browse/LUCENE-2843 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-2843.patch, LUCENE-2843.patch > > > PrefixCodedTermsReader/Writer (used by all "real" core codecs) already > supports pluggable terms index impls. > The only impl we have now is FixedGapTermsIndexReader/Writer, which > picks every Nth (default 32) term and holds it in efficient packed > int/byte arrays in RAM. This is already an enormous improvement (RAM > reduction, init time) over 3.x. > This patch adds another impl, VariableGapTermsIndexReader/Writer, > which lets you specify an arbitrary IndexTermSelector to pick which > terms are indexed, and then uses an FST to hold the indexed terms. > This is typically even more memory efficient than packed int/byte > arrays, though, it does not support ord() so it's not quite a fair > comparison. > I had to relax the terms index plugin api for > PrefixCodedTermsReader/Writer to not assume that the terms index impl > supports ord. 
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke > out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor > when the FST is used as a terms index but seekCeil when it's holding > all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.
[ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979277#action_12979277 ] Earwin Burrfoot commented on LUCENE-2843: - And we're nearing a day when we keep the whole term dictionary in memory (as Sphinx does for instance). At that point a gazillion of term lookup-related hacks (like lookup cache) become obsolete :) Term dictionary itself can also be memory-mapped after this, instead of being "read" and "built" from disk, which makes new segment opening near-instantaneous. > Add variable-gap terms index impl. > -- > > Key: LUCENE-2843 > URL: https://issues.apache.org/jira/browse/LUCENE-2843 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: LUCENE-2843.patch, LUCENE-2843.patch > > > PrefixCodedTermsReader/Writer (used by all "real" core codecs) already > supports pluggable terms index impls. > The only impl we have now is FixedGapTermsIndexReader/Writer, which > picks every Nth (default 32) term and holds it in efficient packed > int/byte arrays in RAM. This is already an enormous improvement (RAM > reduction, init time) over 3.x. > This patch adds another impl, VariableGapTermsIndexReader/Writer, > which lets you specify an arbitrary IndexTermSelector to pick which > terms are indexed, and then uses an FST to hold the indexed terms. > This is typically even more memory efficient than packed int/byte > arrays, though, it does not support ord() so it's not quite a fair > comparison. > I had to relax the terms index plugin api for > PrefixCodedTermsReader/Writer to not assume that the terms index impl > supports ord. > I also did some cleanup of the FST/FSTEnum APIs and impls, and broke > out separate seekCeil and seekFloor in FSTEnum. 
Eg we need seekFloor > when the FST is used as a terms index but seekCeil when it's holding > all terms in the index (ie which SimpleText uses FSTs for). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
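The memory-mapping idea in the comment above (mapping the term dictionary file instead of reading and rebuilding it on open) can be sketched with plain NIO. This is a hypothetical helper for illustration, not Lucene code:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of the mmap idea: mapping a (term dictionary) file exposes its
// bytes directly, and the OS pages them in lazily on first access -- so
// "opening" costs almost nothing regardless of file size.
class MappedTermDict {
    static MappedByteBuffer map(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```

The trade-off discussed in the thread applies: the mapped pages live in the filesystem cache (shared across openers, off the Java heap), but the OS, not Lucene, decides when they are evicted.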
[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)
[ https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979276#action_12979276 ] Earwin Burrfoot commented on LUCENE-2840: - bq. But doesn't that mean that an app w/ rare queries but each query is massive fails to use all available concurrency? Yes. But that's not my case. And likely not anyone else's. I think if you want to be super-generic, it's better to defer exact threading to the user, instead of doing a one-size-fits-all solution. Else you risk conjuring another ConcurrentMergeScheduler. While we're at it, we can throw in some sample implementation, which can satisfy some of the users, but not everyone. > Multi-Threading in IndexSearcher (after removal of MultiSearcher and > ParallelMultiSearcher) > --- > > Key: LUCENE-2840 > URL: https://issues.apache.org/jira/browse/LUCENE-2840 > Project: Lucene - Java > Issue Type: Sub-task > Components: Search >Reporter: Uwe Schindler >Priority: Minor > Fix For: 4.0 > > > Spin-off from parent issue: > {quote} > We should discuss about how many threads should be spawned. If you have an > index with many segments, even small ones, I think only the larger segments > should be separate threads, all others should be handled sequentially. So > maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then > only spawn maxThreads-1 threads for the bigger readers and then one > additional thread for the rest? > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2840) Multi-Threading in IndexSearcher (after removal of MultiSearcher and ParallelMultiSearcher)
[ https://issues.apache.org/jira/browse/LUCENE-2840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976027#action_12976027 ] Earwin Burrfoot commented on LUCENE-2840: - I use the following scheme: * There is a fixed pool of threads shared by all searches, that limits total concurrency. * Each new search apprehends at most a fixed number of threads from this pool (say, 2-3 of 8 in my setup), * and these threads churn through segments as through a queue (in maxDoc order, but I think even that is unnecessary). No special smart binding between threads and segments (eg. 1 thread for each biggie, 1 thread for all of the small ones) - means simpler code, and zero possibility of stalling, when there are threads to run, segments to search, but binding policy does not connect them. Using fewer threads per-search than total available is a precaution against biggie searches blocking fast ones. > Multi-Threading in IndexSearcher (after removal of MultiSearcher and > ParallelMultiSearcher) > --- > > Key: LUCENE-2840 > URL: https://issues.apache.org/jira/browse/LUCENE-2840 > Project: Lucene - Java > Issue Type: Sub-task > Components: Search >Reporter: Uwe Schindler >Priority: Minor > Fix For: 4.0 > > > Spin-off from parent issue: > {quote} > We should discuss about how many threads should be spawned. If you have an > index with many segments, even small ones, I think only the larger segments > should be separate threads, all others should be handled sequentially. So > maybe add a maxThreads cound, then sort the IndexReaders by maxDoc and then > only spawn maxThreads-1 threads for the bigger readers and then one > additional thread for the rest? > {quote} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
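A minimal sketch of the scheme described in that comment, with hypothetical names and the per-segment search replaced by a stub:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One fixed pool shared by all searches caps total concurrency; each
// individual search grabs only a few workers, and those workers drain the
// segment list as a queue -- so no thread<->segment binding policy can
// ever stall idle threads while segments are still waiting.
class PooledSearcher {
    private final ExecutorService pool;   // shared by all concurrent searches
    private final int threadsPerSearch;   // e.g. 3 out of a pool of 8

    PooledSearcher(int poolSize, int threadsPerSearch) {
        this.pool = Executors.newFixedThreadPool(poolSize);
        this.threadsPerSearch = threadsPerSearch;
    }

    List<String> search(List<String> segments) throws InterruptedException {
        Queue<String> todo = new ConcurrentLinkedQueue<>(segments);
        Queue<String> results = new ConcurrentLinkedQueue<>();
        CountDownLatch done = new CountDownLatch(threadsPerSearch);
        for (int i = 0; i < threadsPerSearch; i++) {
            pool.submit(() -> {
                String segment;
                while ((segment = todo.poll()) != null) {
                    results.add("hits(" + segment + ")");  // stand-in for real per-segment search
                }
                done.countDown();
            });
        }
        done.await();
        return new ArrayList<>(results);
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

Capping each search below the pool size is what keeps a single huge search from monopolizing all workers and blocking the fast queries behind it.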
[jira] Commented: (LUCENE-2825) FSDirectory.open should return MMap on 64-bit Solaris
[ https://issues.apache.org/jira/browse/LUCENE-2825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12975212#action_12975212 ] Earwin Burrfoot commented on LUCENE-2825: - {quote} bq. CPU cache / TLB effects? Using more address space isn't completely free. In the case of Solaris I think there is much less of a chance of TLB effects? For example on Solaris java automatically uses large pages (unlike Linux, Windows etc). {quote} For, like, 13Gb of memory-mapped index, I've seen no noticeable difference between having large pages on and off under Linux. That's some anecdotal evidence, as I haven't done any extensive research, but still. > FSDirectory.open should return MMap on 64-bit Solaris > - > > Key: LUCENE-2825 > URL: https://issues.apache.org/jira/browse/LUCENE-2825 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Robert Muir > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2825.patch > > > MMap is ~ 30% faster than NIOFS on this platform. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery "pk lookup" performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974350#action_12974350 ] Earwin Burrfoot commented on LUCENE-2829: - Nobody halts your progress, we're merely discussing. I, on the other hand, have a feeling that Lucene is overflowing with "single incremental improvements" aka "hacks", as they are easier and faster to implement than trying to get a bigger picture, and, yes, rebuilding everything :) For example, better term dict code will make this issue (somewhat hackish, admit it?) irrelevant. Whether we implement bloom filters, or just guarantee to keep the whole term dict in memory with reasonable lookup routine (eg. as FST). Having said that, I reiterate, I'm not here to stop you or turn this issue into something else. > improve termquery "pk lookup" performance > - > > Key: LUCENE-2829 > URL: https://issues.apache.org/jira/browse/LUCENE-2829 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Robert Muir > Attachments: LUCENE-2829.patch > > > For things that are like primary keys and don't exist in some segments (worst > case is primary/unique key that only exists in 1) > we do wasted seeks. > While LUCENE-2694 tries to solve some of this issue with TermState, I'm > concerned we could every backport that to 3.1 for example. > This is a simpler solution here just to solve this one problem in > termquery... we could just revert it in trunk when we resolve LUCENE-2694, > but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2829) improve termquery "pk lookup" performance
[ https://issues.apache.org/jira/browse/LUCENE-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12974274#action_12974274 ] Earwin Burrfoot commented on LUCENE-2829: - Term lookup misses can be alleviated by a simple Bloom Filter. No caching misses required, helps both PK and near-PK queries. > improve termquery "pk lookup" performance > - > > Key: LUCENE-2829 > URL: https://issues.apache.org/jira/browse/LUCENE-2829 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Robert Muir > Attachments: LUCENE-2829.patch > > > For things that are like primary keys and don't exist in some segments (worst > case is primary/unique key that only exists in 1) > we do wasted seeks. > While LUCENE-2694 tries to solve some of this issue with TermState, I'm > concerned we could every backport that to 3.1 for example. > This is a simpler solution here just to solve this one problem in > termquery... we could just revert it in trunk when we resolve LUCENE-2694, > but I don't think we should leave things as they are in 3.x -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
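A toy version of the Bloom-filter idea from the comment above (two hash functions, hypothetical class; a real implementation would size the bit set from the expected term count and the target false-positive rate):

```java
import java.util.BitSet;

// Sketch of the suggestion: a per-segment Bloom filter answers "definitely
// absent" / "maybe present" for a term, letting a primary-key TermQuery
// skip the terms-dict seek on segments that cannot contain the term.
class TermBloomFilter {
    private final BitSet bits;
    private final int size;

    TermBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int h1(String term) {
        return Math.floorMod(term.hashCode(), size);
    }

    private int h2(String term) {
        return Math.floorMod(31 * term.hashCode() + 17, size);
    }

    void add(String term) {
        bits.set(h1(term));
        bits.set(h2(term));
    }

    // false => term is certainly not in this segment: skip the disk seek.
    // true  => term may be present: fall through to the real lookup.
    boolean mightContain(String term) {
        return bits.get(h1(term)) && bits.get(h2(term));
    }
}
```

The asymmetry is what makes this safe for the PK case: a negative answer is always correct, so no caching of misses is needed, while a (rare) false positive merely costs one ordinary lookup.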
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch Synced to trunk. bq. Also, on the nocommit on exc in DW.addDocument, yes I think that (IFD.deleteNewFiles, not checkpoint) is still needed because DW can orphan the store files on abort? Orphaned files are deleted directly in StoredFieldsWriter.abort() and TermVectorsTermsWriter.abort(). As I said - all the open files tracking is now gone. Turns out checkpoint() is also no longer needed. I have no other lingering cleanup urges, this is ready to be committed. I think. > stop writing shared doc stores across segments > -- > > Key: LUCENE-2814 > URL: https://issues.apache.org/jira/browse/LUCENE-2814 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 3.1, 4.0 >Reporter: Michael McCandless >Assignee: Michael McCandless > Attachments: LUCENE-2814.patch, LUCENE-2814.patch, LUCENE-2814.patch, > LUCENE-2814.patch, LUCENE-2814.patch > > > Shared doc stores enables the files for stored fields and term vectors to be > shared across multiple segments. We've had this optimization since 2.1 I > think. > It works best against a new index, where you open an IW, add lots of docs, > and then close it. In that case all of the written segments will reference > slices a single shared doc store segment. > This was a good optimization because it means we never need to merge these > files. But, when you open another IW on that index, it writes a new set of > doc stores, and then whenever merges take place across doc stores, they must > now be merged. > However, since we switched to shared doc stores, there have been two > optimizations for merging the stores. First, we now bulk-copy the bytes in > these files if the field name/number assignment is "congruent". Second, we > now force congruent field name/number mapping in IndexWriter. 
This means > this optimization is much less potent than it used to be. > Furthermore, the optimization adds *a lot* of hair to > IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over > time, and causes odd behavior like a merge possibly forcing a flush when it > starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent > flushing, we can no longer share doc stores. > So, I think we should turn off the write-side of shared doc stores to pave > the path for DWPT to land on trunk and simplify IW/DW. We still must support > reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2818: Priority: Minor (was: Major) This change is really minor but, I think, convenient. You don't have to lug a reference to the Directory along and recalculate the file name if the only thing you want to say is that the write was a failure and you no longer need this file. > abort() method for IndexOutput > -- > > Key: LUCENE-2818 > URL: https://issues.apache.org/jira/browse/LUCENE-2818 > Project: Lucene - Java > Issue Type: Improvement >Reporter: Earwin Burrfoot >Priority: Minor > > I'd like to see an abort() method on IndexOutput that silently (no exceptions) > closes the IO and then does a silent papaDir.deleteFile(this.fileName()). > This will simplify a bunch of error recovery code for IndexWriter and > friends, but constitutes an API backcompat break. > What do you think?
[jira] Commented: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972765#action_12972765 ] Earwin Burrfoot commented on LUCENE-2818: - bq. I think we can make a default impl that simply closes & suppresses exceptions? (We can't .deleteFile since an abstract IO doesn't know its Dir). Our concrete impls can override w/ versions that do delete the file... I don't think we need a default impl? For some directory impls close() is a no-op, and, what is more important, an abstract method forces you to implement it; you can't forget this, so we're not gonna see broken directories that don't do abort() properly.
[jira] Commented: (LUCENE-2818) abort() method for IndexOutput
[ https://issues.apache.org/jira/browse/LUCENE-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972764#action_12972764 ] Earwin Burrfoot commented on LUCENE-2818: - bq. Can abort() have a default impl in IndexOutput, such as close() followed by deleteFile() maybe? If so, then it won't break anything. It can't. To call deleteFile you need both a reference to the papa-Directory and the name of the file this IO writes to. The abstract IO class has neither. If we add them, they have to be passed to a new constructor, and that's an API break ;) bq. Would abort() on Directory fit better? E.g., it can abort all currently open and modified files, instead of the caller calling abort() on each IndexOutput? Are you thinking of a case where a write failed, and the caller would call abort() immediately, instead of some higher-level code? If so, would rollback() be a better name? Oh, no, no. No way. I don't want to push someone else's responsibility onto Directory. This abort() is merely a shortcut. Let's go with a usage example: Here's FieldsWriter.java with LUCENE-2814 applied (skipping irrelevant parts) - https://gist.github.com/746358 Now, the same, with abort() - https://gist.github.com/746367
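The trade-off discussed in this thread (the abstract IndexOutput knows neither its parent Directory nor its file name, so only concrete implementations can delete anything) can be sketched as follows. This is an illustrative mock of the proposal, not Lucene's actual IndexOutput API; all class and method names below are hypothetical, with java.nio.file.Path standing in for the Directory/fileName pair.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the proposed API. The abstract base cannot offer a
// default abort() that deletes the file: it holds neither a Directory
// reference nor the file name, exactly as noted in the comment above.
abstract class SketchIndexOutput {
    abstract void writeByte(byte b) throws IOException;
    abstract void close() throws IOException;
    // Abstract on purpose: every concrete directory impl is forced to
    // implement it, so none can "forget" proper abort behavior.
    abstract void abort();
}

// A file-backed impl that does know where it writes, so abort() can
// silently close and then delete the partially written file.
class PathIndexOutput extends SketchIndexOutput {
    private final Path path;
    private final OutputStream out;

    PathIndexOutput(Path path) throws IOException {
        this.path = path;
        this.out = Files.newOutputStream(path);
    }

    @Override void writeByte(byte b) throws IOException { out.write(b); }
    @Override void close() throws IOException { out.close(); }

    // The shortcut proposed in this issue: swallow exceptions on close,
    // then delete the file, simplifying callers' error-recovery code.
    @Override void abort() {
        try { out.close(); } catch (IOException ignored) { }
        try { Files.deleteIfExists(path); } catch (IOException ignored) { }
    }
}
```

With something like this, the catch blocks in the linked FieldsWriter example collapse to a single abort() call per output instead of a close-suppress-then-delete-by-name dance.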
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch New patch. Now with even more lines removed! DocStore-related index chain components used to track open/closed files through DocumentsWriter. Closed files list was unused, and is silently gone. Open files list was used to: * prevent not-yet-flushed shared docstores from being deleted by IndexFileDeleter. ** no shared docstores, no need + IFD no longer requires a reference to DW * delete already opened docstore files, when aborting. ** index chain now handles this on its own + has cleaner error handling code.
[jira] Created: (LUCENE-2818) abort() method for IndexOutput
abort() method for IndexOutput -- Key: LUCENE-2818 URL: https://issues.apache.org/jira/browse/LUCENE-2818 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot I'd like to see an abort() method on IndexOutput that silently (no exceptions) closes the IO and then does a silent papaDir.deleteFile(this.fileName()). This will simplify a bunch of error recovery code for IndexWriter and friends, but constitutes an API backcompat break. What do you think?
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch Patch updated to trunk, no nocommits, no *.closeDocStore(), tests pass. SegmentWriteState vs DocumentsWriter bothers me: we track flushed files in both, and we inconsistently get the current segment from both of them.
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972316#action_12972316 ] Earwin Burrfoot commented on LUCENE-2814: - Instead of you pulling out docstore removal, I can finish that patch. But then merging's gonna be even greater bitch. Probably. But maybe not. Do you do IRC? It can be faster to discuss in realtime, and you could also tell what help you need with the branch.
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972298#action_12972298 ] Earwin Burrfoot commented on LUCENE-2814: - So, what's the plan?
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972259#action_12972259 ] Earwin Burrfoot commented on LUCENE-2814: - Ugh. On first glance @ realtime branch, my patch is not a strict subset. But most stuff, including things I'm struggling with now, is definitely in. So, I guess, I'll stop wasting effort. What is the state of the realtime branch? Is it stable? How soon is "sometime soon"? :)
[jira] Updated: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2814: Attachment: LUCENE-2814.patch First iteration. Passes all tests except TestNRTThreads. Something to do with numDocsInStore and numDocsInRam merged together? Lots of non-critical nocommits (just markers for places I'd like to recheck). DW.docStoreEnabled and *.closeDocStore() have to go before committing.
[jira] Commented: (LUCENE-2811) SegmentInfo should explicitly track whether that segment wrote term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971510#action_12971510 ] Earwin Burrfoot commented on LUCENE-2811: - From IRC: SegmentMerger.hasVectors carries no new information compared to OneMerge.hasVectors, and can be dropped. OneMerge.hasVectors is initialized just near OneMerge.info, and is later used to set OneMerge.info.hasVectors, so we might as well do that from the get-go and drop OM.hV. > SegmentInfo should explicitly track whether that segment wrote term vectors > --- > > Key: LUCENE-2811 > URL: https://issues.apache.org/jira/browse/LUCENE-2811 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Reporter: Michael McCandless >Assignee: Michael McCandless > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2811.patch > > > Today SegmentInfo doesn't know if it has vectors, which means its files() > method must check if the files exist. > This leads to subtle bugs, because Si.files() caches the files but then we > fail to invalidate that later when the term vectors files are created. > It also leads to sloppy code, eg TermVectorsReader "gracefully" handles being > opened when the files do not exist. I don't like that; it should only be > opened if they exist. 
> This also fixes these intermittent failures we've been seeing: > {noformat} > junit.framework.AssertionFailedError: IndexFileDeleter doesn't know about > file _1e.tvx >at > org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:979) >at > org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:917) >at > org.apache.lucene.index.IndexWriter.filesExist(IndexWriter.java:3633) >at > org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:3699) >at > org.apache.lucene.index.IndexWriter.prepareCommit(IndexWriter.java:2407) >at > org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2478) >at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2460) >at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2444) >at > org.apache.lucene.index.TestIndexWriterExceptions.testRandomExceptionsThreads(TestIndexWriterExceptions.java:213) > {noformat}
[jira] Commented: (LUCENE-2811) SegmentInfo should explicitly track whether that segment wrote term vectors
[ https://issues.apache.org/jira/browse/LUCENE-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971303#action_12971303 ] Earwin Burrfoot commented on LUCENE-2811: - I think SegmentInfo.hasVectors should be a boolean. If this is an old index, we can check for the file's presence in the SegmentInfo constructor, set the flag properly, and on the next write the index is silently upgraded.
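The silent-upgrade idea from this comment (old segments never recorded the flag, so derive it once from file presence and persist it on the next write) could look roughly like the sketch below. The class and field names are illustrative, not Lucene's actual SegmentInfo code; only the ".tvx" term-vectors extension is taken from the issue.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: a new-format segment reads hasVectors directly from
// an explicitly stored flag, while an old-format segment falls back to
// probing for its ".tvx" term-vectors file. Hypothetical names throughout.
class SegmentInfoSketch {
    final String name;
    final boolean hasVectors;

    // New format: the flag was written explicitly at flush time, so
    // files() never needs to stat the directory.
    SegmentInfoSketch(String name, boolean hasVectors) {
        this.name = name;
        this.hasVectors = hasVectors;
    }

    // Old format: derive the flag from file presence once, in the
    // constructor; the next commit writes it out, silently upgrading.
    SegmentInfoSketch(String name, Path indexDir) {
        this.name = name;
        this.hasVectors = Files.exists(indexDir.resolve(name + ".tvx"));
    }
}
```

Computing the flag once at construction also sidesteps the caching bug described in the issue, where files() cached a file list that was never invalidated when the vectors files appeared later.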
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971248#action_12971248 ] Earwin Burrfoot commented on LUCENE-2814: - bq. We should verify the back-compat test has at least one index w/ shared doc stores. I believe I've seen some back-compat failures with my quick'n'dirty patch that removed both reading and writing shared docstores. So it should be ok.
[jira] Issue Comment Edited: (LUCENE-2611) IntelliJ IDEA setup
[ https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971058#action_12971058 ] Earwin Burrfoot edited comment on LUCENE-2611 at 12/13/10 5:36 PM: --- {quote} bq. I wonder if several .iml files can be in the same directory but their so-called "content-roots" would be set to where they are now? I'm pretty sure IntelliJ allows only one .iml per directory. {quote} My IDEA project for Lucene has all .iml files in root dir. Various content-roots for sure. I recall that worked for at least several major IDEA versions. > IntelliJ IDEA setup > --- > > Key: LUCENE-2611 > URL: https://issues.apache.org/jira/browse/LUCENE-2611 > Project: Lucene - Java > Issue Type: New Feature > Components: Build >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2611-branch-3x.patch, > LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, > LUCENE-2611-branch-3x.patch, LUCENE-2611.patch, LUCENE-2611.patch, > LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, > LUCENE-2611.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, > LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, > LUCENE-2611_test_2.patch > > > Setting up Lucene/Solr in IntelliJ IDEA can be time-consuming. > The attached patch adds a new top level directory {{dev-tools/}} with sub-dir > {{idea/}} containing basic setup files for trunk, as well as a top-level ant > target named "idea" that copies these files into the proper locations. This > arrangement avoids the messiness attendant to in-place project configuration > files directly checked into source control. > The IDEA configuration includes modules for Lucene and Solr, each Lucene and > Solr contrib, and each analysis module. A JUnit test run per module is > included. > Once {{ant idea}} has been run, the only configuration that must be performed > manually is configuring the project-level JDK. > If this patch is committed, Subversion svn:ignore properties should be > added/modified to ignore the destination module files (*.iml) in each > module's directory. > Iam Jambour has written up on the Lucene wiki a detailed set of instructions > for applying the 3.X branch patch: > http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ
[jira] Commented: (LUCENE-2611) IntelliJ IDEA setup
[ https://issues.apache.org/jira/browse/LUCENE-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971058#action_12971058 ] Earwin Burrfoot commented on LUCENE-2611: - [quote] bq. I wonder if several .iml files can be in the same directory but their so-called "content-roots" would be set to where they are now? I'm pretty sure IntelliJ allows only one .iml per directory. [quote] My IDEA project for Lucene has all .iml files in root dir. Various content-roots for sure. I recall that worked for at least several major IDEA versions. > IntelliJ IDEA setup > --- > > Key: LUCENE-2611 > URL: https://issues.apache.org/jira/browse/LUCENE-2611 > Project: Lucene - Java > Issue Type: New Feature > Components: Build >Affects Versions: 3.1, 4.0 >Reporter: Steven Rowe >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2611-branch-3x.patch, > LUCENE-2611-branch-3x.patch, LUCENE-2611-branch-3x.patch, > LUCENE-2611-branch-3x.patch, LUCENE-2611.patch, LUCENE-2611.patch, > LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, LUCENE-2611.patch, > LUCENE-2611.patch, LUCENE-2611_mkdir.patch, LUCENE-2611_test.patch, > LUCENE-2611_test.patch, LUCENE-2611_test.patch, LUCENE-2611_test.patch, > LUCENE-2611_test_2.patch > > > Setting up Lucene/Solr in IntelliJ IDEA can be time-consuming. > The attached patch adds a new top level directory {{dev-tools/}} with sub-dir > {{idea/}} containing basic setup files for trunk, as well as a top-level ant > target named "idea" that copies these files into the proper locations. This > arrangement avoids the messiness attendant to in-place project configuration > files directly checked into source control. > The IDEA configuration includes modules for Lucene and Solr, each Lucene and > Solr contrib, and each analysis module. A JUnit test run per module is > included. > Once {{ant idea}} has been run, the only configuration that must be performed > manually is configuring the project-level JDK. 
> If this patch is committed, Subversion svn:ignore properties should be > added/modified to ignore the destination module files (*.iml) in each > module's directory. > Iam Jambour has written up on the Lucene wiki a detailed set of instructions > for applying the 3.X branch patch: > http://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2814) stop writing shared doc stores across segments
[ https://issues.apache.org/jira/browse/LUCENE-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971057#action_12971057 ] Earwin Burrfoot commented on LUCENE-2814: - I'll take this. I think. > stop writing shared doc stores across segments > -- > > Key: LUCENE-2814 > URL: https://issues.apache.org/jira/browse/LUCENE-2814 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Affects Versions: 3.1, 4.0 >Reporter: Michael McCandless >Assignee: Michael McCandless > > Shared doc stores enables the files for stored fields and term vectors to be > shared across multiple segments. We've had this optimization since 2.1 I > think. > It works best against a new index, where you open an IW, add lots of docs, > and then close it. In that case all of the written segments will reference > slices a single shared doc store segment. > This was a good optimization because it means we never need to merge these > files. But, when you open another IW on that index, it writes a new set of > doc stores, and then whenever merges take place across doc stores, they must > now be merged. > However, since we switched to shared doc stores, there have been two > optimizations for merging the stores. First, we now bulk-copy the bytes in > these files if the field name/number assignment is "congruent". Second, we > now force congruent field name/number mapping in IndexWriter. This means > this optimization is much less potent than it used to be. > Furthermore, the optimization adds *a lot* of hair to > IndexWriter/DocumentsWriter; this has been the source of sneaky bugs over > time, and causes odd behavior like a merge possibly forcing a flush when it > starts. Finally, with DWPT (LUCENE-2324), which gets us truly concurrent > flushing, we can no longer share doc stores. > So, I think we should turn off the write-side of shared doc stores to pave > the path for DWPT to land on trunk and simplify IW/DW. 
We still must support > reading them (until 5.0), but the read side is far less hairy. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2802) DirectoryReader ignores NRT SegmentInfos in #isOptimized()
[ https://issues.apache.org/jira/browse/LUCENE-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968503#action_12968503 ] Earwin Burrfoot commented on LUCENE-2802: - Patch looks cool. > DirectoryReader ignores NRT SegmentInfos in #isOptimized() > -- > > Key: LUCENE-2802 > URL: https://issues.apache.org/jira/browse/LUCENE-2802 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 3.1, 4.0 >Reporter: Simon Willnauer >Assignee: Simon Willnauer > Attachments: LUCENE-2802.patch, LUCENE-2802.patch > > > DirectoryReader only takes the shared (with IW) SegmentInfos into account in > DirectoryReader#isOptimized(). This can return true even if the actual > realtime reader sees more than one segment. > {code} > public boolean isOptimized() { > ensureOpen(); > // if segmentInfos changes in IW this can return a false positive > return segmentInfos.size() == 1 && !hasDeletions(); > } > {code} > DirectoryReader should check if this reader has a non-null segmentInfosStart > and use that instead -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2802) DirectoryReader ignores NRT SegmentInfos in #isOptimized()
[ https://issues.apache.org/jira/browse/LUCENE-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967238#action_12967238 ] Earwin Burrfoot commented on LUCENE-2802: - Heh, I've [mentioned|https://issues.apache.org/jira/browse/LUCENE-2355?focusedCommentId=12874857&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12874857] this earlier in LUCENE-2355. :) > DirectoryReader ignores NRT SegmentInfos in #isOptimized() > -- > > Key: LUCENE-2802 > URL: https://issues.apache.org/jira/browse/LUCENE-2802 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Affects Versions: 4.0 >Reporter: Simon Willnauer > Attachments: LUCENE-2802.patch > > > DirectoryReader only takes the shared (with IW) SegmentInfos into account in > DirectoryReader#isOptimized(). This can return true even if the actual > realtime reader sees more than one segment. > {code} > public boolean isOptimized() { > ensureOpen(); > // if segmentInfos changes in IW this can return a false positive > return segmentInfos.size() == 1 && !hasDeletions(); > } > {code} > DirectoryReader should check if this reader has a non-null segmentInfosStart > and use that instead -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
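The fix the issue asks for boils down to deciding against a point-in-time view of the segments captured at reader-open time, rather than the SegmentInfos instance shared with the writer. A minimal standalone sketch of that idea (class and field names here are illustrative, not Lucene's actual internals):

```java
import java.util.ArrayList;
import java.util.List;

class DirectoryReaderSketch {
    // Snapshot of segment names taken when the reader was opened.
    private final List<String> segmentSnapshot;
    private final boolean hasDeletions;

    DirectoryReaderSketch(List<String> liveSegments, boolean hasDeletions) {
        // Copy the list, so later changes made by the writer are
        // invisible to this reader.
        this.segmentSnapshot = new ArrayList<>(liveSegments);
        this.hasDeletions = hasDeletions;
    }

    boolean isOptimized() {
        // Decide against the snapshot, not the shared, mutable infos.
        return segmentSnapshot.size() == 1 && !hasDeletions;
    }
}
```

The original bug is exactly the absence of this copy: consulting the live, shared structure makes the answer race with concurrent flushes in IndexWriter.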
[jira] Commented: (LUCENE-2799) MMapDirectory not designed for inheritance
[ https://issues.apache.org/jira/browse/LUCENE-2799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966865#action_12966865 ] Earwin Burrfoot commented on LUCENE-2799: - I think in such cases it's always best to copy-paste stuff into your project and adapt it. It's simple, it doesn't break when you upgrade, it works with clumsily-built code. Designing actual classes (not API interfaces, but full-blown working code) for inheritance is _very_ complex. I've seen few people trying this for real, and even fewer succeeding. > MMapDirectory not designed for inheritance > -- > > Key: LUCENE-2799 > URL: https://issues.apache.org/jira/browse/LUCENE-2799 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Affects Versions: 3.0.3 >Reporter: René Treffer > > How to reproduce: > Try to inherit from MMapDirectory to change the openInput logic (open files > from different directories). > Expected result: > Inherit from MMapDirectory, override the one method, done. > Actual result: > It's impossible to override the method as the inner classes would be > missing. It's impossible to fork the inner classes as they depend on a final > method with default visibility (cleanMapping). > It turns out to be the easiest option to completely fork the code and replace > just the method in question. > Possible fix: > Change the visibility of most members and subtypes to be at least protected > and avoid the default visibility. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
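For contrast, here is a sketch of the subclassing René wanted (resolving files from multiple directories by overriding one method). This is not Lucene's real Directory API, just a hypothetical class deliberately designed for inheritance: the overridable hook is protected and non-final, which is exactly what MMapDirectory's package-private final internals prevent.

```java
import java.nio.file.Path;
import java.util.List;

class SimpleDirectory {
    protected final Path root;

    SimpleDirectory(Path root) { this.root = root; }

    // Designed-for-inheritance variant: protected, non-final, overridable.
    protected Path resolve(String name) { return root.resolve(name); }
}

class MultiRootDirectory extends SimpleDirectory {
    private final List<Path> extraRoots;

    MultiRootDirectory(Path root, List<Path> extraRoots) {
        super(root);
        this.extraRoots = extraRoots;
    }

    @Override
    protected Path resolve(String name) {
        // Look in the extra roots first, fall back to the primary root.
        for (Path p : extraRoots) {
            if (p.resolve(name).toFile().exists()) return p.resolve(name);
        }
        return super.resolve(name);
    }
}
```

Keeping such hooks working across versions is the hard part Earwin is pointing at, which is why copy-and-adapt is often the pragmatic choice.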
[jira] Commented: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966550#action_12966550 ] Earwin Burrfoot commented on LUCENE-2790: - Ok, let's commit? There's no need to force the first few commits to CFS. CFS's sole purpose is to keep the number of simultaneously open files low. You're not likely to see frightening numbers with only a pair of segments in the index. Later these segments are merged (and probably CFSed), so no worries. > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch, LUCENE-2790.patch, > LUCENE-2790.patch, LUCENE-2790.patch, LUCENE-2790.patch, LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2471) Supporting bulk copies in Directory
[ https://issues.apache.org/jira/browse/LUCENE-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966358#action_12966358 ] Earwin Burrfoot commented on LUCENE-2471: - Hmmm. Are we going to do this? Optimized bulk copies IndexInput -> IndexOutput for merges. I currently see II.copyBytes(IndexOutput out, long numBytes) method in trunk, but it's a little bit of a mess (II.copyBytes calls IO.copyBytes, strange overrides doing the same thing in various ways), no optimizations for FSDirectory (or at least NIOFSdirectory) case, no offset parameter? not used when bulk-merging? (well, DataOutput.copyBytes _is_ used, but there's a single inefficient version of it) > Supporting bulk copies in Directory > --- > > Key: LUCENE-2471 > URL: https://issues.apache.org/jira/browse/LUCENE-2471 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Earwin Burrfoot > Fix For: 3.1, 4.0 > > > A method can be added to IndexOutput that accepts IndexInput, and writes > bytes using it as a source. > This should be used for bulk-merge cases (offhand - norms, docstores?). Some > Directories can then override default impl and skip intermediate buffers > (NIO, MMap, RAM?). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
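The "single inefficient version" Earwin mentions is essentially a loop through an intermediate staging buffer. A minimal sketch of that default bulk copy, using plain java.io streams as stand-ins for Lucene's IndexInput/IndexOutput (optimized Directory implementations could override this to skip the staging buffer entirely):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class BulkCopy {
    private static final int BUFFER_SIZE = 1024;

    // Copy exactly numBytes from in to out through a reusable buffer.
    static void copyBytes(InputStream in, OutputStream out, long numBytes) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        while (numBytes > 0) {
            int toRead = (int) Math.min(buffer.length, numBytes);
            int read = in.read(buffer, 0, toRead);
            if (read < 0) throw new IOException("unexpected EOF while bulk-copying");
            out.write(buffer, 0, read);
            numBytes -= read;
        }
    }
}
```

The point of the issue is that an NIO- or mmap-backed Directory can implement the same contract with a channel-to-channel transfer or a direct memory copy, avoiding the extra pass through `buffer`.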
[jira] Closed: (LUCENE-2307) Spurious exception in TestIndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot closed LUCENE-2307. --- Resolution: Cannot Reproduce Never reproduced since, closing as stale and outdated > Spurious exception in TestIndexWriter > - > > Key: LUCENE-2307 > URL: https://issues.apache.org/jira/browse/LUCENE-2307 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 3.0.2 > Environment: MacOS X, Java 6 >Reporter: Earwin Burrfoot > > Happened on trunk: > [junit] Testsuite: org.apache.lucene.index.TestIndexWriter > [junit] Tests run: 106, Failures: 1, Errors: 0, Time elapsed: 18.567 sec > [junit] > [junit] - Standard Output --- > [junit] java.io.IOException: background merge hit exception: _0:c19 > _1:c19 _2:c2 _3:c2 _4:c2 _5:c2 _6:c2 _7:c2 _8:c1 into _9 [optimize] > [mergeDocStores] > [junit] at > org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2341) > [junit] at > org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2280) > [junit] at > org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2250) > [junit] at > org.apache.lucene.index.TestIndexWriter$CommitAndAddIndexes.doBody(TestIndexWriter.java:3759) > [junit] at > org.apache.lucene.index.TestIndexWriter$RunAddIndexesThreads$1.run(TestIndexWriter.java:3708) > [junit] Caused by: > org.apache.lucene.index.MergePolicy$MergeAbortedException: merge is aborted: > _0:c19 _1:c19 _2:c2 _3:c2 _4:c2 _5:c2 _6:c2 _7:c2 _8:c1 into _9 [optimize] > [mergeDocStores] > [junit] at > org.apache.lucene.index.MergePolicy$OneMerge.checkAborted(MergePolicy.java:135) > [junit] at > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4308) > [junit] at > org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3958) > [junit] at > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:339) > [junit] at > 
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:407) > [junit] - --- > [junit] Testcase: > testAddIndexesWithClose(org.apache.lucene.index.TestIndexWriter): FAILED > [junit] null > [junit] junit.framework.AssertionFailedError: null > [junit] at > org.apache.lucene.index.TestIndexWriter.testAddIndexesWithClose(TestIndexWriter.java:3835) > [junit] at > org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:252) > [junit] > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966285#action_12966285 ] Earwin Burrfoot commented on LUCENE-2790: - Shai, what about: bq. My only concern is... it looks like addIndexes(IR[]), with compound file used in the end, may fail to delete the non-compound files once the SegmentInfo is committed? I fixed everything else, but can't answer this question. > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch, LUCENE-2790.patch, > LUCENE-2790.patch, LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2790: Attachment: LUCENE-2790.patch Fixed your test failure > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch, LUCENE-2790.patch, > LUCENE-2790.patch, LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2790: Attachment: LUCENE-2790.patch Okay, this patch fixes the remaining threading issue in IW.mergeMiddle, and three tests that were expecting CFS segments and weren't getting them, because flush now respects noCFSRatio (default 0.1) > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch, LUCENE-2790.patch, > LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966112#action_12966112 ] Earwin Burrfoot commented on LUCENE-2790: - bq. I checked who implements useCompoundFile and all I find is LogMP and NoMP, both don't iterate on the SegmentInfos. What MP did you test with? Apply my patch, it changes LogMP to use SegmentInfos. bq. So if you can paste here the 'ant test' that reproduces it, it'd be great. ant test -Dtestcase=TestAddIndexes -Dtestmethod=testAddIndexesWithThreads -Dtests.seed=5369960668186287821:331425426639083833 -Dtests.codec=randomPerField The test is threaded, so it doesn't fail always. > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Earwin Burrfoot updated LUCENE-2790: Attachment: LUCENE-2790.patch Check this patch out. It moves noCFS ratio to useCompoundFile(SIS, SI) and drops useCompoundFile from OneMerge, so all decisions about using compound files now happen in a single place. It also highlights the problem with your patch - when calling useCompoundFile from addIndexes, you should hold a lock, so segmentInfos won't be modified while mergePolicy inspects them. > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Issue Comment Edited: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966108#action_12966108 ] Earwin Burrfoot edited comment on LUCENE-2790 at 12/2/10 8:12 AM: -- Check this patch out. It changes useCompoundFile(SIS, SI) to respect noCFSRatio and drops useCompoundFile from OneMerge, so all decisions about using compound files now happen in a single place. It also highlights the problem with your patch - when calling useCompoundFile from addIndexes, you should hold a lock, so segmentInfos won't be modified while mergePolicy inspects them. was (Author: earwin): Check this patch out. It moves noCFS ratio to useCompoundFile(SIS, SI) and drops useCompoundFile from OneMerge, so all decisions about using compound files now happen in a single place. It also highlights the problem with your patch - when calling useCompoundFile from addIndexes, you should hold a lock, so segmentInfos won't be modified while mergePolicy inspects them. > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch, LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
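The noCFSRatio decision the patch consolidates into useCompoundFile(SIS, SI) can be sketched as follows. This is an assumed, simplified shape (plain sizes instead of SegmentInfos), not the exact Lucene code: a segment gets a compound file only when it is a small fraction of the total index, so huge segments skip CFS while many tiny ones still keep the open-file count down.

```java
final class CompoundFilePolicy {
    private final double noCFSRatio; // e.g. the 0.1 default mentioned above

    CompoundFilePolicy(double noCFSRatio) { this.noCFSRatio = noCFSRatio; }

    // All CFS decisions go through this one method, mirroring the patch's
    // goal of deciding in a single place for both flushes and merges.
    boolean useCompoundFile(long[] allSegmentSizes, long candidateSize) {
        long total = 0;
        for (long s : allSegmentSizes) total += s;
        // Empty index: default to compound files.
        if (total <= 0) return true;
        return candidateSize <= noCFSRatio * total;
    }
}
```

Note that in the real IndexWriter this check inspects the live segmentInfos, which is why the comment above insists the caller must hold the lock while the merge policy iterates them.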
[jira] Commented: (LUCENE-2790) IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile
[ https://issues.apache.org/jira/browse/LUCENE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966103#action_12966103 ] Earwin Burrfoot commented on LUCENE-2790: - Fails addIndexesWithThreads with ConcurrentModificationException, if MergePolicy actually tries to iterate infos passed to useCompoundFile(SIS, SI). > IndexWriter should call MP.useCompoundFile and not LogMP.getUseCompoundFile > --- > > Key: LUCENE-2790 > URL: https://issues.apache.org/jira/browse/LUCENE-2790 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2790.patch > > > Spin off from here: > http://www.gossamer-threads.com/lists/lucene/java-dev/112311. > I will attach a patch shortly that addresses the issue on trunk. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2789) Let codec decide to use compound file system or not
[ https://issues.apache.org/jira/browse/LUCENE-2789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966071#action_12966071 ] Earwin Burrfoot commented on LUCENE-2789: - I'd like to see a switch like setNeverEverUseCompoundFiles(true) somewhere. We'll still need to copy-paste CFS handling code into each new Codec :/ > Let codec decide to use compound file system or not > --- > > Key: LUCENE-2789 > URL: https://issues.apache.org/jira/browse/LUCENE-2789 > Project: Lucene - Java > Issue Type: Improvement > Components: Codecs, Index >Reporter: Simon Willnauer > > While working on LUCENE-2186 and in the context of recent [mails | > http://www.lucidimagination.com/search/document/e75cfa6050d5176/consolidate_mp_and_lmp#97c69a198952ebaa] > about consolidating MergePolicy and LogMergePolicy I wanna propose a rather > big change to how Compound Files are created / handled in IW. Since Codecs have > been introduced we have several somewhat different ways of how data is written > to the index. Sep codec for instance writes different files for index data > and DocValues will write one file per field and segment. Eventually codecs > need to have more control over how files are written, i.e. whether CFS should be > used or not is IMO really a matter of the codec used for writing. > On the other hand when you look at IW internals CFS really pollutes the > indexing code and relies on information from inside a codec (see > SegmentWriteState.flushedFiles); actually this differentiation spreads across > many classes related to indexing including the LogMergePolicy. IMO how new > flushed segments are written has nothing to do with MP in the first place and > MP currently chooses whether a newly flushed segment is CFS or not (correct me > if I am wrong); pushing all this logic down to codecs would make lots of code > much easier and cleaner. > As mike said this would also reduce the API footprint if we make it private > to the codec. 
I can imagine some situations where you really want control > over certain fields to be stored as non-CFS and other to be stored as CFS. > Codecs might need more information about other segments during a merge to > decide if or not to use CFS based on the segments size but we can easily > change that API. From a reading point of view we already have Codec#files > that can decide case by case what files belong to this codec. > let me know the thoughts -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2779) Use ConcurrentHashMap in RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965380#action_12965380 ] Earwin Burrfoot commented on LUCENE-2779: - bq. So I ended up writing the following code and comment Looks good. > Use ConcurrentHashMap in RAMDirectory > - > > Key: LUCENE-2779 > URL: https://issues.apache.org/jira/browse/LUCENE-2779 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2779-backwardsfix.patch, LUCENE-2779.patch, > LUCENE-2779.patch, LUCENE-2779.patch, TestCHM.java > > > RAMDirectory synchronizes on its instance in many places to protect access to > map of RAMFiles, in addition to updating the sizeInBytes member. In many > places the sync is done for 'read' purposes, while only in few places we need > 'write' access. This looks like a perfect use case for ConcurrentHashMap > Also, syncing around sizeInBytes is unnecessary IMO, since it's an AtomicLong > ... > I'll post a patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2779) Use ConcurrentHashMap in RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965296#action_12965296 ] Earwin Burrfoot commented on LUCENE-2779: - Quoting Sun JDK 1.6: {code} public ArrayList(Collection<? extends E> c) { elementData = c.toArray(); size = elementData.length; // c.toArray might (incorrectly) not return Object[] (see 6260652) if (elementData.getClass() != Object[].class) elementData = Arrays.copyOf(elementData, size, Object[].class); } {code} It calls toArray() on the provided collection. You might as well skip wrapping with ArrayList and use toArray directly :D > Use ConcurrentHashMap in RAMDirectory > - > > Key: LUCENE-2779 > URL: https://issues.apache.org/jira/browse/LUCENE-2779 > Project: Lucene - Java > Issue Type: Improvement > Components: Store >Reporter: Shai Erera >Assignee: Shai Erera >Priority: Minor > Fix For: 3.1, 4.0 > > Attachments: LUCENE-2779-backwardsfix.patch, LUCENE-2779.patch, > LUCENE-2779.patch, LUCENE-2779.patch, TestCHM.java > > > RAMDirectory synchronizes on its instance in many places to protect access to > map of RAMFiles, in addition to updating the sizeInBytes member. In many > places the sync is done for 'read' purposes, while only in few places we need > 'write' access. This looks like a perfect use case for ConcurrentHashMap > Also, syncing around sizeInBytes is unnecessary IMO, since it's an AtomicLong > ... > I'll post a patch shortly. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
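Following Earwin's point: wrapping keySet() in an ArrayList just to get an array does the toArray() work twice. A hedged sketch of the direct approach for a listAll-style method (plain ConcurrentHashMap, not the real RAMDirectory code):

```java
import java.util.concurrent.ConcurrentHashMap;

class ListAllSketch {
    static String[] listAll(ConcurrentHashMap<String, byte[]> files) {
        // toArray(new String[0]) guarantees a String[] result, sidestepping
        // the Object[] pitfall the quoted ArrayList constructor works around.
        return files.keySet().toArray(new String[0]);
    }
}
```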
[jira] Commented: (LUCENE-2785) TopFieldCollector throws AIOOBE if numHits is 0
[ https://issues.apache.org/jira/browse/LUCENE-2785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965202#action_12965202 ] Earwin Burrfoot commented on LUCENE-2785: - "A Collector, that counts" - priceless. And, yes, I have one of these too. > TopFieldCollector throws AIOOBE if numHits is 0 > --- > > Key: LUCENE-2785 > URL: https://issues.apache.org/jira/browse/LUCENE-2785 > Project: Lucene - Java > Issue Type: Bug > Components: Search >Reporter: Michael McCandless >Priority: Minor > Fix For: 3.1, 4.0 > > > See solr-user thread "ArrayIndexOutOfBoundsException for query with rows=0 > and sort param". > I think we should just create a null collector (only tallies up totalHits) if > numHits is 0? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
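The "Collector, that counts" fix for numHits == 0 is tiny. A sketch of its shape (assumed; Lucene's Collector API has more methods than shown here): when no hits are requested, skip the priority queue entirely and just tally totalHits, so no array of size 0 is ever indexed into.

```java
final class TotalHitCountCollectorSketch {
    private int totalHits;

    // No scoring, no priority queue - just count every matching doc.
    void collect(int doc) { totalHits++; }

    int getTotalHits() { return totalHits; }
}
```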
[jira] Commented: (LUCENE-2779) Use ConcurrentHashMap in RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965194#action_12965194 ] Earwin Burrfoot commented on LUCENE-2779: -
bq. Cloning the keySet() will be exactly the 'snapshotty' behavior we're looking for.
Before I made the change, you could call listAll(), lock RAMDir, return the array and before/after that files could be added/removed. W/ the clone, we'll get the same behavior - files can be added/removed before the clone, the clone would reflect those changes, whatever happens after the clone is invisible to the iterator - hence why I consider it snapshotty. There are still weird cases, when file B was added after deleting A, but you see both in listAll(). These remain, so it's not a "point in time", it's more like a "span in time". Whatever happened after toArray was invisible to the array too, so the behaviour hasn't changed.
bq. Also, what do you know, I've hit an AIOB exception thrown from listAll() when it called toArray() :)
But _this_ fact is really interesting. toArray() with no parameters failed on CHM? Cloning has a meaning now :)
bq. Earwin, I did not *just* backport it.
Didn't mean to offend anyone, sorry if I did.
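The "span in time, not point in time" behaviour described above is observable with plain JDK code: an array copied from a ConcurrentHashMap key set is fixed once taken, but the copy itself walks a weakly consistent iterator, so writes racing with the copy may or may not appear. A minimal sketch; the `snapshot` helper is an illustrative stand-in, not Lucene code:

```java
import java.util.concurrent.ConcurrentHashMap;

public class SnapshotSketch {
    // Hypothetical listAll()-style helper: copy the current key set.
    static String[] snapshot(ConcurrentHashMap<String, Integer> files) {
        return files.keySet().toArray(new String[0]);
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, Integer> files = new ConcurrentHashMap<>();
        files.put("A", 1);

        String[] names = snapshot(files);

        // Changes made after the copy are invisible to the array; changes
        // made by other threads *during* the copy may or may not show up,
        // because CHM iterators are only weakly consistent.
        files.remove("A");
        files.put("B", 2);

        System.out.println(names.length + " " + names[0]); // 1 A
        System.out.println(files.containsKey("A"));        // false
    }
}
```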
[jira] Commented: (LUCENE-2779) Use ConcurrentHashMap in RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964982#action_12964982 ] Earwin Burrfoot commented on LUCENE-2779: - Maybe we should commit it to 4.0 only? Doesn't look like a really important patch that just _has_ to be backported.
[jira] Commented: (LUCENE-2779) Use ConcurrentHashMap in RAMDirectory
[ https://issues.apache.org/jira/browse/LUCENE-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964920#action_12964920 ] Earwin Burrfoot commented on LUCENE-2779: - I don't believe cloning the keySet() will make it more "snapshotty". There's no way to get a completely consistent snapshot of some concurrent data structure without locking it completely, or using a variant of the copy-on-write approach.
[jira] Commented: (LUCENE-2781) Drop deprecations from trunk
[ https://issues.apache.org/jira/browse/LUCENE-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964755#action_12964755 ] Earwin Burrfoot commented on LUCENE-2781: - Ok, restore it then and fix deprecation note to "will drop it in Lucene 5.0" or something? > Drop deprecations from trunk > > > Key: LUCENE-2781 > URL: https://issues.apache.org/jira/browse/LUCENE-2781 > Project: Lucene - Java > Issue Type: Task >Affects Versions: 4.0 >Reporter: Earwin Burrfoot >Assignee: Robert Muir >Priority: Minor > Attachments: drop-deprecations.patch, drop-deprecations.patch, > drop-deprecations.patch, drop-deprecations.patch > > > subj. > Also, to each remaining deprecation add release version when it first > appeared. > Patch incoming. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2781) Drop deprecations from trunk
[ https://issues.apache.org/jira/browse/LUCENE-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964751#action_12964751 ] Earwin Burrfoot commented on LUCENE-2781: - Hmm.. and regarding this exact case. DutchStemFilter.java had this:
bq. @deprecated Use {@link SnowballFilter} with {@link org.tartarus.snowball.ext.DutchStemmer} instead, which has the same functionality.
Considering the following snippet from DutchStemmer.java:
bq. The algorithm is an implementation of the dutch stemming algorithm (http://snowball.tartarus.org/algorithms/dutch/stemmer.html) in Martin Porter's snowball project.
we can reliably drop DutchStemFilter+DutchStemmer, people will fix their code to use SnowballFilter and happily read their 3.0 indexes. No?