[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2009-12-12 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789850#action_12789850
 ] 

Shai Erera commented on LUCENE-2126:


bq. I bet that a lot of people who used the payload feature before took a 
ByteArrayOutputStream together with DataOutputStream 

I actually use ByteBuffer, which has similar methods. That works well, though 
only if you know the size of the needed byte[] up front. Otherwise, you either 
write the code that grows the ByteBuffer yourself, or use 
DataOutputStream(ByteArrayOutputStream).
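
To make the trade-off concrete, here is a minimal sketch of both approaches. The class and method names (PayloadEncoding, fixedSize, growing) are made up for illustration and are not part of Lucene or of the patch; only the java.nio and java.io calls are real.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Hypothetical helper, for illustration only.
class PayloadEncoding {
    // Size known up front: one int (4 bytes) plus one long (8 bytes).
    static byte[] fixedSize(int id, long timestamp) {
        ByteBuffer buf = ByteBuffer.allocate(12);
        buf.putInt(id).putLong(timestamp);
        return buf.array();
    }

    // Size not known up front: the ByteArrayOutputStream grows as needed.
    static byte[] growing(int[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int v : values) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Both paths write big-endian values, so the resulting byte arrays are interchangeable for the same sequence of ints and longs.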

Michael, I read through the patch (briefly though), and I was confused by the 
names DataInput/Output. Initially, when I read this issue, I thought you meant 
that IndexInput/Output should implement Java's DataInput/Output, but now I see 
you created two new classes with those names. So first, can we perhaps name 
them differently, like LuceneInput/Output or something similar, so they are not 
confused with Java's? Second, why not have them implement Java's 
DataInput/Output and add additional methods on top, like readVInt(), 
readVLong() etc.? You can keep the abstract LuceneInput/Output classes to 
provide the common implementation.

BTW, a small optimization that I think can be made in these classes is to 
introduce an internal ByteBuffer of size 8. In methods like readInt(), you 
can read 4 bytes into the buffer by calling readBytes(buf.array(), 0, 4), and 
then call buf.getInt(). That replaces 4 calls to readByte() with a single 
readBytes() call. The same goes for long and for the write variants. It doesn't 
work for readVInt() though, because we need to read one byte at a time to 
decode. If these classes are usually used through BufferedIndexInput/Output 
this may not matter much, but it will still save 2/4 method calls.
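
A rough sketch of that scratch-buffer idea follows. SketchDataInput and ArrayDataInput are hypothetical stand-ins, not the classes from the patch. The sketch uses the absolute getInt(0)/getLong(0) so the buffer position never has to be reset, and it assumes big-endian byte order, which is ByteBuffer's default.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the proposed DataInput; not Lucene's actual class.
abstract class SketchDataInput {
    // Scratch buffer reused across calls; 8 bytes covers both int and long.
    // Note: a shared scratch buffer makes instances unsafe to share across threads.
    private final ByteBuffer scratch = ByteBuffer.allocate(8);

    abstract byte readByte();

    // Default bulk read falls back to single-byte reads;
    // buffered subclasses would override it with an array copy.
    void readBytes(byte[] b, int offset, int len) {
        for (int i = 0; i < len; i++) {
            b[offset + i] = readByte();
        }
    }

    int readInt() {
        readBytes(scratch.array(), 0, 4);
        return scratch.getInt(0);   // absolute read, big-endian by default
    }

    long readLong() {
        readBytes(scratch.array(), 0, 8);
        return scratch.getLong(0);
    }
}

// Simple in-memory implementation with a bulk readBytes().
class ArrayDataInput extends SketchDataInput {
    private final byte[] data;
    private int pos;

    ArrayDataInput(byte[] data) { this.data = data; }

    @Override byte readByte() { return data[pos++]; }

    @Override void readBytes(byte[] b, int offset, int len) {
        System.arraycopy(data, pos, b, offset, len);
        pos += len;
    }
}
```

The gain only materializes when readBytes() is overridden with a bulk copy, as in ArrayDataInput; with the default fallback the per-byte calls simply move one level down.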

> Split up IndexInput and IndexOutput into DataInput and DataOutput
> -
>
> Key: LUCENE-2126
> URL: https://issues.apache.org/jira/browse/LUCENE-2126
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: Flex Branch
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: Flex Branch
>
> Attachments: lucene-2126.patch
>
>
> I'd like to introduce the two new classes DataInput and DataOutput
> that contain all methods from IndexInput and IndexOutput that actually
> decode or encode data, such as readByte()/writeByte(),
> readVInt()/writeVInt().
> Methods like getFilePointer(), seek(), close(), etc., which are not
> related to data encoding, but to files as input/output source stay in
> IndexInput/IndexOutput.
> This patch also changes ByteSliceReader/ByteSliceWriter to extend
> DataInput/DataOutput. Previously ByteSliceReader implemented the
> methods that stay in IndexInput by throwing RuntimeExceptions.
> See also LUCENE-2125.
> All tests pass.
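
The split described in the issue can be pictured roughly as follows. This is only a sketch under assumptions: the class names carry a "Sketch" suffix because they are not the classes from lucene-2126.patch, and only a few of the methods named in the description are shown.

```java
// Decoding methods only: no notion of a file.
abstract class DataInputSketch {
    abstract byte readByte();

    // Big-endian int assembled from four single-byte reads.
    int readInt() {
        return ((readByte() & 0xFF) << 24) | ((readByte() & 0xFF) << 16)
             | ((readByte() & 0xFF) << 8)  |  (readByte() & 0xFF);
    }
}

// File-oriented methods stay in the subclass.
abstract class IndexInputSketch extends DataInputSketch {
    abstract long getFilePointer();
    abstract void seek(long pos);
    abstract void close();
}

// An in-memory reader extends only the decoding class, so it has no
// seek()/close() methods that it would have to implement by throwing.
final class SliceReaderSketch extends DataInputSketch {
    private final byte[] slice;
    private int pos;

    SliceReaderSketch(byte[] slice) { this.slice = slice; }

    @Override byte readByte() { return slice[pos++]; }
}
```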

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Hudson build is back to normal: Lucene-trunk #1026

2009-12-12 Thread Apache Hudson Server
See 






Re: [jira] Commented: (LUCENE-2122) Use JUnit4 capabilites for more thorough Locale testing for classes deriving from LocalizedTestCase

2009-12-12 Thread Erick Erickson
Robert:

The -r4 patch runs for you and you want me to look at your patch compared to
r4? Sure, I'll do that, but not until tomorrow; I do much better work when I'm
not tired.

I confess I haven't looked at your patch beyond installing it to see if I
could reproduce the failure (looks like our emails crossed). But it's
*still* peculiar that it behaves differently between our two machines. OTOH,
maybe your patch will fail on my machine sometime tonight; my 4 successes
aren't very statistically significant after all.

Erick

On Sat, Dec 12, 2009 at 9:14 PM, Robert Muir (JIRA)  wrote:

>
>[
> https://issues.apache.org/jira/browse/LUCENE-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789837#action_12789837]
>
> Robert Muir commented on LUCENE-2122:
> -
>
> btw, I left 'ant clean test' running in a loop and just checked it with
> this patch, no problems.
> So perhaps it's my own incompetence. Erick, can you take a look? Do you see
> some obvious problem?
>
>
> > Use JUnit4 capabilites for more thorough Locale testing for classes
> > deriving from LocalizedTestCase
> > ---
> >
> > Key: LUCENE-2122
> > URL: https://issues.apache.org/jira/browse/LUCENE-2122
> > Project: Lucene - Java
> > Issue Type: Improvement
> > Components: Other
> > Affects Versions: 3.1
> > Reporter: Erick Erickson
> > Priority: Minor
> > Fix For: 3.1
> >
> > Attachments: LUCENE-2122-r2.patch, LUCENE-2122-r3.patch,
> > LUCENE-2122-r4.patch, LUCENE-2122.patch, LUCENE-2122.patch
> >
> >
> > Use the @Parameterized capabilities of JUnit4 to allow more extensive
> > testing of Locales.
>


Re: [jira] Updated: (LUCENE-2122) Use JUnit4 capabilites for more thorough Locale testing for classes deriving from LocalizedTestCase

2009-12-12 Thread Erick Erickson
Hmm, you can't get either patch to work reliably.
On the other hand, I can't get either patch to fail.
I ran the whole ant clean test thing half a dozen times.
I'll make a script to loop all night tonight and we'll see.
I also ran just the TestQueryParser around 700 times
from Ant via a shell script. No problems. No problems
in IntelliJ. Siiggghhh.

Anybody else want to try applying either patch and see
what happens? I'd hate to lose the capabilities of the
Parameterized tests because of a gremlin that only exists
on Robert's machine. I'd also hate to introduce "cool new
capabilities" that started training us to ignore test failures.
That's bad. Very bad.

Robert: What kind of machine are you running on? I'm running
on a Macbook Pro...

As it stands, I'm not sure whether parameterized tests are
the issue or whether the issue is Locale testing. Or whether
Robert has some peculiar setup. Or, for that matter, whether
I have some peculiar setup that makes it work by hiding an
instability. It sure would be nice to figure out where the
fragility is before relying on Parameterized tests...

Robert:
If you have the patience, could you try your patch out and
capture the failure? I'm especially curious if your patch
fails on the same language every time. Who knows? On
your machine, this *could* be hitting an edge case that's
actually a flaw in the code somewhere rather than an artifact
of the test framework. I don't even know if my machine
is using all of the same Locales as yours.

I'd have a go at figuring out what was going on, but I can't make
it fail. "It works on my machine" doesn't leave me very many
directions forward.

But I'm so glad that Robert is finding this nonsense
*before* we get too much farther down this road rather than
after.

I'll poke around on the internet and see if there's anything there
that I can see.

Erick

On Sat, Dec 12, 2009 at 8:55 AM, Robert Muir (JIRA)  wrote:

>
> [
> https://issues.apache.org/jira/browse/LUCENE-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Robert Muir updated LUCENE-2122:
> 
>
>Assignee: (was: Robert Muir)
>
> I am unassigning in case someone else can figure this one out; I'm at my
> wits' end here :)
> Perhaps it's just something weird about my environment.
>


[jira] Commented: (LUCENE-2122) Use JUnit4 capabilites for more thorough Locale testing for classes deriving from LocalizedTestCase

2009-12-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789837#action_12789837
 ] 

Robert Muir commented on LUCENE-2122:
-

btw, I left 'ant clean test' running in a loop and just checked it with this 
patch, no problems.
So perhaps it's my own incompetence. Erick, can you take a look? Do you see some 
obvious problem?





[jira] Commented: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2009-12-12 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789834#action_12789834
 ] 

Michael Busch commented on LUCENE-2126:
---

I disagree with you here: IMO, introducing DataInput/Output actually makes the 
API easier for the "normal" user to understand.

I would think that most users don't implement IndexInput/Output extensions, but 
simply use the out-of-the-box Directory implementations, which provide 
IndexInput/Output impls. Also, most users probably don't even call the 
IndexInput/Output APIs directly. 

{quote}
Do nothing and assume that the sort of advanced user who writes a posting
codec won't do something incredibly stupid like call indexInput.close().
{quote}

Writing a posting codec is much more advanced than using 2125's features. 
Ideally, a user who simply wants to store some specific information in the 
posting list, such as a boost, a part-of-speech identifier, another VInt, etc., 
should with 2125 only have to implement a new attribute including the 
serialize()/deserialize() methods. People who want to do that don't need to 
know anything about Lucene's API layer. They only need to know the APIs that 
DataInput/Output provide and will not get confused by methods like seek() or 
close(). For the standard user who only wants to write such an attribute it 
should not matter what Lucene's IO structure looks like, so even if we make 
changes that go in Lucy's direction in the future (IndexInput/Output owning a 
filehandle vs. the need to extend them), the serialize()/deserialize() methods 
of the attribute would still work with DataInput/Output.
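
As an illustration of how little such an attribute would need to know, here is a hedged sketch. SketchOutput, SketchInput, MemoryBuffer and PartOfSpeechAttribute are hypothetical names invented for this example, and the serialize()/deserialize() signatures are assumptions rather than the ones from 2125. The VInt helper follows Lucene's variable-length int layout: seven data bits per byte, low-order bits first, high bit set when more bytes follow.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical minimal interfaces standing in for DataOutput/DataInput.
interface SketchOutput { void writeByte(byte b); }
interface SketchInput { byte readByte(); }

final class VInt {
    // Writes 7 bits per byte, low-order first; high bit marks continuation.
    static void write(SketchOutput out, int i) {
        while ((i & ~0x7F) != 0) {
            out.writeByte((byte) ((i & 0x7F) | 0x80));
            i >>>= 7;
        }
        out.writeByte((byte) i);
    }

    // Reads one byte at a time until a byte without the high bit is seen.
    static int read(SketchInput in) {
        byte b = in.readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }
}

// Hypothetical attribute: it only sees the data-level API, no seek()/close().
final class PartOfSpeechAttribute {
    int tagId;

    void serialize(SketchOutput out) { VInt.write(out, tagId); }
    void deserialize(SketchInput in) { tagId = VInt.read(in); }
}

// In-memory buffer used to exercise the attribute; write first, then read.
final class MemoryBuffer implements SketchOutput, SketchInput {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private byte[] snapshot;
    private int pos;

    public void writeByte(byte b) { bytes.write(b); }

    public byte readByte() {
        if (snapshot == null) snapshot = bytes.toByteArray();
        return snapshot[pos++];
    }

    int size() { return bytes.size(); }
}
```

The attribute code would stay the same whether the bytes end up in a file, a byte slice, or a test buffer like the one above.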

I bet that a lot of people who used the payload feature before took a 
ByteArrayOutputStream together with DataOutputStream (which implements Java's 
DataOutput) to populate the payload byte array. With 2125 Lucene will provide 
an API that is similar to use, but more efficient as it removes the byte[] array 
indirection and overhead.

I'm still +1 for this change. Others?




[jira] Issue Comment Edited: (LUCENE-2126) Split up IndexInput and IndexOutput into DataInput and DataOutput

2009-12-12 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789834#action_12789834
 ] 

Michael Busch edited comment on LUCENE-2126 at 12/13/09 1:22 AM:
-

I disagree with you here: IMO, introducing DataInput/Output actually makes the 
API easier for the "normal" user to understand.

I would think that most users don't implement IndexInput/Output extensions, but 
simply use the out-of-the-box Directory implementations, which provide 
IndexInput/Output impls. Also, most users probably don't even call the 
IndexInput/Output APIs directly. 

{quote}
Do nothing and assume that the sort of advanced user who writes a posting
codec won't do something incredibly stupid like call indexInput.close().
{quote}

Writing a posting codec is much more advanced than using 2125's features. 
Ideally, a user who simply wants to store some specific information in the 
posting list, such as a boost, a part-of-speech identifier, another VInt, etc., 
should with 2125 only have to implement a new attribute including the 
serialize()/deserialize() methods. People who want to do that don't need to 
know anything about Lucene's API layer. They only need to know the APIs that 
DataInput/Output provide and will not get confused by methods like seek() or 
close(). For the standard user who only wants to write such an attribute it 
should not matter what Lucene's IO structure looks like, so even if we make 
changes that go in Lucy's direction in the future (IndexInput/Output owning a 
filehandle vs. the need to extend them), the serialize()/deserialize() methods 
of the attribute would still work with DataInput/Output.

I bet that a lot of people who used the payload feature before took a 
ByteArrayOutputStream together with DataOutputStream (which implements Java's 
DataOutput) to populate the payload byte array. With 2125 Lucene will provide 
an API that is similar to use, but more efficient as it removes the byte[] array 
indirection and overhead.

I'm still +1 for this change. Others?


[jira] Updated: (LUCENE-2140) TopTermsScoringBooleanQueryRewrite minscore

2009-12-12 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2140:
--

Attachment: LUCENE-2140.patch

Patch with further access modifier changes and new method name & javadocs.

> TopTermsScoringBooleanQueryRewrite minscore
> ---
>
> Key: LUCENE-2140
> URL: https://issues.apache.org/jira/browse/LUCENE-2140
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: Flex Branch
> Reporter: Robert Muir
> Assignee: Uwe Schindler
> Priority: Minor
> Fix For: Flex Branch
>
> Attachments: LUCENE-2140.patch, LUCENE-2140.patch
>
>
> When using the TopTermsScoringBooleanQueryRewrite (LUCENE-2123), it would be 
> nice if MultiTermQuery could set an attribute specifying the minimum required 
> score once the priority queue (PQ) is filled. 
> This way, FilteredTermsEnums could adjust their behavior based on 
> the minimal score needed to actually be a useful term (i.e. not just pass 
> through the PQ).
> An example is FuzzyTermsEnum: at some point the bottom of the priority queue 
> contains words with an edit distance of 1, and enumerating any further terms 
> is simply a waste of time.
> This is because terms are compared by score, then term text. So in this case 
> FuzzyTermsEnum could simply seek to the exact match, then end.
> This behavior could also be generalized for all n, for a different impl of 
> FuzzyQuery that only looks in the term dictionary for words within 
> an edit distance of n', which is that of the lowest-scoring term in the PQ 
> (such enums adjust their behavior during enumeration of the terms depending 
> upon this attribute).
> Other FilteredTermsEnums could make use of this minimal score in their own 
> way, to drive the most efficient behavior so that they do not waste time 
> enumerating useless terms.




[jira] Commented: (LUCENE-2140) TopTermsScoringBooleanQueryRewrite minscore

2009-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789772#action_12789772
 ] 

Michael McCandless commented on LUCENE-2140:


bq. get/setMaxNonCompetitiveBoost() ?

+1




[jira] Commented: (LUCENE-2140) TopTermsScoringBooleanQueryRewrite minscore

2009-12-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789764#action_12789764
 ] 

Robert Muir commented on LUCENE-2140:
-

... get/setYouMustBeTallerThanThisToRide()





[jira] Commented: (LUCENE-2140) TopTermsScoringBooleanQueryRewrite minscore

2009-12-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789762#action_12789762
 ] 

Uwe Schindler commented on LUCENE-2140:
---

get/setMaxNonCompetitiveBoost() ?




[jira] Commented: (LUCENE-2140) TopTermsScoringBooleanQueryRewrite minscore

2009-12-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789755#action_12789755
 ] 

Uwe Schindler commented on LUCENE-2140:
---

The problem with minCompetitiveBoost/minRequiredBoost is that a term with 
exactly that boost is not competitive; its boost must be slightly larger...




[jira] Commented: (LUCENE-2140) TopTermsScoringBooleanQueryRewrite minscore

2009-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789754#action_12789754
 ] 

Michael McCandless commented on LUCENE-2140:


minCompetitiveBoost?  minRequiredBoost?




[jira] Commented: (LUCENE-2140) TopTermsScoringBooleanQueryRewrite minscore

2009-12-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789751#action_12789751
 ] 

Uwe Schindler commented on LUCENE-2140:
---

Maybe the method should have a better name than min: this is not the 
minimum boost that would go into the PQ; it is the largest boost that would 
not go into the PQ (so the check in the enum should be: accept a term only 
if its boost is > the boost hint).
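
To illustrate the "largest boost that would not go into the PQ" semantics, here is a minimal sketch using only java.util. The class TopBoosts and its methods boostHint() and offer() are hypothetical names for illustration, not the API in the patch, and the sketch ignores the term-text tie-break that the real queue applies.

```java
import java.util.PriorityQueue;

// Hypothetical sketch, not Lucene's API: a bounded top-N collector that
// exposes the largest boost that would NOT enter the queue.
final class TopBoosts {
    private final int maxSize;
    // Min-heap: the head is the lowest boost currently kept.
    private final PriorityQueue<Float> pq = new PriorityQueue<>();

    TopBoosts(int maxSize) {
        if (maxSize < 1) {
            throw new IllegalArgumentException("maxSize must be >= 1");
        }
        this.maxSize = maxSize;
    }

    // Largest boost that would be rejected; negative infinity until the queue is full.
    float boostHint() {
        return pq.size() < maxSize ? Float.NEGATIVE_INFINITY : pq.peek();
    }

    // The check an enum would apply: accept a term only if boost > hint.
    boolean offer(float boost) {
        if (boost <= boostHint()) {
            return false;
        }
        pq.add(boost);
        if (pq.size() > maxSize) {
            pq.poll(); // drop the previous bottom entry
        }
        return true;
    }
}
```

With this reading, a term whose boost equals the hint is rejected, which is why "min" is a misleading name for the value.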





[jira] Updated: (LUCENE-2140) TopTermsScoringBooleanQueryRewrite minscore

2009-12-12 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2140:
--

Attachment: LUCENE-2140.patch

Here is the patch.

Robert: Is this what you need? Any better method names?





[jira] Assigned: (LUCENE-2140) TopTermsScoringBooleanQueryRewrite minscore

2009-12-12 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler reassigned LUCENE-2140:
-

Assignee: Uwe Schindler





[jira] Updated: (LUCENE-2122) Use JUnit4 capabilites for more thorough Locale testing for classes deriving from LocalizedTestCase

2009-12-12 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2122:


Assignee: (was: Robert Muir)

I am unassigning in case someone else can figure this one out; I am at my 
wits' end here :)
Perhaps it's just something weird about my environment.

> Use JUnit4 capabilites for more thorough Locale testing for classes deriving 
> from LocalizedTestCase
> ---
>
> Key: LUCENE-2122
> URL: https://issues.apache.org/jira/browse/LUCENE-2122
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Affects Versions: 3.1
>Reporter: Erick Erickson
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2122-r2.patch, LUCENE-2122-r3.patch, 
> LUCENE-2122-r4.patch, LUCENE-2122.patch, LUCENE-2122.patch
>
>
> Use the @Parameterized capabilities of Junit4 to allow more extensive testing 
> of Locales.




[jira] Updated: (LUCENE-2122) Use JUnit4 capabilites for more thorough Locale testing for classes deriving from LocalizedTestCase

2009-12-12 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2122:


Attachment: LUCENE-2122.patch

Hi Erick, I spent some time with this patch today and here is what happened:

1. I tried to simplify LocalizedTestCase so that it works just like the old 
one, except that it uses the JUnit4 parameterized facility.
2. I wrote some bad tests and ensured things worked well, such as error 
messages, the default locale being run first, etc.
3. I had everything good to go when I got random failures again, this time from 
'ant clean test', which I ran about 3 times (pass, fail, pass).
(Sorry, I should have done something to capture each test log, but I did not.)

Because the only real change here is the use of the parameterized facility (the 
logic is the same), it makes me think that we should stick with .runBare() for 
the time being, because there is something strange going on here and I'm not 
even trying to break it.

Attached is the modified version of your patch.
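
For readers unfamiliar with the runBare()-style approach mentioned above: it amounts to looping over locales inside a single test run, with the default locale first and the original default restored afterwards. The following is a minimal stdlib-only sketch under that assumption. The helper LocaleLoop and its method failingLocales() are hypothetical names; this is neither the code in the attached patch nor LocalizedTestCase itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class LocaleLoop {
    // Runs the body under the default locale first, then under every other
    // available locale; returns the locales for which the body threw.
    static List<Locale> failingLocales(Runnable body) {
        Locale original = Locale.getDefault();
        List<Locale> toTry = new ArrayList<>();
        toTry.add(original);
        for (Locale l : Locale.getAvailableLocales()) {
            if (!l.equals(original)) {
                toTry.add(l);
            }
        }
        List<Locale> failed = new ArrayList<>();
        try {
            for (Locale l : toTry) {
                Locale.setDefault(l);
                try {
                    body.run();
                } catch (RuntimeException | AssertionError e) {
                    failed.add(l); // remember which locale broke the test body
                }
            }
        } finally {
            Locale.setDefault(original); // always restore the JVM-wide default
        }
        return failed;
    }
}
```

Because Locale.setDefault() changes JVM-wide state, a loop like this is only safe when tests do not run concurrently in the same JVM.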





Re: nightly build deploy to Maven repositories

2009-12-12 Thread Sanne Grinovero
I would be happy with 3.0.1-SNAPSHOT too, that will also fix my problem.
Will I have to wait for next release before I can share my patches?

Best Regards,
Sanne Grinovero

2009/12/3 Sanne Grinovero :
> Hello,
> I'm needing to depend on some recently committed bugfix from Lucene's
> 2.9 branch in other OSS projects, using Maven2 for dependency
> management.
>
> Are there snapshots uploaded somewhere regularly? Could Hudson do that?
> Looking into Hudson it appears that it regularly builds trunk;
> wouldn't it be a good idea to have him also verify the 2.9 branch
> until it's actively updated?
>
> Regards,
> Sanne
>




[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

2009-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789716#action_12789716
 ] 

Michael McCandless commented on LUCENE-2026:


bq. Zoie is a completely user-land solution which modifies no IW/IR internals 
and yet achieves millisecond index-to-query-visibility turnaround while keeping 
speedy indexing and query performance. It just keeps the RAMDir outside 
encapsulated in an object (an IndexingSystem) which has IndexReaders built off 
of both the RAMDir and the FSDir, and hides the implementation details (in fact 
the IW itself) from the user.

Right, one can always not use NRT and build their own layers on top.

But, Zoie has *a lot* of code to accomplish this -- the devil really is
in the details to "simply write first to a RAMDir".  This is why I'd
like Earwin to look @ Zoie and clarify his proposed approach, in
contrast...

Actually, here's a question: how quickly can Zoie turn around a
commit()?  Seems like it must take more time than Lucene, since it does
extra stuff (flush RAM buffers to disk, materialize deletes) before
even calling IW.commit.

At the end of the day, any NRT system has to trade safety for
performance (bypass the sync call in the NRT reader).

bq. The API for this kind of thing doesn't have to be tightly coupled, and I 
would agree with you that it shouldn't be.

I don't consider NRT today to be a tight coupling (eg, the pending
refactoring of IW would nicely separate it out).  If we implement the
IR that searches DW's RAM buffer, then I'd agree ;)


> Refactoring of IndexWriter
> --
>
> Key: LUCENE-2026
> URL: https://issues.apache.org/jira/browse/LUCENE-2026
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> I've been thinking for a while about refactoring the IndexWriter into
> two main components.
> One could be called a SegmentWriter and as the
> name says its job would be to write one particular index segment. The
> default one just as today will provide methods to add documents and
> flushes when its buffer is full.
> Other SegmentWriter implementations would do things like e.g. appending or
> copying external segments [what addIndexes*() currently does].
> The second component's job would it be to manage writing the segments
> file and merging/deleting segments. It would know about
> DeletionPolicy, MergePolicy and MergeScheduler. Ideally it would
> provide hooks that allow users to manage external data structures and
> keep them in sync with Lucene's data during segment merges.
> API wise there are things we have to figure out, such as where the
> updateDocument() method would fit in, because its deletion part
> affects all segments, whereas the new document is only being added to
> the new segment.
> Of course these should be lower level APIs for things like parallel
> indexing and related use cases. That's why we should still provide
> easy to use APIs like today for people who don't need to care about
> per-segment ops during indexing. So the current IndexWriter could
> probably keeps most of its APIs and delegate to the new classes.




[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

2009-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789714#action_12789714
 ] 

Michael McCandless commented on LUCENE-2026:


{quote}
> I say it's better to sacrifice write guarantee.

I don't grok why sync is the default, especially given how sketchy hardware 
drivers are about obeying fsync:

{panel}
But, beware: some hardware devices may in fact cache writes even during 
fsync, and return before the bits are actually on stable storage, to give the 
appearance of faster performance.
{panel}
{quote}

It's unclear how often this scare-warning is true in practice (scare
warnings tend to spread very easily without concrete data); it's in
the javadocs for completeness' sake.  I expect (though have no data to
back this up...) that most OS/IO systems "out there" do properly
implement fsync.

{quote}
IMO, it should have been an option which defaults to false, to be enabled only 
by users who have the expertise to ensure that fsync() is actually doing what 
it advertises. But what's done is done (and Lucy will probably just do 
something different.)
{quote}

I think that's a poor default (trades safety for performance), unless
Lucy eg uses a transaction log so you can concretely bound what's lost
on crash/power loss.  Or, if you go back to autocommitting I guess...

If we did this in Lucene, you could have unbounded corruption.  It's not
just the last few minutes of updates...

So, I don't think we should even offer the option to turn it off.  You
can easily subclass your FSDir impl and make sync() a no-op if you
really want to...
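
As a rough illustration of what such a subclass looks like, here is a sketch that uses a stand-in base class instead of the real FSDirectory, so that it compiles on its own. StandInDirectory, NoSyncDirectory and the syncCalls counter are hypothetical; the actual sync() signature depends on the Lucene version in use.

```java
import java.io.IOException;

// Stand-in for a directory implementation; NOT Lucene's FSDirectory.
class StandInDirectory {
    int syncCalls = 0;

    // Pretends to force the named file to stable storage.
    void sync(String fileName) throws IOException {
        syncCalls++;
    }
}

// The override described above: same directory, but sync() does nothing.
final class NoSyncDirectory extends StandInDirectory {
    @Override
    void sync(String fileName) {
        // Intentionally empty: trades crash safety for speed.
    }
}
```

Anyone doing this against the real class accepts the corruption risk described above; the point is only that no new option is needed to get the behavior.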

{quote}
With regard to Lucene NRT, though, turning sync() off would really help. If and 
when some sort of settings class comes about, an enableSync(boolean enabled) 
method seems like it would come in handy.
{quote}

You don't need to turn off sync for NRT -- that's the whole point.  It
gives you a reader without syncing the files.  Really, this is your
safety tradeoff -- it means you can commit less frequently, since the
NRT reader can search the latest updates.  But, your app has
complete control over how it wants to trade safety for performance.






[jira] Commented: (LUCENE-2026) Refactoring of IndexWriter

2009-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789708#action_12789708
 ] 

Michael McCandless commented on LUCENE-2026:


{quote}
bq. Until you need to spillover to disk because your RAM buffer is full?

No, buffer is there only to decouple indexing from writing. Can be spilt over 
asynchronously without waiting for it to be filled up.
{quote}

But this is where things start to get complex... the devil is in the
details here.  How do you carry over your deletes?  This spillover
will take time -- do you block all indexing while that's happening
(not great)?  Do you do it gradually (start spillover when half full,
but still accept indexing)?  Do you throttle things if index rate
exceeds flush rate?  How do you recover on exception?

NRT today lets the OS's write cache decide how to use RAM to speed up
writing of these small files, which keeps things a lot simpler for us.
I don't see why we should add complexity to Lucene to replicate what
the OS is doing for us (NOTE: I don't really trust the OS in the
reverse case... I do think Lucene should read into RAM the data
structures that are important).

bq. You decide to sacrifice new record (in)visibility. No choice, but to hack 
into IW to allow readers see its hot, fresh innards.

bq. Now you don't have to hack into IW and write specialized readers.

Probably we'll just have to disagree here... NRT isn't a hack ;)

IW is already hanging onto completely normal segments.  Ie, the index
has been updated with these segments, just not yet published so
outside readers can see it.  All NRT does is let a reader see this
private view.

The readers that an NRT reader exposes are normal SegmentReaders --
it's just that rather than consulting a segments_N on disk to get the
segment metadata, they are pulled from IW's uncommitted in-memory
SegmentInfos instance.

Yes we've talked about the "hot innards" solution -- an IndexReader
impl that can directly search DW's ram buffer -- but that doesn't look
necessary today, because performance of NRT is good with the simple
solution we have now.

NRT reader also gains performance by carrying over deletes in RAM.  We
should eventually do the same thing with norms & field cache.  No
reason to write to disk, then right away read again.

{quote}
* You index docs, nobody sees them, nor deletions.
* You call commit(), the docs/deletes are written down to memory (NRT 
case)/disk (non-NRT case). Right after calling commit() every newly reopened 
Reader is guaranteed to see your docs/deletes.
* Background thread does write-to-disk+sync(NRT case)/just sync (non-NRT case), 
and fires up the Future returned from commit(). At this point all data is 
guaranteed to be written and braced for a crash, ram cache or not, OS/raid 
controller cache or not.
{quote}

But this is not a commit, if docs/deletes are written down into RAM?
Ie, commit could return, then the machine could crash, and you've lost
changes?  Commit should go through to stable storage before returning?
Maybe I'm just missing the big picture of what you're proposing
here...

Also, you can build all this out on top of Lucene today?  Zoie is a
proof point of this.  (Actually: how does your proposal differ from
Zoie?  Maybe that'd help shed light...).

bq. I say it's better to sacrifice write guarantee. In the rare case the 
process/machine crashes, you can reindex last few minutes' worth of docs. 

It is not that simple -- if you skip the fsync, and OS crashes/you
lose power, your index can easily become corrupt.  The resulting
CheckIndex -fix can easily need to remove large segments.

The OS's write cache makes no guarantees on the order in which the
files you've written find their way to disk.

Another option (we've discussed this) would be a journal file approach
(ie a transaction log, like most DBs use).  You only have one file to
fsync, and you replay to recover.  But that'd be a big change for
Lucene, would add complexity, and can be accomplished outside of
Lucene if an app really wants to...
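
To make the "one file to fsync, replay to recover" idea concrete, here is a minimal sketch of an append-only journal built on java.nio only. The Journal class, its length-prefixed record format and the replay() method are assumptions made for illustration; nothing like this exists in Lucene, and a real log would also need checksums and truncation.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only journal: one file to fsync, replayed on recovery.
final class Journal implements AutoCloseable {
    private final FileChannel channel;

    Journal(Path file) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    // Appends one length-prefixed record and forces it to stable storage.
    void append(String record) throws IOException {
        byte[] payload = record.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length).put(payload).flip();
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        channel.force(true); // the single fsync per logged operation
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    // Replays every complete record; a torn record at the tail is ignored.
    static List<String> replay(Path file) throws IOException {
        List<String> records = new ArrayList<>();
        if (!Files.exists(file)) {
            return records;
        }
        ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(file));
        while (buf.remaining() >= 4) {
            int len = buf.getInt();
            if (len < 0 || len > buf.remaining()) {
                break; // incomplete write, e.g. from a crash mid-append
            }
            byte[] payload = new byte[len];
            buf.get(payload);
            records.add(new String(payload, StandardCharsets.UTF_8));
        }
        return records;
    }
}
```

An application layered on top of Lucene could log each add or delete this way and, after a crash, re-apply whatever replay() returns on top of the last good commit.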

Let me try turning this around: in your componentization of
SegmentReader, why does it matter who's tracking which components are
needed to make up a given SR?  In the IndexReader.open case, it's a
SegmentInfos instance (obtained by loading segments_N file from disk).
In the NRT case, it's also a SegmentInfos instance (the one IW is
privately keeping track of and only publishing on commit).  At the
component level, creating the SegmentReader should be no different?



Re: Build failed in Hudson: Lucene-trunk #1025

2009-12-12 Thread Michael McCandless
I think this is a test error.

It simulates disk full, with multiple threads indexing, confirming no
deadlock occurs.

Then it closes the writer, suppressing any IOException (eg because a
disk full was hit trying to write the new segment), then tries to
close the MRD.  But if we hit an IOException during IW.close, files
could in fact still be left open.

I think that before close we should call dir.setMaxSizeInBytes(0), and
then not suppress the IOException in IW.close. I'll commit...

Mike

On Sat, Dec 12, 2009 at 3:20 AM, Uwe Schindler  wrote:
> This one failed in the last test-tag run with clover:
>
>    [junit] Testcase: testImmediateDiskFullWithThreads(org.apache.lucene.index.TestIndexWriter): Caused an ERROR
>    [junit] MockRAMDirectory: cannot close: there are still open files: {_6.cfs=1, _5.cfs=1, _4.cfs=1, _7.cfs=1}
>    [junit] java.lang.RuntimeException: MockRAMDirectory: cannot close: there are still open files: {_6.cfs=1, _5.cfs=1, _4.cfs=1, _7.cfs=1}
>    [junit]     at org.apache.lucene.store.MockRAMDirectory.close(MockRAMDirectory.java:273)
>    [junit]     at org.apache.lucene.index.TestIndexWriter.testImmediateDiskFullWithThreads(TestIndexWriter.java:2374)
>    [junit]     at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:208)
>
> The run before went ok (core w + wo clover, test-tag wo clover).
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de

RE: Build failed in Hudson: Lucene-trunk #1025

2009-12-12 Thread Uwe Schindler
This one failed in the last test-tag run with clover:

[junit] Testcase: testImmediateDiskFullWithThreads(org.apache.lucene.index.TestIndexWriter): Caused an ERROR
[junit] MockRAMDirectory: cannot close: there are still open files: {_6.cfs=1, _5.cfs=1, _4.cfs=1, _7.cfs=1}
[junit] java.lang.RuntimeException: MockRAMDirectory: cannot close: there are still open files: {_6.cfs=1, _5.cfs=1, _4.cfs=1, _7.cfs=1}
[junit]     at org.apache.lucene.store.MockRAMDirectory.close(MockRAMDirectory.java:273)
[junit]     at org.apache.lucene.index.TestIndexWriter.testImmediateDiskFullWithThreads(TestIndexWriter.java:2374)
[junit]     at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:208)

The run before went ok (core w + wo clover, test-tag wo clover).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Apache Hudson Server [mailto:hud...@hudson.zones.apache.org]
> Sent: Saturday, December 12, 2009 4:42 AM
> To: java-dev@lucene.apache.org
> Subject: Build failed in Hudson: Lucene-trunk #1025
> 
> See 
> 
> Changes:
> 
> [mikemccand] LUCENE-2135: forcefully evict IndexReader from FieldCache
> when it's closed
> 
> [kalle] LUCENE-2144
> No testing of features outside of the documented API
> 
> [uschindler] LUCENE-2123: Remove the deprec ScoreTerm in FuzzyQuery,
> javadocs, better test for the PQ overflow
> 
> [uschindler] fix unchecked warning
> 
> [mikemccand] LUCENE-2142: don't check if term count exceeds doc count in
> getStringIndex
> 
> --
> [...truncated 27971 lines...]
> [junit] -  ---
> [junit] Testsuite: org.apache.lucene.search.TestSetNorm
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.277 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestSimilarity
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.317 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestSimpleExplanations
> [junit] Tests run: 53, Failures: 0, Errors: 0, Time elapsed: 23.97 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestSimpleExplanationsOfNonMatches
> [junit] Tests run: 53, Failures: 0, Errors: 0, Time elapsed: 3.409 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestSloppyPhraseQuery
> [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 6.937 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestSort
> [junit] Tests run: 22, Failures: 0, Errors: 0, Time elapsed: 10.618 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestSpanQueryFilter
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.21 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestTermRangeFilter
> [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 7.397 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestTermRangeQuery
> [junit] Tests run: 9, Failures: 0, Errors: 0, Time elapsed: 1.439 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestTermScorer
> [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 1.112 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestTermVectors
> [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 5.282 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestThreadSafe
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 14.606 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestTimeLimitingCollector
> [junit] Tests run: 6, Failures: 0, Errors: 0, Time elapsed: 8.57 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestTopDocsCollector
> [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.945 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestTopScoreDocCollector
> [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.947 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.TestWildcard
> [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 1.201 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.function.TestCustomScoreQuery
> [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 17.04 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.function.TestDocValues
> [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.421 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.function.TestFieldScoreQuery
> [junit] Tests run: 12, Failures: 0, Errors: 0, Time elapsed: 3.121 sec
> [junit]
> [junit] Testsuite: org.apache.lucene.search.function.TestOrdValues
> [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 1.939 s