[jira] Assigned: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-2108:
---

Assignee: Simon Willnauer  (was: Michael McCandless)

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Simon Willnauer
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.
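
A minimal sketch of the proposed close(), assuming SpellChecker keeps its
searcher in a private IndexSearcher field named "searcher" (names here are
illustrative, not necessarily those of the attached patch):

{code}
public void close() throws IOException {
  if (searcher != null) {
    // Closing the searcher also releases the underlying IndexReader,
    // since SpellChecker opened it itself from the spell-index Directory.
    searcher.close();
    searcher = null; // callers must call setSpellIndex(...) before reuse
  }
}
{code}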

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2065) Java 5 port phase II

2009-12-03 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-2065:


Attachment: (was: LUCENE-2065.patch)

> Java 5 port phase II 
> -
>
> Key: LUCENE-2065
> URL: https://issues.apache.org/jira/browse/LUCENE-2065
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.1
> Environment: Java 5 
>Reporter: Kay Kay
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2065.patch, LUCENE-2065.patch
>
>
> LUCENE-1257 addressed the public API changes (mainly generics) and other 
> j.u.c. package changes related to the API.  Those changes are frozen and 
> closed for 3.0.  This is a placeholder JIRA for the 3.0+ versions to address 
> the pending changes (tests for generics etc.) and any other internal API 
> changes as necessary. 
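
As an illustration of the kind of test change involved (a made-up snippet, not
taken from the patch), generifying replaces raw collection types and casts:

{code}
import java.util.ArrayList;
import java.util.List;

class GenerifyExample {
  String firstTerm() {
    // Before (raw types; unchecked warnings under Java 5):
    //   List terms = new ArrayList();
    //   terms.add("lucene");
    //   return (String) terms.get(0);

    // After (generified; no cast, no unchecked warnings):
    List<String> terms = new ArrayList<String>();
    terms.add("lucene");
    return terms.get(0);
  }
}
{code}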

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2065) Java 5 port phase II

2009-12-03 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-2065:


Attachment: LUCENE-2065.patch

Revised patch, in sync with trunk, that addresses more files in src/test.

> Java 5 port phase II 
> -
>
> Key: LUCENE-2065
> URL: https://issues.apache.org/jira/browse/LUCENE-2065
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.1
> Environment: Java 5 
>Reporter: Kay Kay
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2065.patch, LUCENE-2065.patch
>
>
> LUCENE-1257 addressed the public API changes (mainly generics) and other 
> j.u.c. package changes related to the API.  Those changes are frozen and 
> closed for 3.0.  This is a placeholder JIRA for the 3.0+ versions to address 
> the pending changes (tests for generics etc.) and any other internal API 
> changes as necessary. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2111) Wrapup flexible indexing

2009-12-03 Thread Kay Kay (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785715#action_12785715
 ] 

Kay Kay commented on LUCENE-2111:
-

What would the branch name for flex indexing be?

> Wrapup flexible indexing
> 
>
> Key: LUCENE-2111
> URL: https://issues.apache.org/jira/browse/LUCENE-2111
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> Spinoff from LUCENE-1458.
> The flex branch is in fairly good shape -- all tests pass, initial search 
> performance testing looks good, it survived several visits from the Unicode 
> policeman ;)
> But it still has a number of nocommits, could use some more scrutiny 
> especially on the "emulate old API on flex index" and vice versa code paths, 
> and still needs some more performance testing.  I'll do these under this 
> issue, and we should open separate issues for other self-contained fixes.
> The end is in sight!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-03 Thread Erick Erickson
Mike:

I should be able to create a new 2037 patch pretty easily if you
want to apply 2065 first. Let me know

Erick

On Thu, Dec 3, 2009 at 9:05 PM, Kay Kay  wrote:

> Mike -
> I have attached another patch to LUCENE-2065, in sync with the trunk now.
>
>
>
>
> Erick Erickson wrote:
>
>> That's up to Mike, whichever way he finds easiest, I'll deal.
>>
>> Erick
>>
>> On Thu, Dec 3, 2009 at 8:43 PM, Kay Kay <kaykay.uni...@gmail.com> wrote:
>>
>>> I created LUCENE-2065 while working on 1257, the original
>>> generics related ticket, and since we were running out of time
>>> for 3.0, I guess we could not get src/test converted in.
>>>
>>> In any case, if you were committing this one (2037) to trunk,
>>> maybe I can wait before creating the patch again.
>>>
>>> Erick Erickson wrote:
>>>
>>>> I didn't realize 2065 had already been down this path, thought
>>>> you were volunteering to change all the code starting from
>>>> scratch. Your approach sounds like a fine plan.
>>>>
>>>> Note that I'm not entirely sure that I cleaned up *everything*, but we
>>>> need to get to a known state before tackling the rest, so I'll wait for
>>>> these two patches to be applied before looking back at it...
>>>>
>>>> Not to mention the Localized test thing.
>>>>
>>>> Erick
>>>>
>>>> On Thu, Dec 3, 2009 at 5:57 PM, Michael McCandless
>>>> <luc...@mikemccandless.com> wrote:
>>>>
>>>>> On Thu, Dec 3, 2009 at 5:48 PM, Erick Erickson
>>>>> <erickerick...@gmail.com> wrote:
>>>>> > I generified the searches/function files in patch 2037. I don't
>>>>> > really think there's a conflict, just commit my patch and have at
>>>>> > generifying the rest.
>>>>>
>>>>> OK so then we'll start with 2037, then take 2065's patch, hopefully
>>>>> updated to current trunk, but minus search/function sources.
>>>>>
>>>>> > I know, I know. I did two things at once. So sue me. Honest,
>>>>> > I'll try not to do this very often ...
>>>>>
>>>>> In fact I prefer this.  I used to think we shouldn't do that but I
>>>>> flip-flopped and now think in practice you just have to clean code
>>>>> while you're there, otherwise it won't get cleaned.
>>>>>
>>>>> > Mike:
>>>>> > You really want to generify the whole shootin' match or do you want
>>>>> > to partition them? I'll be happy to take a set of them. Or would
>>>>> > that make things too complicated to apply?
>>>>>
>>>>> 2065 already has done a lot here (adding generics to the tests)... I
>>>>> think we start from that and take it from there?
>>>>>
>>>>> Mike
>>>>>
>>>>> -
>>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


Re: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-03 Thread Kay Kay

Mike -
I have attached another patch to LUCENE-2065, in sync with the trunk now.




Erick Erickson wrote:

> That's up to Mike, whichever way he finds easiest, I'll deal.
>
> Erick
>
> On Thu, Dec 3, 2009 at 8:43 PM, Kay Kay <kaykay.uni...@gmail.com> wrote:
>
>> I created LUCENE-2065 while working on 1257, the original
>> generics related ticket, and since we were running out of time
>> for 3.0, I guess we could not get src/test converted in.
>>
>> In any case, if you were committing this one (2037) to trunk,
>> maybe I can wait before creating the patch again.
>>
>> Erick Erickson wrote:
>>
>>> I didn't realize 2065 had already been down this path, thought
>>> you were volunteering to change all the code starting from
>>> scratch. Your approach sounds like a fine plan.
>>>
>>> Note that I'm not entirely sure that I cleaned up *everything*, but we
>>> need to get to a known state before tackling the rest, so I'll wait for
>>> these two patches to be applied before looking back at it...
>>>
>>> Not to mention the Localized test thing.
>>>
>>> Erick
>>>
>>> On Thu, Dec 3, 2009 at 5:57 PM, Michael McCandless
>>> <luc...@mikemccandless.com> wrote:
>>>
>>>> On Thu, Dec 3, 2009 at 5:48 PM, Erick Erickson
>>>> <erickerick...@gmail.com> wrote:
>>>> > I generified the searches/function files in patch 2037. I don't
>>>> > really think there's a conflict, just commit my patch and have at
>>>> > generifying the rest.
>>>>
>>>> OK so then we'll start with 2037, then take 2065's patch, hopefully
>>>> updated to current trunk, but minus search/function sources.
>>>>
>>>> > I know, I know. I did two things at once. So sue me. Honest,
>>>> > I'll try not to do this very often ...
>>>>
>>>> In fact I prefer this.  I used to think we shouldn't do that but I
>>>> flip-flopped and now think in practice you just have to clean code
>>>> while you're there, otherwise it won't get cleaned.
>>>>
>>>> > Mike:
>>>> > You really want to generify the whole shootin' match or do you want
>>>> > to partition them? I'll be happy to take a set of them. Or would
>>>> > that make things too complicated to apply?
>>>>
>>>> 2065 already has done a lot here (adding generics to the tests)... I
>>>> think we start from that and take it from there?
>>>>
>>>> Mike
>>>>
>>>> -
>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org






-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2065) Java 5 port phase II

2009-12-03 Thread Kay Kay (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Kay updated LUCENE-2065:


Attachment: LUCENE-2065.patch

Patch revised to be in sync with the trunk. 



> Java 5 port phase II 
> -
>
> Key: LUCENE-2065
> URL: https://issues.apache.org/jira/browse/LUCENE-2065
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.1
> Environment: Java 5 
>Reporter: Kay Kay
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2065.patch, LUCENE-2065.patch
>
>
> LUCENE-1257 addressed the public API changes (mainly generics) and other 
> j.u.c. package changes related to the API.  Those changes are frozen and 
> closed for 3.0.  This is a placeholder JIRA for the 3.0+ versions to address 
> the pending changes (tests for generics etc.) and any other internal API 
> changes as necessary. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-03 Thread Erick Erickson
That's up to Mike, whichever way he finds easiest, I'll deal.

Erick

On Thu, Dec 3, 2009 at 8:43 PM, Kay Kay  wrote:

> I created LUCENE-2065 while working on 1257, the original generics related
> ticket, and since we were running out of time for 3.0, I guess we could
> not get src/test converted in.
>
> In any case, if you were committing this one (2037) to trunk, maybe I
> can wait before creating the patch again.
>
>
>
>
> Erick Erickson wrote:
>
>> I didn't realize 2065 had already been down this path, thought
>> you were volunteering to change all the code starting from
>> scratch. Your approach sounds like a fine plan.
>>
>> Note that I'm not entirely sure that I cleaned up *everything*, but we
>> need to get to a known state before tackling the rest, so I'll wait for
>> these two patches to be applied before looking back at it...
>>
>> Not to mention the Localized test thing.
>>
>> Erick
>>
>>
>> On Thu, Dec 3, 2009 at 5:57 PM, Michael McCandless
>> <luc...@mikemccandless.com> wrote:
>>
>>On Thu, Dec 3, 2009 at 5:48 PM, Erick Erickson
>><erickerick...@gmail.com> wrote:
>>> I generified the searches/function files in patch 2037. I don't
>>really think
>>> there's a conflict, just commit my patch and have at generifying
>>the rest.
>>
>>OK so then we'll start with 2037, then take 2065's patch, hopefully
>>updated to current trunk, but minus search/function sources.
>>
>>> I know, I know. I did two things at once. So sue me. Honest,
>>I'll try not to
>>> do this very often ...
>>
>>In fact I prefer this.  I used to think we shouldn't do that but I
>>flip-flopped and now think in practice you just have to clean code
>>while you're there, otherwise it won't get cleaned.
>>
>>> Mike:
>>> You really want to generify the whole shootin' match or do you want
>>> to partition them? I'll be happy to take a set of them. Or would that
>>> make things too complicated to apply?
>>
>>2065 already has done a lot here (adding generics to the tests)... I
>>think we start from that and take it from there?
>>
>>Mike
>>
>>-
>>To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>
>>
>>For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>>
>>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


Re: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-03 Thread Kay Kay
I created LUCENE-2065 while working on 1257, the original generics 
related ticket, and since we were running out of time for 3.0, I 
guess we could not get src/test converted in.


In any case, if you were committing this one (2037) to trunk, maybe I 
can wait before creating the patch again. 






Erick Erickson wrote:

> I didn't realize 2065 had already been down this path, thought
> you were volunteering to change all the code starting from
> scratch. Your approach sounds like a fine plan.
>
> Note that I'm not entirely sure that I cleaned up *everything*, but we
> need to get to a known state before tackling the rest, so I'll wait for
> these two patches to be applied before looking back at it...
>
> Not to mention the Localized test thing.
>
> Erick
>
> On Thu, Dec 3, 2009 at 5:57 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>
>> On Thu, Dec 3, 2009 at 5:48 PM, Erick Erickson
>> <erickerick...@gmail.com> wrote:
>> > I generified the searches/function files in patch 2037. I don't
>> > really think there's a conflict, just commit my patch and have at
>> > generifying the rest.
>>
>> OK so then we'll start with 2037, then take 2065's patch, hopefully
>> updated to current trunk, but minus search/function sources.
>>
>> > I know, I know. I did two things at once. So sue me. Honest,
>> > I'll try not to do this very often ...
>>
>> In fact I prefer this.  I used to think we shouldn't do that but I
>> flip-flopped and now think in practice you just have to clean code
>> while you're there, otherwise it won't get cleaned.
>>
>> > Mike:
>> > You really want to generify the whole shootin' match or do you want
>> > to partition them? I'll be happy to take a set of them. Or would
>> > that make things too complicated to apply?
>>
>> 2065 already has done a lot here (adding generics to the tests)... I
>> think we start from that and take it from there?
>>
>> Mike
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org






-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene

2009-12-03 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785690#action_12785690
 ] 

Otis Gospodnetic commented on LUCENE-2091:
--

Joaquin - could you please explain what you mean by "Saturate the effect of 
frequency with k1"?  Thanks.
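
For reference, the standard Okapi BM25 term weight behind that phrase (the
formula below is the textbook form, not a quote from the patch):

{code}
weight(t, d) = IDF(t) * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
{code}

As tf grows, the fraction approaches k1 + 1 instead of growing without bound,
so k1 caps ("saturates") how much repeated occurrences of a term can add to
the score: a small k1 saturates quickly, a large k1 behaves more like raw tf.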

> Add BM25 Scoring to Lucene
> --
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Yuval Feinstein
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of 
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed 
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime 
> somewhat.
> I would like to contribute the code to Lucene under contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-03 Thread Erick Erickson
I didn't realize 2065 had already been down this path, thought
you were volunteering to change all the code starting from
scratch. Your approach sounds like a fine plan.

Note that I'm not entirely sure that I cleaned up *everything*, but we
need to get to a known state before tackling the rest, so I'll wait for
these two patches to be applied before looking back at it...

Not to mention the Localized test thing.

Erick


On Thu, Dec 3, 2009 at 5:57 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Thu, Dec 3, 2009 at 5:48 PM, Erick Erickson 
> wrote:
> > I generified the searches/function files in patch 2037. I don't really
> think
> > there's a conflict, just commit my patch and have at generifying the
> rest.
>
> OK so then we'll start with 2037, then take 2065's patch, hopefully
> updated to current trunk, but minus search/function sources.
>
> > I know, I know. I did two things at once. So sue me. Honest, I'll try not
> to
> > do this very often ...
>
> In fact I prefer this.  I used to think we shouldn't do that but I
> flip-flopped and now think in practice you just have to clean code
> while you're there, otherwise it won't get cleaned.
>
> > Mike:
> > You really want to generify the whole shootin' match or do you
> want
> > to partition them? I'll be happy to take a set of them. Or would that
> make
> > things too complicated to apply?
>
> 2065 already has done a lot here (adding generics to the tests)... I
> think we start from that and take it from there?
>
> Mike
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


[jira] Commented: (LUCENE-2111) Wrapup flexible indexing

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785656#action_12785656
 ] 

Michael McCandless commented on LUCENE-2111:


bq. The Generics policeman will visit them and will hopefully help fix them!

Uh-oh... I sense heavy committing in flex branch's future!

> Wrapup flexible indexing
> 
>
> Key: LUCENE-2111
> URL: https://issues.apache.org/jira/browse/LUCENE-2111
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> Spinoff from LUCENE-1458.
> The flex branch is in fairly good shape -- all tests pass, initial search 
> performance testing looks good, it survived several visits from the Unicode 
> policeman ;)
> But it still has a number of nocommits, could use some more scrutiny 
> especially on the "emulate old API on flex index" and vice versa code paths, 
> and still needs some more performance testing.  I'll do these under this 
> issue, and we should open separate issues for other self-contained fixes.
> The end is in sight!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2111) Wrapup flexible indexing

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785648#action_12785648
 ] 

Uwe Schindler commented on LUCENE-2111:
---

Also the current flex branch produces lots of unchecked warnings... The 
Generics policeman will visit them and will hopefully help fix them!

> Wrapup flexible indexing
> 
>
> Key: LUCENE-2111
> URL: https://issues.apache.org/jira/browse/LUCENE-2111
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> Spinoff from LUCENE-1458.
> The flex branch is in fairly good shape -- all tests pass, initial search 
> performance testing looks good, it survived several visits from the Unicode 
> policeman ;)
> But it still has a number of nocommits, could use some more scrutiny 
> especially on the "emulate old API on flex index" and vice versa code paths, 
> and still needs some more performance testing.  I'll do these under this 
> issue, and we should open separate issues for other self-contained fixes.
> The end is in sight!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785645#action_12785645
 ] 

Uwe Schindler commented on LUCENE-2110:
---

I will work on this tomorrow and provide a patch. I will also update the patch 
in LUCENE-1606 to move the initial seek out of the ctor (it's easy, see below).

The setEnum method should be renamed to something like setInitialTermRef(), so 
the default impl of next() will seek to the correct term rather than seeking by 
default (iterating all terms of the field).
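
A sketch of the resulting iterator-style contract, using the flex-branch names
discussed here (TermsEnum.next() returning a TermRef, or null when exhausted;
treat the exact signatures as assumptions):

{code}
void consumeAll(TermsEnum termsEnum) throws IOException {
  // The enum is unpositioned after construction, like java.util.Iterator:
  // next() must be called before the current term may be read.
  TermRef term;
  while ((term = termsEnum.next()) != null) {
    // process the current term here
  }
}
{code}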

> Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
> next() must be always called first. Remove empty()
> --
>
> Key: LUCENE-2110
> URL: https://issues.apache.org/jira/browse/LUCENE-2110
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
> Fix For: Flex Branch
>
>
> FilteredTermsEnum is confusing as it is initially positioned to the first 
> term. It should instead work like an uninitialized TermsEnum for a field 
> before the first call to next() or seek().
> Also document that not all FilteredTermsEnums may implement seek(), as e.g. NRQ 
> or Automaton are not able to support this. Seeking is also not needed for MTQ 
> at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2106) Benchmark does not close its Reader when OpenReader/CloseReader are not used

2009-12-03 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2106:


Fix Version/s: 3.1
   3.0.1

> Benchmark does not close its Reader when OpenReader/CloseReader are not used
> 
>
> Key: LUCENE-2106
> URL: https://issues.apache.org/jira/browse/LUCENE-2106
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/benchmark
>Reporter: Mark Miller
>Assignee: Mark Miller
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2106.patch
>
>
> Only the Searcher is closed, but because the reader is passed to the 
> Searcher, the Searcher does not close the Reader, causing a resource leak.
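
A sketch of the leak pattern being fixed, under the Lucene 3.x API: a searcher
constructed over an existing IndexReader does not own that reader, so closing
only the searcher leaves the reader (and its file descriptors) open.

{code}
// 'dir' is an already-open Directory holding the benchmark index.
IndexReader reader = IndexReader.open(dir, true); // read-only reader
IndexSearcher searcher = new IndexSearcher(reader);
try {
  // ... run the search tasks ...
} finally {
  searcher.close(); // does not close a caller-supplied reader
  reader.close();   // without this line, file descriptors leak
}
{code}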

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2106) Benchmark does not close its Reader when OpenReader/CloseReader are not used

2009-12-03 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785642#action_12785642
 ] 

Mark Miller commented on LUCENE-2106:
-

I'll commit this in a day or two.

> Benchmark does not close its Reader when OpenReader/CloseReader are not used
> 
>
> Key: LUCENE-2106
> URL: https://issues.apache.org/jira/browse/LUCENE-2106
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/benchmark
>Reporter: Mark Miller
>Assignee: Mark Miller
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2106.patch
>
>
> Only the Searcher is closed, but because the reader is passed to the 
> Searcher, the Searcher does not close the Reader, causing a resource leak.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Announcement: Boilerplate removal library

2009-12-03 Thread Christian Kohlschütter
Dear all,

I think the following announcement is of interest to the Lucene community.

Today I have released Boilerpipe 1.0.

Boilerpipe is a Java library for boilerplate removal and fulltext extraction 
from HTML pages.
It is based upon my paper "Boilerplate Detection using Shallow Text Features"  
to be presented at WSDM 2010 -- The Third ACM International Conference on Web 
Search and Data Mining, 3-6 February 2010, New York City, NY USA.

The boilerpipe library provides algorithms to detect and remove the surplus 
"clutter" (boilerplate, templates) around the main textual content of a 
website. It already provides specific strategies for common tasks (for example: 
news article extraction) and may also be easily extended for individual problem 
settings. Extraction is very fast (milliseconds), requires just the input 
document (no global or site-level information), and is usually quite 
accurate.
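
A quick usage sketch, assuming the ArticleExtractor entry point (one of the
provided strategies; see the project page for authoritative examples):

{code}
import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeDemo {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://example.com/some-article.html"); // placeholder URL
    // Fetch the page and return just the main article text, boilerplate stripped.
    String text = ArticleExtractor.INSTANCE.getText(url);
    System.out.println(text);
  }
}
{code}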

You can find Boilerpipe at http://code.google.com/p/boilerpipe/

The code is released under the Apache 2.0 license and you are very welcome to 
use Boilerpipe for whatever you like. Please let me know if it helps you, if 
you have questions about it, difficulties with it, or ideas for how to improve it.

Cheers,
Christian
-- 
Christian Kohlschütter
kohlschuet...@l3s.de

Forschungszentrum L3S
Leibniz Universität Hannover

http://www.L3S.de/~kohlschuetter/


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2111) Wrapup flexible indexing

2009-12-03 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2111:
--

Affects Version/s: Flex Branch

> Wrapup flexible indexing
> 
>
> Key: LUCENE-2111
> URL: https://issues.apache.org/jira/browse/LUCENE-2111
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: 3.1
>
>
> Spinoff from LUCENE-1458.
> The flex branch is in fairly good shape -- all tests pass, initial search 
> performance testing looks good, it survived several visits from the Unicode 
> policeman ;)
> But it still has a number of nocommits, could use some more scrutiny 
> especially on the "emulate old API on flex index" and vice versa code paths, 
> and still needs some more performance testing.  I'll do these under this 
> issue, and we should open separate issues for other self-contained fixes.
> The end is in sight!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Patches for flex branch

2009-12-03 Thread Michael McCandless
Yeah it is quite hideously long... OK I'll open a "wrapup flex branch" issue :)

Mike

On Thu, Dec 3, 2009 at 6:07 PM, Michael Busch  wrote:
> I just suggested it, because 1458 got sooo long. We could have new issues
> for cleanup and merging back to trunk.
>
> I don't have a strong preference about leaving it open or not though.
>
>  Michael
>
> On 12/3/09 2:58 PM, Michael McCandless wrote:
>>
>> Yeah I would say we leave it open.
>>
>> There's also a good amount of fixing still (cleaning up the nocommits)
>> which likely should just go in under LUCENE-1458.
>>
>> Mike
>>
>> On Thu, Dec 3, 2009 at 5:55 PM, Mark Miller  wrote:
>>
>>>
>>> Why would we close it though? Doesn't it make sense to wait until it's
>>> merged into trunk ...
>>>
>>> Michael Busch wrote:
>>>
>>>> OK, done!
>>>>
>>>> I updated 1458 too. We can probably resolve that one now and open more
>>>> specific issues.
>>>>
>>>>  Michael
>>>>
>>>> On 12/3/09 12:14 PM, Michael McCandless wrote:
>>>>
>>>>> +1, good idea!
>>>>>
>>>>> Mike
>>>>>
>>>>> On Thu, Dec 3, 2009 at 2:41 PM, Michael Busch
>>>>> wrote:
>>>>>
>>>>>> I was thinking we could create a new version in Jira for the flex
>>>>>> branch (that's what Hadoop HDFS is doing with their append branch).
>>>>>> Then we can open new Jira issues with fix version=flex branch. It's
>>>>>> getting a bit confusing to always use 1458 for all changes :)
>>>>>>
>>>>>>   Michael
>>>>>>
>>>>>> -
>>>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>
>>> --
>>> - Mark
>>>
>>> http://www.lucidimagination.com
>>>
>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>
>>>
>>>
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>>
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1458.


Resolution: Fixed

We will continue work under new issues -- this one has gotten too big!

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-DocIdSetIterator.patch, 
> LUCENE-1458-DocIdSetIterator.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
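
A hypothetical walk of the enumeration chain named above; the accessor names
here are assumptions for illustration only, since the flex APIs were still in
flux at this point:

{code}
// Hypothetical flex-branch usage; method names are placeholders.
TermsEnum termsEnum = fieldProducer.terms("body"); // assumed per-field accessor
TermRef term;
while ((term = termsEnum.next()) != null) {        // iterate the field's terms
  DocsEnum docsEnum = termsEnum.docs();            // assumed postings accessor
  // iterate the documents (and positions) for the current term here
}
{code}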

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2111) Wrapup flexible indexing

2009-12-03 Thread Michael McCandless (JIRA)
Wrapup flexible indexing


 Key: LUCENE-2111
 URL: https://issues.apache.org/jira/browse/LUCENE-2111
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1


Spinoff from LUCENE-1458.

The flex branch is in fairly good shape -- all tests pass, initial search 
performance testing looks good, it survived several visits from the Unicode 
policeman ;)

But it still has a number of nocommits, could use some more scrutiny especially 
on the "emulate old API on flex index" and vice/versa code paths, and still 
needs some more performance testing.  I'll do these under this issue, and we 
should open separate issues for other self contained fixes.

The end is in sight!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Patches for flex branch

2009-12-03 Thread Michael Busch
I just suggested it, because 1458 got sooo long. We could have new 
issues for cleanup and merging back to trunk.


I don't have a strong preference about leaving it open or not though.

 Michael

On 12/3/09 2:58 PM, Michael McCandless wrote:

> Yeah I would say we leave it open.
>
> There's also a good amount of fixing still (cleaning up the nocommits)
> which likely should just go in under LUCENE-1458.
>
> Mike
>
> On Thu, Dec 3, 2009 at 5:55 PM, Mark Miller wrote:
>
>> Why would we close it though? Doesn't it make sense to wait until it's
>> merged into trunk ...
>>
>> Michael Busch wrote:
>>
>>> OK, done!
>>>
>>> I updated 1458 too. We can probably resolve that one now and open more
>>> specific issues.
>>>
>>>  Michael
>>>
>>> On 12/3/09 12:14 PM, Michael McCandless wrote:
>>>
>>>> +1, good idea!
>>>>
>>>> Mike
>>>>
>>>> On Thu, Dec 3, 2009 at 2:41 PM, Michael Busch wrote:
>>>>
>>>>> I was thinking we could create a new version in Jira for the flex
>>>>> branch (that's what Hadoop HDFS is doing with their append branch).
>>>>> Then we can open new Jira issues with fix version=flex branch. It's
>>>>> getting a bit confusing to always use 1458 for all changes :)
>>>>>
>>>>>   Michael
>>>>>
>>>>> -
>>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785606#action_12785606
 ] 

Michael McCandless commented on LUCENE-2108:


bq. Just a reminder - we need to fix the CHANGES.TXT entry once this is done.

Simon, how about you do this and take this issue (to commit your improvement to 
throw ACE not NPE)?  Thanks ;)
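
A sketch of that improvement, assuming the searcher field from the patch:
guard each public method so a closed SpellChecker fails fast with
AlreadyClosedException (ACE) instead of a NullPointerException.

{code}
private void ensureOpen() {
  if (searcher == null) {
    // org.apache.lucene.store.AlreadyClosedException, as used elsewhere in Lucene
    throw new AlreadyClosedException("Spellchecker has been closed");
  }
}
{code}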

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785605#action_12785605
 ] 

Michael McCandless commented on LUCENE-2108:


Eirik, could you open a new issue to address SpellChecker's non-thread-safety?  
I actually think simply documenting clearly that it's not thread safe is fine.

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Patches for flex branch

2009-12-03 Thread Michael Busch
You can also create a new issue "Merge flex branch into trunk" with fix 
version 3.1 ;)


 Michael

On 12/3/09 2:55 PM, Mark Miller wrote:

> Why would we close it though? Doesn't it make sense to wait until it's
> merged into trunk ...
>
> Michael Busch wrote:
>
>> OK, done!
>>
>> I updated 1458 too. We can probably resolve that one now and open more
>> specific issues.
>>
>>  Michael
>>
>> On 12/3/09 12:14 PM, Michael McCandless wrote:
>>
>>> +1, good idea!
>>>
>>> Mike
>>>
>>> On Thu, Dec 3, 2009 at 2:41 PM, Michael Busch wrote:
>>>
>>>> I was thinking we could create a new version in Jira for the flex
>>>> branch (that's what Hadoop HDFS is doing with their append branch).
>>>> Then we can open new Jira issues with fix version=flex branch. It's
>>>> getting a bit confusing to always use 1458 for all changes :)
>>>>
>>>>   Michael
>>>>
>>>> -
>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Patches for flex branch

2009-12-03 Thread Michael McCandless
Yeah I would say we leave it open.

There's also a good amount of fixing still (cleaning up the nocommits)
which likely should just go in under LUCENE-1458.

Mike

On Thu, Dec 3, 2009 at 5:55 PM, Mark Miller  wrote:
> Why would we close it though? Doesn't it make sense to wait until it's
> merged into trunk ...
>
> Michael Busch wrote:
>> OK, done!
>>
>> I updated 1458 too. We can probably resolve that one now and open more
>> specific issues.
>>
>>  Michael
>>
>> On 12/3/09 12:14 PM, Michael McCandless wrote:
>>> +1, good idea!
>>>
>>> Mike
>>>
>>> On Thu, Dec 3, 2009 at 2:41 PM, Michael Busch
>>> wrote:
>>>
>>>> I was thinking we could create a new version in Jira for the flex
>>>> branch (that's what Hadoop HDFS is doing with their append branch).
>>>> Then we can open new Jira issues with fix version=flex branch. It's
>>>> getting a bit confusing to always use 1458 for all changes :)
>>>>
>>>>   Michael
>>>>
>>>> -
>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org



>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>
>>>
>>>
>>
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Eirik Bjorsnos (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785603#action_12785603
 ] 

Eirik Bjorsnos commented on LUCENE-2108:


Mike,

Please account for my demonstrated stupidity when considering this suggestion 
for thread safety policy / goals:

1) Concurrent invocations of suggestSimilar() should not interfere with each 
other.
2) An invocation of any of the write methods (setSpellIndex, clearIndex, 
indexDictionary) should not interfere with an already invoked suggestSimilar.
3) All calls to write methods should be serialized (we could probably 
synchronize these methods?)

If we synchronize any writes to the searcher reference, couldn't suggestSimilar 
just start its work by putting searcher in a local variable and using that 
instead of the field?

I guess concurrency is hard to get right.. 
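
A sketch of the pattern Eirik describes, assuming the searcher field is
volatile and all writes to it are synchronized elsewhere (names illustrative):

{code}
private volatile IndexSearcher searcher;

public String[] suggestSimilar(String word, int numSug) throws IOException {
  final IndexSearcher indexSearcher = this.searcher; // read the field once
  if (indexSearcher == null) {
    throw new AlreadyClosedException("Spellchecker has been closed");
  }
  // Use only the local from here on: a concurrent close() or setSpellIndex()
  // can no longer swap the searcher out from under this invocation.
  return new String[0]; // placeholder for the real suggestion logic
}
{code}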

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-03 Thread Michael McCandless
On Thu, Dec 3, 2009 at 5:48 PM, Erick Erickson  wrote:
> I generified the searches/function files in patch 2037. I don't really think
> there's a conflict, just commit my patch and have at generifying the rest.

OK so then we'll start with 2037, then take 2065's patch, hopefully
updated to current trunk, but minus search/function sources.

> I know, I know. I did two things at once. So sue me. Honest, I'll try not to
> do this very often ...

In fact I prefer this.  I used to think we shouldn't do that but I
flip-flopped and now think in practice you just have to clean code
while you're there, otherwise it won't get cleaned.

> Mike:
> > You really want to generify the whole shootin' match or do you want
> to partition them? I'll be happy to take a set of them. Or would that make
> things too complicated to apply?

> 2065 already has done a lot here (adding generics to the tests)... I
think we start from that and take it from there?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Patches for flex branch

2009-12-03 Thread Mark Miller
Why would we close it though? Doesn't it make sense to wait until it's
merged into trunk ...

Michael Busch wrote:
> OK, done!
>
> I updated 1458 too. We can probably resolve that one now and open more
> specific issues.
>
>  Michael
>
> On 12/3/09 12:14 PM, Michael McCandless wrote:
>> +1, good idea!
>>
>> Mike
>>
>> On Thu, Dec 3, 2009 at 2:41 PM, Michael Busch 
>> wrote:
>>   
>>> I was thinking we could create a new version in Jira for the flex
>>> branch
>>> (that's what Hadoop HDFS is doing with their append branch). Then we
>>> can
>>> open new Jira issues with fix version=flex branch. It's getting a bit
>>> confusing to always use 1458 for all changes :)
>>>
>>>   Michael
>>>
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>
>>>
>>>  
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>>
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>


-- 
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-03 Thread Erick Erickson
I generified the searches/function files in patch 2037. I don't really think
there's a conflict, just commit my patch and have at generifying the rest.

I know, I know. I did two things at once. So sue me. Honest, I'll try not to
do this very often ...

Mike:
You really want to generify the whole shootin' match or do you want
to partition them? I'll be happy to take a set of them. Or would that make
things too complicated to apply?

Erick

On Thu, Dec 3, 2009 at 3:15 PM, Michael McCandless (JIRA)
wrote:

>
>[
> https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785479#action_12785479]
>
> Michael McCandless commented on LUCENE-2037:
> 
>
> bq.  but there is another patch - LUCENE-2065 to port the existing tests to
> Java 5 generics
>
> Ahh thanks for the reminder -- I can take this one as well, but there will
> be conflicts b/w the two patches, I think.  Should we do the generics first
> (simpler change, but touches many files), and then the junit4 upgrade?
>
> > Allow Junit4 tests in our environment.
> > --
> >
> > Key: LUCENE-2037
> > URL: https://issues.apache.org/jira/browse/LUCENE-2037
> > Project: Lucene - Java
> >  Issue Type: Improvement
> >  Components: Other
> >Affects Versions: 3.1
> > Environment: Development
> >Reporter: Erick Erickson
> >Assignee: Michael McCandless
> >Priority: Minor
> > Fix For: 3.1
> >
> > Attachments: junit-4.7.jar, LUCENE-2037.patch, LUCENE-2037.patch
> >
> >   Original Estimate: 8h
> >  Remaining Estimate: 8h
> >
> > Now that we're dropping Java 1.4 compatibility for 3.0, we can
> incorporate Junit4 in testing. Junit3 and junit4 tests can coexist, so no
> tests should have to be rewritten. We should start this for the 3.1 release
> so we can get a clean 3.0 out smoothly.
> > It's probably worthwhile to convert a small set of tests as an exemplar.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>
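
For reference, a minimal JUnit 4 style test of the kind the quoted issue
describes: annotation-driven, with no need to extend junit.framework.TestCase,
so it can sit next to the existing JUnit 3 tests.

{code}
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class TestSomething {
  @Test
  public void testAddition() {
    assertEquals(2, 1 + 1);
  }
}
{code}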


[jira] Created: (LUCENE-2110) Change FilteredTermsEnum to work like Iterator, so it is not positioned and next() must be always called first. Remove empty()

2009-12-03 Thread Uwe Schindler (JIRA)
Change FilteredTermsEnum to work like Iterator, so it is not positioned and 
next() must be always called first. Remove empty()
--

 Key: LUCENE-2110
 URL: https://issues.apache.org/jira/browse/LUCENE-2110
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: Flex Branch
Reporter: Uwe Schindler
 Fix For: Flex Branch


FilteredTermsEnum is confusing as it is initially positioned to the first term. 
It should instead work like an uninitialized TermsEnum for a field before the 
first call to next() or seek().
Also document that not all FilteredTermsEnums may implement seek(), as e.g. NRQ 
or Automaton are not able to support this. Seeking is also not needed for MTQ 
at all, so seek can just throw UOE.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2109) Make DocsEnum subclass of DocIdSetIterator

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785581#action_12785581
 ] 

Uwe Schindler commented on LUCENE-2109:
---

The DocIdSetIterator approach can easily be used by some queries or scorers, 
as it could e.g. be returned directly by filters. A TermFilter would simply 
return the DocsEnum for the specific TermRef as its iterator - very simple.
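To make that concrete, here is a minimal sketch, assuming LUCENE-2109 (DocsEnum 
extends DocIdSetIterator); the accessor termDocsEnum(field, term) is a 
hypothetical name, not an existing API:

{code}
// Sketch: a filter whose DocIdSet iterator is the DocsEnum itself.
public class TermFilter extends Filter {
  private final String field;
  private final TermRef term;

  public TermFilter(String field, TermRef term) {
    this.field = field;
    this.term = term;
  }

  @Override
  public DocIdSet getDocIdSet(final IndexReader reader) throws IOException {
    return new DocIdSet() {
      @Override
      public DocIdSetIterator iterator() throws IOException {
        // works once DocsEnum extends DocIdSetIterator
        return reader.termDocsEnum(field, term); // hypothetical accessor
      }
    };
  }
}
{code}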

> Make DocsEnum subclass of DocIdSetIterator
> --
>
> Key: LUCENE-2109
> URL: https://issues.apache.org/jira/browse/LUCENE-2109
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2109.patch
>
>
> Spinoff from LUCENE-1458:
> One thing I came across a long time ago, but with the new API it gets
> interesting again:
> DocsEnum should extend DocIdSetIterator; that would make it simpler to use
> and implement, e.g. in MatchAllDocsQuery.Scorer, FieldCacheRangeFilter and so
> on. You could e.g. write a filter for all documents that simply returns the
> docs enumeration from IndexReader.
> So it should be an abstract class that extends DocIdSetIterator. It has the
> same methods; only some must be renamed a little. The problem is that,
> because Java does not support multiple inheritance, we cannot also extend
> AttributeSource. If DocIdSetIterator were an interface, it would work (this
> is one of the cases where interfaces can be used for really simple patterns,
> like iterators).
> The problem with multiple inheritance could be solved by an additional method
> attributes() that creates a new AttributeSource on first access (because
> constructing an AttributeSource is costly). The same applies to the other
> *Enums; it should be separated out for lazy init.
> DocsEnum could look like this:
> {code}
> public abstract class DocsEnum extends DocIdSetIterator {
>   private AttributeSource atts = null; // lazily created; construction is costly
>   public abstract int freq();
>   public abstract DontKnowClassName positions(); // return type still undecided
>   public final AttributeSource attributes() {
>     if (atts == null) atts = new AttributeSource();
>     return atts;
>   }
>   // ...default impl of the bulk access using the abstract methods from
>   // DocIdSetIterator
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2109) Make DocsEnum subclass of DocIdSetIterator

2009-12-03 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-2109:
--

Attachment: LUCENE-2109.patch

Here is the patch.

> Make DocsEnum subclass of DocIdSetIterator
> --
>
> Key: LUCENE-2109
> URL: https://issues.apache.org/jira/browse/LUCENE-2109
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: Flex Branch
>
> Attachments: LUCENE-2109.patch
>
>
> Spinoff from LUCENE-1458:
> One thing I came across a long time ago, but with the new API it gets
> interesting again:
> DocsEnum should extend DocIdSetIterator; that would make it simpler to use
> and implement, e.g. in MatchAllDocsQuery.Scorer, FieldCacheRangeFilter and so
> on. You could e.g. write a filter for all documents that simply returns the
> docs enumeration from IndexReader.
> So it should be an abstract class that extends DocIdSetIterator. It has the
> same methods; only some must be renamed a little. The problem is that,
> because Java does not support multiple inheritance, we cannot also extend
> AttributeSource. If DocIdSetIterator were an interface, it would work (this
> is one of the cases where interfaces can be used for really simple patterns,
> like iterators).
> The problem with multiple inheritance could be solved by an additional method
> attributes() that creates a new AttributeSource on first access (because
> constructing an AttributeSource is costly). The same applies to the other
> *Enums; it should be separated out for lazy init.
> DocsEnum could look like this:
> {code}
> public abstract class DocsEnum extends DocIdSetIterator {
>   private AttributeSource atts = null; // lazily created; construction is costly
>   public abstract int freq();
>   public abstract DontKnowClassName positions(); // return type still undecided
>   public final AttributeSource attributes() {
>     if (atts == null) atts = new AttributeSource();
>     return atts;
>   }
>   // ...default impl of the bulk access using the abstract methods from
>   // DocIdSetIterator
> }
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2109) Make DocsEnum subclass of DocIdSetIterator

2009-12-03 Thread Uwe Schindler (JIRA)
Make DocsEnum subclass of DocIdSetIterator
--

 Key: LUCENE-2109
 URL: https://issues.apache.org/jira/browse/LUCENE-2109
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Affects Versions: Flex Branch
Reporter: Uwe Schindler
Assignee: Uwe Schindler
 Fix For: Flex Branch


Spinoff from LUCENE-1458:

One thing I came across a long time ago, but with the new API it gets 
interesting again: 
DocsEnum should extend DocIdSetIterator; that would make it simpler to use and 
implement, e.g. in MatchAllDocsQuery.Scorer, FieldCacheRangeFilter and so on. 
You could e.g. write a filter for all documents that simply returns the docs 
enumeration from IndexReader.

So it should be an abstract class that extends DocIdSetIterator. It has the 
same methods; only some must be renamed a little. The problem is that, because 
Java does not support multiple inheritance, we cannot also extend 
AttributeSource. If DocIdSetIterator were an interface, it would work (this is 
one of the cases where interfaces can be used for really simple patterns, like 
iterators).

The problem with multiple inheritance could be solved by an additional method 
attributes() that creates a new AttributeSource on first access (because 
constructing an AttributeSource is costly). The same applies to the other 
*Enums; it should be separated out for lazy init.

DocsEnum could look like this:

{code}
public abstract class DocsEnum extends DocIdSetIterator {
  private AttributeSource atts = null; // lazily created; construction is costly
  public abstract int freq();
  public abstract DontKnowClassName positions(); // return type still undecided
  public final AttributeSource attributes() {
    if (atts == null) atts = new AttributeSource();
    return atts;
  }
  // ...default impl of the bulk access using the abstract methods from
  // DocIdSetIterator
}
{code}


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Patches for flex branch

2009-12-03 Thread Uwe Schindler
That's good!

I will create a new issue out of my DocIdSetIterator and AttributeSource
patch, and we can then close LUCENE-1458.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Michael Busch [mailto:busch...@gmail.com]
> Sent: Thursday, December 03, 2009 11:12 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Patches for flex branch
> 
> OK, done!
> 
> I updated 1458 too. We can probably resolve that one now and open more
> specific issues.
> 
>   Michael
> 
> On 12/3/09 12:14 PM, Michael McCandless wrote:
> > +1, good idea!
> >
> > Mike
> >
> > On Thu, Dec 3, 2009 at 2:41 PM, Michael Busch
> wrote:
> >
> >> I was thinking we could create a new version in Jira for the flex
> branch
> >> (that's what Hadoop HDFS is doing with their append branch). Then we
> can
> >> open new Jira issues with fix version=flex branch. It's getting a bit
> >> confusing to always use 1458 for all changes :)
> >>
> >>   Michael
> >>
> >> -
> >> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >>
> >>
> >>
> > -
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> >
> >
> 
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2107) Add contrib/fast-vector-highlighter to Maven central repo

2009-12-03 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer reassigned LUCENE-2107:
---

Assignee: Simon Willnauer

> Add contrib/fast-vector-highlighter to Maven central repo
> -
>
> Key: LUCENE-2107
> URL: https://issues.apache.org/jira/browse/LUCENE-2107
> Project: Lucene - Java
>  Issue Type: Task
>  Components: contrib/*
>Affects Versions: 2.9.1, 3.0
>Reporter: Chas Emerick
>Assignee: Simon Willnauer
>
> I'm not at all familiar with the Lucene build/deployment process, but it 
> would be very nice if releases of the fast vector highlighter were pushed to 
> the maven central repository, as is done with other contrib modules.
> (Issue filed at the request of Grant Ingersoll.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2091) Add BM25 Scoring to Lucene

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785569#action_12785569
 ] 

Uwe Schindler edited comment on LUCENE-2091 at 12/3/09 10:17 PM:
-

Thanks for the explanation!

About the IDF: The problem with a per-document IDF in Lucene would be that most 
users also add fields that are e.g. catch-all fields (which would provide the 
per-document IDF), but in addition they add special fields like numeric fields 
(which would not produce a good IDF, but at the moment this IDF is ignored). 
Some users also add fields simply for sorting. So an IDF for whole documents is 
impossible with Lucene. You can only use e.g. catch-all fields (which are always 
a good idea for non-fielded searches, because OR'ing all fields together is 
slower than just indexing the same terms a second time in a catch-all field), 
e.g. "contents" contains all terms from "title", "subject", "mailtext" as an 
example for emails. But the IDF for BM25F could be taken from the "contents" 
field even when searching only for a title.

  was (Author: thetaphi):
Thanks for the explanation!

About the IDF: The problem with a per-document IDF in Lucene would be that most 
users also add fields that are e.g. catch-all fields (which would be the IDF you 
want to have), but in addition they add special fields like numeric fields 
(which would not produce a good IDF; at the moment this IDF is ignored). Some 
users also add fields simply for sorting. So an IDF for whole documents is 
impossible with Lucene. You can only use e.g. catch-all fields (which are always 
a good idea for non-fielded searches, because OR'ing all fields together is 
slower than just indexing the same terms a second time in a catch-all field), 
e.g. "contents" contains all terms from "title", "subject", "mailtext" as an 
example for emails. But the IDF for BM25F could be taken from the "contents" 
field even when searching only for a title.
  
> Add BM25 Scoring to Lucene
> --
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Yuval Feinstein
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of 
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed 
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime 
> somewhat.
> I would like to contribute the code to Lucene under contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785569#action_12785569
 ] 

Uwe Schindler commented on LUCENE-2091:
---

Thanks for the explanation!

About the IDF: The problem with a per-document IDF in Lucene would be that most 
users also add fields that are e.g. catch-all fields (which would be the IDF you 
want to have), but in addition they add special fields like numeric fields 
(which would not produce a good IDF; at the moment this IDF is ignored). Some 
users also add fields simply for sorting. So an IDF for whole documents is 
impossible with Lucene. You can only use e.g. catch-all fields (which are always 
a good idea for non-fielded searches, because OR'ing all fields together is 
slower than just indexing the same terms a second time in a catch-all field), 
e.g. "contents" contains all terms from "title", "subject", "mailtext" as an 
example for emails. But the IDF for BM25F could be taken from the "contents" 
field even when searching only for a title.
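For illustration, such a catch-all field could be built at indexing time like 
this (a sketch against the 3.0 field API; the field names are just examples):

{code}
// Index the per-field text a second time into a catch-all "contents"
// field, so a document-level IDF can be taken from a single field.
Document doc = new Document();
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("subject", subject, Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("mailtext", mailtext, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("contents", title + " " + subject + " " + mailtext,
    Field.Store.NO, Field.Index.ANALYZED));
{code}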

> Add BM25 Scoring to Lucene
> --
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Yuval Feinstein
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of 
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed 
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime 
> somewhat.
> I would like to contribute the code to Lucene under contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Patches for flex branch

2009-12-03 Thread Michael Busch

OK, done!

I updated 1458 too. We can probably resolve that one now and open more 
specific issues.


 Michael

On 12/3/09 12:14 PM, Michael McCandless wrote:

+1, good idea!

Mike

On Thu, Dec 3, 2009 at 2:41 PM, Michael Busch  wrote:
   

I was thinking we could create a new version in Jira for the flex branch
(that's what Hadoop HDFS is doing with their append branch). Then we can
open new Jira issues with fix version=flex branch. It's getting a bit
confusing to always use 1458 for all changes :)

  Michael

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


 

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


   



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-1458:
--

Affects Version/s: (was: 2.9)
   Flex Branch
Fix Version/s: Flex Branch

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: Flex Branch
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: Flex Branch
>
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-DocIdSetIterator.patch, 
> LUCENE-1458-DocIdSetIterator.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPostions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package-private APIs on that branch, then fix the nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785567#action_12785567
 ] 

Simon Willnauer commented on LUCENE-2108:
-

Just a reminder - we need to fix the CHANGES.TXT entry once this is done. 

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785563#action_12785563
 ] 

Simon Willnauer commented on LUCENE-2108:
-

bq. I'd assume ensureOpen needs to be synchronized some way so that two threads 
can't open IndexSearchers concurrently?

this class is not threadsafe anyway. If you look at this snippet:
{code}
// close the old searcher, if there was one
if (searcher != null) {
  searcher.close();
}
searcher = new IndexSearcher(this.spellIndex, true);
{code}
there could be a race if you concurrently reindex or set a new dictionary. IMO 
this should either be documented or made threadsafe. The close method should 
invalidate the spellchecker - it should not be possible to use an already 
closed SpellChecker.
The searcher should somehow be ref-counted so that, while a searcher is still 
in use, you can concurrently reindex or set a new dictionary and still ensure 
the same searcher is used throughout suggestSimilar() (see the sketch below). 

I will take care of it once I get back tomorrow.
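For illustration, the ref counting could look roughly like this (a sketch only; 
the names are made up and this is not the eventual patch):

{code}
// Sketch: per-searcher ref counts so suggestSimilar() keeps the searcher it
// started with, even if setSpellIndex() swaps in a new one concurrently.
private IndexSearcher current; // guarded by 'this'
private final Map<IndexSearcher, Integer> refs =
    new HashMap<IndexSearcher, Integer>();

private synchronized IndexSearcher acquire() {
  refs.put(current, refs.get(current) + 1);
  return current;
}

private synchronized void release(IndexSearcher s) throws IOException {
  int n = refs.get(s) - 1;
  refs.put(s, n);
  if (n == 0 && s != current) { // swapped out and no users left: close it
    refs.remove(s);
    s.close();
  }
}

private synchronized void swap(IndexSearcher newSearcher) throws IOException {
  IndexSearcher old = current;
  current = newSearcher;
  refs.put(newSearcher, 0);
  if (old != null && refs.get(old) == 0) { // old searcher already idle
    refs.remove(old);
    old.close();
  }
}
{code}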

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Eirik Bjorsnos (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785541#action_12785541
 ] 

Eirik Bjorsnos commented on LUCENE-2108:


Well, not exactly. Simon's suggestion was just to throw an 
AlreadyClosedException instead of a NullPointerException, which is probably OK 
and definitely easier.

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene

2009-12-03 Thread Joaquin Perez-Iglesias (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785538#action_12785538
 ] 

Joaquin Perez-Iglesias commented on LUCENE-2091:


Hi everybody,

I'm going to try to answer some of your questions. When I started to develop 
this library I didn't want to modify the Lucene code; moreover, I tried to 
create a jar that could be added straight to the official Lucene distribution. 
That is the main reason why there are some duplicated classes.
So yes, a tighter integration would be better, and I believe we would get more 
support for different query types.

Regarding BM25 vs. BM25F: they are equivalent; BM25F is the version for more 
than one field, so yes, go for BM25F.
What is really important is the way the boost factors are applied: as you can 
see in the equation, they must be applied to raw frequencies, not to normalized 
or saturated frequencies.
(Currently Lucene applies them after normalization and saturation of 
frequencies, which in my opinion is not the best approach.)
A more detailed explanation of BM25F and this issue can be found in this paper: 
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.5255

The problem, as I said, comes from IDF. In the BM25 family of equations, IDF is 
always computed at document level (that is why I recommend, as a heuristic, 
using the field with the most terms, or a special field that contains all the 
terms). As far as I know that is a problem because Lucene doesn't store the 
document frequency per document, only per field.

Otis is right: as far as I know, just changing the Similarity is not enough. 
Some data is available neither to TermScorer nor to the Similarity, and 
TermScorer applies the values obtained from the Similarity in a way that makes 
it incompatible with BM25.
It is really important to follow the steps in the order they appear in my 
explanation:

1. Normalize frequencies with the document/field length and the b factor.
2. Saturate the effect of frequency with k1.
3. Compute the sum of the term weights.
4. Apply IDF.

I really believe this can be done (not sure how), so maybe we will need the 
suggestions of some 'scorer guru'.
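To make the four steps concrete, the per-term weight of the simple BM25F 
variant could be computed like this (a sketch of the formula as described 
above and in the Robertson/Zaragoza literature, not code from the attached 
patch; all names are illustrative):

{code}
// Steps 1-2 for one query term: apply the boosts to the raw per-field
// frequencies, length-normalize with b, then saturate with k1. IDF is
// applied last (step 4); the query score is the sum over the query terms.
static float bm25fTermWeight(float[] tf, float[] boost, float[] b,
                             float[] fieldLen, float[] avgFieldLen,
                             float k1, float idf) {
  float combined = 0f;
  for (int f = 0; f < tf.length; f++) {
    combined += boost[f] * tf[f]
        / ((1f - b[f]) + b[f] * fieldLen[f] / avgFieldLen[f]);
  }
  return idf * combined / (k1 + combined);
}
{code}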

> Add BM25 Scoring to Lucene
> --
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Yuval Feinstein
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of 
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed 
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime 
> somewhat.
> I would like to contribute the code to Lucene under contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Reopened: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-2108:



Reopening to get the AlreadyClosedException in there...

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Eirik Bjorsnos (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785531#action_12785531
 ] 

Eirik Bjorsnos commented on LUCENE-2108:


Simon,

Yes, that sounds exactly like what I was thinking when I said "some code
that reopens the searcher if the reference to it is null".

I just didn't include it in my patch because I couldn't figure out how to do it 
properly.

I'd assume ensureOpen needs to be synchronized in some way so that two threads 
can't open IndexSearchers concurrently?



> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Simon Willnauer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Willnauer updated LUCENE-2108:


Attachment: LUCENE-2108.patch

Something like that would be more appropriate, IMO.

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch, LUCENE-2108.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785527#action_12785527
 ] 

Simon Willnauer commented on LUCENE-2108:
-

Mike / Eirik,

If you set the searcher to null you risk an NPE if suggestSimilar() or other 
methods are called afterwards. I would like to see something like ensureOpen(), 
which throws an AlreadyClosedException or something similar. I will upload a 
suggestion in a second but need to run, so treat it just as a suggestion.
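Roughly this shape (a sketch only; the attached patch is what counts):

{code}
// Every public method would call ensureOpen() first, so a closed
// SpellChecker fails fast instead of hitting a NullPointerException.
private volatile boolean closed = false;

private void ensureOpen() {
  if (closed) {
    throw new AlreadyClosedException("Spellchecker has been closed");
  }
}

public void close() throws IOException {
  ensureOpen();
  closed = true;
  if (searcher != null) {
    searcher.close();
    searcher = null;
  }
}
{code}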

Simon 

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2009-12-03 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785524#action_12785524
 ] 

Steven Rowe commented on LUCENE-2074:
-

Thanks, Uwe, that makes sense.  My bad, I only skimmed the patch, and 
misunderstood "3.0" in one of the new files to refer to the Lucene version, not 
the Unicode version. :)

> Use a separate JFlex generated Unicode 4 by Java 5 compatible 
> StandardTokenizer
> ---
>
> Key: LUCENE-2074
> URL: https://issues.apache.org/jira/browse/LUCENE-2074
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
> LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
> LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated for Java 1.4 
> (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
> regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. 
> Because of that we should only use the new TokenizerImpl when 
> Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785507#action_12785507
 ] 

Michael McCandless commented on LUCENE-2108:


bq. Dude, you have to be a human to make mistakes as stupid as these!

Good point :)



> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Resolved: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-2108.


   Resolution: Fixed
Fix Version/s: 3.1
   3.0.1

Thanks Eirik!

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Fix For: 3.0.1, 3.1
>
> Attachments: LUCENE-2108-SpellChecker-close.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Eirik Bjorsnos (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785504#action_12785504
 ] 

Eirik Bjorsnos commented on LUCENE-2108:



Dude, you have to be a human to make mistakes as stupid as these!

(pubic void close, public void close, public void close...)

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Attachments: LUCENE-2108-SpellChecker-close.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785500#action_12785500
 ] 

Michael McCandless commented on LUCENE-2108:


Note that you said "private" again ;)  I'm starting to wonder if you are not 
human!  Is this a Turing test?

OK, ok, I'll make it public, and port back to the 3.0 branch!

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Attachments: LUCENE-2108-SpellChecker-close.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Eirik Bjorsnos (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785492#action_12785492
 ] 

Eirik Bjorsnos commented on LUCENE-2108:


Haha, this is why I said the patch should be "pretty" trivial, instead of just 
"trivial" :-)

Yes, it should certainly be private. No idea how that happened. Must have been 
sleeping at the keyboard.

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Attachments: LUCENE-2108-SpellChecker-close.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2108:
--

Assignee: Michael McCandless

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
>Assignee: Michael McCandless
> Attachments: LUCENE-2108-SpellChecker-close.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785484#action_12785484
 ] 

Michael McCandless commented on LUCENE-2108:


Shouldn't the new close() method be public?

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
> Attachments: LUCENE-2108-SpellChecker-close.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2065) Java 5 port phase II

2009-12-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2065:
--

Assignee: Michael McCandless

> Java 5 port phase II 
> -
>
> Key: LUCENE-2065
> URL: https://issues.apache.org/jira/browse/LUCENE-2065
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.1
> Environment: Java 5 
>Reporter: Kay Kay
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2065.patch
>
>
> LUCENE-1257 addresses the public API changes ( generics , mainly ) and other 
> j.u.c. package changes related to the API .  The changes are frozen and 
> closed for 3.0 . This would be a placeholder JIRA for 3.0+ version to address 
> the pending changes ( tests for generics etc.) and any other internal API 
> changes as necessary. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2065) Java 5 port phase II

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785480#action_12785480
 ] 

Michael McCandless commented on LUCENE-2065:


Kay Kay, it looks like this patch no longer cleanly applies -- can you sync it 
up w/ current trunk?  Thanks!


> Java 5 port phase II 
> -
>
> Key: LUCENE-2065
> URL: https://issues.apache.org/jira/browse/LUCENE-2065
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 3.1
> Environment: Java 5 
>Reporter: Kay Kay
> Fix For: 3.1
>
> Attachments: LUCENE-2065.patch
>
>
> LUCENE-1257 addresses the public API changes ( generics , mainly ) and other 
> j.u.c. package changes related to the API .  The changes are frozen and 
> closed for 3.0 . This would be a placeholder JIRA for 3.0+ version to address 
> the pending changes ( tests for generics etc.) and any other internal API 
> changes as necessary. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785479#action_12785479
 ] 

Michael McCandless commented on LUCENE-2037:


bq.  but there is another patch - LUCENE-2065 to port the existing tests to 
Java 5 generics

Ahh, thanks for the reminder -- I can take this one as well, but there will be 
conflicts b/w the two patches, I think.  Should we do the generics first 
(simpler change, but touches many files), and then the junit4 upgrade?

> Allow Junit4 tests in our environment.
> --
>
> Key: LUCENE-2037
> URL: https://issues.apache.org/jira/browse/LUCENE-2037
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Affects Versions: 3.1
> Environment: Development
>Reporter: Erick Erickson
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: junit-4.7.jar, LUCENE-2037.patch, LUCENE-2037.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate 
> Junit4 in testing. Junit3 and junit4 tests can coexist, so no tests should 
> have to be rewritten. We should start this for the 3.1 release so we can get 
> a clean 3.0 out smoothly.
> It's probably worthwhile to convert a small set of tests as an exemplar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Patches for flex branch

2009-12-03 Thread Michael McCandless
+1, good idea!

Mike

On Thu, Dec 3, 2009 at 2:41 PM, Michael Busch  wrote:
> I was thinking we could create a new version in Jira for the flex branch
> (that's what Hadoop HDFS is doing with their append branch). Then we can
> open new Jira issues with fix version=flex branch. It's getting a bit
> confusing to always use 1458 for all changes :)
>
>  Michael
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Eirik Bjorsnos (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eirik Bjorsnos updated LUCENE-2108:
---

Attachment: LUCENE-2108-SpellChecker-close.patch

Patch that adds a close method to SpellChecker. The method calls close on the 
searcher in use and then nulls the reference, so that a new IndexSearcher will 
be created by the next call to setSpellIndex.
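
A minimal sketch of what such a close() method might look like (the field name 
is an assumption for illustration; the authoritative change is the attached 
patch):

{code}
// Sketch only -- the real change is in the attached patch. Assumes
// SpellChecker keeps its searcher in a field named "searcher".
public void close() throws IOException {
  if (searcher != null) {
    searcher.close();  // releases the underlying IndexReader and its
    searcher = null;   // file descriptors; the next setSpellIndex()
  }                    // call creates a fresh IndexSearcher
}
{code}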

> SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
> SpellChecker internally
> -
>
> Key: LUCENE-2108
> URL: https://issues.apache.org/jira/browse/LUCENE-2108
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spellchecker
>Affects Versions: 3.0
>Reporter: Eirik Bjorsnos
> Attachments: LUCENE-2108-SpellChecker-close.patch
>
>
> I can't find any way to close the IndexSearcher (and IndexReader) that
> is being used by SpellChecker internally.
> I've worked around this issue by keeping a single SpellChecker open
> for each index, but I'd really like to be able to close it and
> reopen it on demand without leaking file descriptors.
> Could we add a close() method to SpellChecker that will close the
> IndexSearcher and null the reference to it? And perhaps add some code
> that reopens the searcher if the reference to it is null? Or would
> that break thread safety of SpellChecker?
> The attached patch adds a close method but leaves it to the user to
> call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2108) SpellChecker file descriptor leak - no way to close the IndexSearcher used by SpellChecker internally

2009-12-03 Thread Eirik Bjorsnos (JIRA)
SpellChecker file descriptor leak - no way to close the IndexSearcher used by 
SpellChecker internally
-

 Key: LUCENE-2108
 URL: https://issues.apache.org/jira/browse/LUCENE-2108
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/spellchecker
Affects Versions: 3.0
Reporter: Eirik Bjorsnos


I can't find any way to close the IndexSearcher (and IndexReader) that
is being used by SpellChecker internally.

I've worked around this issue by keeping a single SpellChecker open
for each index, but I'd really like to be able to close it and
reopen it on demand without leaking file descriptors.

Could we add a close() method to SpellChecker that will close the
IndexSearcher and null the reference to it? And perhaps add some code
that reopens the searcher if the reference to it is null? Or would
that break thread safety of SpellChecker?

The attached patch adds a close method but leaves it to the user to
call setSpellIndex to reopen the searcher if desired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene

2009-12-03 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785473#action_12785473
 ] 

Otis Gospodnetic commented on LUCENE-2091:
--

+1 for skipping BM25 and going straight to BM25F.

I think the answer to Uwe's question about why this can't just be a different 
Similarity or some such is that BM25 requires some data that Lucene currently 
doesn't collect.  That's why there were some of those static methods in the 
examples on the author's site.  I *think* what I'm saying is correct. :)
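
For readers following along, a sketch of the textbook Okapi BM25 per-term 
weight (the standard formula, not code from the patch):

{code}
// Textbook Okapi BM25 weight for a single term/document pair.
// tf = term frequency in the doc, docLen = doc length in tokens,
// avgDocLen = average doc length; k1 and b are the free parameters
// (typical defaults: k1 = 1.2, b = 0.75).
public class Bm25Sketch {
  static double bm25Weight(double idf, double tf, double docLen,
                           double avgDocLen, double k1, double b) {
    double norm = k1 * ((1 - b) + b * (docLen / avgDocLen));
    return idf * (tf * (k1 + 1)) / (tf + norm);
  }
}
{code}

BM25F differs mainly in computing a weighted, per-field tf before applying the 
same saturation formula.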


> Add BM25 Scoring to Lucene
> --
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Yuval Feinstein
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of 
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed 
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime 
> somewhat.
> I would like to contribute the code to Lucene under contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Patches for flex branch

2009-12-03 Thread Michael Busch
I was thinking we could create a new version in Jira for the flex branch 
(that's what Hadoop HDFS is doing with their append branch). Then we can 
open new Jira issues with fix version=flex branch. It's getting a bit 
confusing to always use 1458 for all changes :)


 Michael

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-03 Thread Kay Kay (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785448#action_12785448
 ] 

Kay Kay commented on LUCENE-2037:
-

+1 w.r.t. JUnit 4.

Unrelated to this, but there is another patch, LUCENE-2065, to port the 
existing tests to Java 5 generics. Maybe somebody can have a look at it before 
it gets out of sync with the trunk altogether.

> Allow Junit4 tests in our environment.
> --
>
> Key: LUCENE-2037
> URL: https://issues.apache.org/jira/browse/LUCENE-2037
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Affects Versions: 3.1
> Environment: Development
>Reporter: Erick Erickson
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: junit-4.7.jar, LUCENE-2037.patch, LUCENE-2037.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate 
> Junit4 in testing. Junit3 and junit4 tests can coexist, so no tests should 
> have to be rewritten. We should start this for the 3.1 release so we can get 
> a clean 3.0 out smoothly.
> It's probably worthwhile to convert a small set of tests as an exemplar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785416#action_12785416
 ] 

Uwe Schindler commented on LUCENE-2074:
---

bq. I'm actually surprised that the DFAs are identical, since I'm almost 
certain that the set of characters matching [:letter:] changed between Unicode 
3.0 and Unicode 4.0 (maybe [:digit:] too). I'll take a look this weekend.

Because of that we have the patch: we now have two flex files, one with 
%unicode 3.0, which produces the same DFA as the old flex file did when 
processed with Java 1.4 (as it was in Lucene 2.x). This is used for backwards 
compatibility (selected via the matchVersion parameter of the ctor).

For later Lucene versions we will have a new jflex file (currently Unicode 
4.0) that produces the same matrix as Java 1.5 with jflex 1.4 does (at the 
moment).

This simply makes the parser regeneration invariant to the developer's JVM. 
At the moment this issue is about nothing more than that.
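
A rough sketch of the selection this enables (the scanner class names below 
are hypothetical stand-ins, not the names used in the patch):

{code}
import java.io.Reader;
import org.apache.lucene.util.Version;

// Illustrative only: Unicode40Scanner/Unicode30Scanner are hypothetical
// names for the two generated parsers.
class ScannerSelectionSketch {
  static Object pickScanner(Version matchVersion, Reader in) {
    if (matchVersion.onOrAfter(Version.LUCENE_30)) {
      return new Unicode40Scanner(in);  // from the new %unicode 4.0 grammar
    }
    return new Unicode30Scanner(in);    // from the %unicode 3.0 grammar,
  }                                     // matching the old Java 1.4 DFA
}
{code}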

> Use a separate JFlex generated Unicode 4 by Java 5 compatible 
> StandardTokenizer
> ---
>
> Key: LUCENE-2074
> URL: https://issues.apache.org/jira/browse/LUCENE-2074
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
> LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
> LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
> (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
> regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. 
> Because of that we should only use the new TokenizerImpl when 
> Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2009-12-03 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785414#action_12785414
 ] 

Steven Rowe commented on LUCENE-2074:
-

bq. Do you see a problem with just requiring Flex 1.5 for Lucene trunk at the 
moment?

I think it's fine to do that.

bq. The new parsers (see patch) are pre-generated in SVN, so somebody compiling 
lucene from source does not need to use jflex. And the parsers for 
StandardTokenizer are verified to work correctly and are even identical 
(DFA-wise) for the old Java 1.4 / Unicode 3.0 case.

Most of the StandardTokenizerImpl.jflex grammar is expressed in absolute terms 
- the only JVM-/Unicode-version-sensitive usages are [:letter:] and [:digit:], 
which under JFlex <1.5 were expanded using the scanner-generation-time JVM's 
Character.isLetter() and .isDigit() definitions, but under JFlex 1.5-SNAPSHOT 
depend on the declared Unicode version definitions (i.e., [:letter:] = 
\p{Letter}).

I'm actually surprised that the DFAs are identical, since I'm almost certain 
that the set of characters matching [:letter:] changed between Unicode 3.0 and 
Unicode 4.0 (maybe [:digit:] too).  I'll take a look this weekend.
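
One quick way to see this JVM dependence is to count the code points the 
running JVM classifies as letters; the total differs between a Java 1.4 JVM 
(Unicode 3.0 tables) and a Java 5 JVM (Unicode 4.0 tables). A minimal sketch:

{code}
// Counts how many BMP code points the running JVM considers letters.
// JFlex <1.5 expanded [:letter:] with exactly this predicate at
// scanner-generation time, so the generated DFA varied with the
// generating JVM's Unicode tables.
public class LetterCount {
  public static void main(String[] args) {
    int letters = 0;
    for (int cp = 0; cp <= 0xFFFF; cp++) {
      if (Character.isLetter((char) cp)) {
        letters++;
      }
    }
    System.out.println("BMP letters: " + letters);
  }
}
{code}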


> Use a separate JFlex generated Unicode 4 by Java 5 compatible 
> StandardTokenizer
> ---
>
> Key: LUCENE-2074
> URL: https://issues.apache.org/jira/browse/LUCENE-2074
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
> LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
> LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
> (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
> regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. 
> Because of that we should only use the new TokenizerImpl when 
> Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785408#action_12785408
 ] 

Michael McCandless commented on LUCENE-2037:


Anyone have any concerns upgrading to Junit4?  I plan to commit in a few days...

> Allow Junit4 tests in our environment.
> --
>
> Key: LUCENE-2037
> URL: https://issues.apache.org/jira/browse/LUCENE-2037
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Affects Versions: 3.1
> Environment: Development
>Reporter: Erick Erickson
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: junit-4.7.jar, LUCENE-2037.patch, LUCENE-2037.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate 
> Junit4 in testing. Junit3 and junit4 tests can coexist, so no tests should 
> have to be rewritten. We should start this for the 3.1 release so we can get 
> a clean 3.0 out smoothly.
> It's probably worthwhile to convert a small set of tests as an exemplar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-2037) Allow Junit4 tests in our environment.

2009-12-03 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-2037:
--

Assignee: Michael McCandless  (was: Erick Erickson)

> Allow Junit4 tests in our environment.
> --
>
> Key: LUCENE-2037
> URL: https://issues.apache.org/jira/browse/LUCENE-2037
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Affects Versions: 3.1
> Environment: Development
>Reporter: Erick Erickson
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 3.1
>
> Attachments: junit-4.7.jar, LUCENE-2037.patch, LUCENE-2037.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> Now that we're dropping Java 1.4 compatibility for 3.0, we can incorporate 
> Junit4 in testing. Junit3 and junit4 tests can coexist, so no tests should 
> have to be rewritten. We should start this for the 3.1 release so we can get 
> a clean 3.0 out smoothly.
> It's probably worthwhile to convert a small set of tests as an exemplar.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: LUCENE-2037 (Junit4 capabilities)

2009-12-03 Thread Michael McCandless
I'll take it.  The patch looks good, and tests pass.  Thanks Erick,
and thanks for the reminder...

Mike

On Wed, Dec 2, 2009 at 12:52 PM, Erick Erickson  wrote:
> Is anyone thinking about committing this patch? And/or what do I need to
> do/should have done to indicate it's ready for review?
> Poor lonely patch, sitting out there all alone and neglected ...
> Erick

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



nightly build deploy to Maven repositories

2009-12-03 Thread Sanne Grinovero
Hello,
I need to depend on a recently committed bugfix from Lucene's
2.9 branch in other OSS projects, using Maven2 for dependency
management.

Are there snapshots uploaded somewhere regularly? Could Hudson do that?
Looking into Hudson, it appears that it regularly builds trunk;
wouldn't it be a good idea to have it also verify the 2.9 branch
while it's actively updated?

Regards,
Sanne

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785404#action_12785404
 ] 

Michael McCandless commented on LUCENE-1483:


OK I just removed SorterTemplate.java

> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> 
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Assignee: Michael McCandless
>Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, 
> sortCollate.py
>
>
> This issue changes how an IndexSearcher searches over multiple segments. The 
> current method of searching multiple segments is to use a MultiSegmentReader 
> and treat all of the segments as one. This causes filters and FieldCaches to 
> be keyed to the MultiReader and makes reopen expensive. If only a few 
> segments change, the FieldCache is still loaded for all of them.
> This patch changes things by searching each individual segment one at a time, 
> but sharing the HitCollector used across each segment. This allows 
> FieldCaches and Filters to be keyed on individual SegmentReaders, making 
> reopen much cheaper. FieldCache loading over multiple segments can be much 
> faster as well - with the old method, all unique terms for every segment is 
> enumerated against each segment - because of the likely logarithmic change in 
> terms per segment, this can be very wasteful. Searching individual segments 
> avoids this cost. The term/document statistics from the multireader are used 
> to score results for each segment.
> When sorting, its more difficult to use a single HitCollector for each sub 
> searcher. Ordinals are not comparable across segments. To account for this, a 
> new field sort enabled HitCollector is introduced that is able to collect and 
> sort across segments (because of its ability to compare ordinals across 
> segments). This TopFieldCollector class will collect the values/ordinals for 
> a given segment, and upon moving to the next segment, translate any 
> ordinals/values so that they can be compared against the values for the new 
> segment. This is done lazily.
> All and all, the switch seems to provide numerous performance benefits, in 
> both sorted and non sorted search. We were seeing a good loss on indices with 
> lots of segments (1000?) and certain queue sizes / queries, but the latest 
> results seem to show thats been mostly taken care of (you shouldnt be using 
> such a large queue on such a segmented index anyway).
> * Introduces
> ** MultiReaderHitCollector - a HitCollector that can collect across multiple 
> IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
> ** TopFieldCollector - a HitCollector that can compare values/ordinals across 
> IndexReaders and sort on fields.
> ** FieldValueHitQueue - a Priority queue that is part of the 
> TopFieldCollector implementation.
> ** FieldComparator - a new Comparator class that works across IndexReaders. 
> Part of the TopFieldCollector implementation.
> ** FieldComparatorSource - new class to allow for custom Comparators.
> * Alters
> ** IndexSearcher uses a single HitCollector to collect hits against each 
> individual SegmentReader. All the other changes stem from this ;)
> * Deprecates
> ** TopFieldDocCollector
> ** FieldSortedHitQueue
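
To make the shared-collector contract concrete, a minimal sketch against the 
Collector API this work evolved into (a simple hit counter; the class is 
illustrative, not code from the patch):

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Per-segment contract: the searcher calls setNextReader() once per
// segment, then collect() with segment-relative doc ids that can be
// rebased with docBase.
public class HitCountingCollector extends Collector {
  private int docBase;
  private int hits;

  public void setScorer(Scorer scorer) {}   // scores not needed here

  public void setNextReader(IndexReader reader, int docBase)
      throws IOException {
    this.docBase = docBase;                 // a new segment starts here
  }

  public void collect(int doc) throws IOException {
    hits++;                                 // (docBase + doc) would be the
  }                                         // index-wide doc id if needed

  public boolean acceptsDocsOutOfOrder() { return true; }

  public int getHits() { return hits; }
}
{code}

A searcher drives it with something like searcher.search(query, collector), 
switching segments via setNextReader as it goes.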

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2102) LowerCaseFilter for Turkish language

2009-12-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785379#action_12785379
 ] 

Robert Muir commented on LUCENE-2102:
-

Hello Simon, once this issue is resolved, can we open a separate issue (so we 
do not forget) to fix the SnowballAnalyzer when using the Turkish language?
I also think we should add some javadocs to the snowball stem filter that 
explain that you need to use this filter beforehand for it to work.

I already have some unit tests showing that it doesn't work correctly with 
LowerCaseFilter and that it also does not handle uppercase.
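
The JDK behavior behind this is easy to demonstrate in a couple of lines 
(standard java.lang/java.util APIs only):

{code}
import java.util.Locale;

// Why a locale-unaware LowerCaseFilter breaks Turkish: lowercasing
// with the Turkish locale maps 'I' to U+0131 (LATIN SMALL LETTER
// DOTLESS I), not to 'i'.
public class TurkishLowerCase {
  public static void main(String[] args) {
    System.out.println("I".toLowerCase(Locale.ENGLISH));         // i
    System.out.println("I".toLowerCase(new Locale("tr", "TR"))); // U+0131
  }
}
{code}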


> LowerCaseFilter for Turkish language
> 
>
> Key: LUCENE-2102
> URL: https://issues.apache.org/jira/browse/LUCENE-2102
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Affects Versions: 3.0
>Reporter: Ahmet Arslan
>Assignee: Simon Willnauer
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch, 
> LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch, LUCENE-2102.patch, 
> LUCENE-2102.patch
>
>
> java.lang.Character.toLowerCase() converts 'I' to 'i'; however, in the Turkish 
> alphabet the lowercase of 'I' is not 'i'. It is LATIN SMALL LETTER DOTLESS I.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785377#action_12785377
 ] 

Michael McCandless commented on LUCENE-1458:


bq.  There is no method in IndexReader that returns all docs?

Not yet (in flex API) -- we can add it?  IndexReader.allDocs(Bits skipDocs)?  
Or we could make AllDocsEnum public?  Hmm.

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-DocIdSetIterator.patch, 
> LUCENE-1458-DocIdSetIterator.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Re: "too many open files" on micro benchmark

2009-12-03 Thread Mark Miller
Yup, that was the issue.

Michael McCandless wrote:
> Make that LUCENE-2106 :)
>
> Mike
>
> On Thu, Dec 3, 2009 at 10:41 AM, Michael McCandless
>  wrote:
>   
>> Is this due to LUCENE-1206?
>>
>> Mike
>>
>> On Thu, Dec 3, 2009 at 8:34 AM, Mark Miller  wrote:
>> 
>>> Anyone else seeing this?
>>>
>>> Now when I try and run the micro-benchmark on trunk or flex branch, a
>>> few seconds in, I get :
>>>
>>> [java] Running algorithm from:
>>> /home/mark/workspace/lucene/contrib/benchmark/conf/micro-standard.alg
>>> [java] > config properties:
>>> [java] analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
>>> [java] compound = true
>>> [java] content.source =
>>> org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
>>> [java] directory = FSDirectory
>>> [java] doc.stored = true
>>> [java] doc.term.vector = false
>>> [java] doc.tokenized = true
>>> [java] docs.dir = reuters-out
>>> [java] log.queries = true
>>> [java] log.step = 500
>>> [java] max.buffered = buf:10:10:100:100
>>> [java] merge.factor = mrg:10:100:10:100
>>> [java] query.maker =
>>> org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
>>> [java] task.max.depth.log = 2
>>> [java] work.dir = work
>>> [java] ---
>>> [java] > queries:
>>> [java] 0. TermQuery - body:salomon
>>> [java] 1. TermQuery - body:comex
>>> [java] 2. BooleanQuery - body:night body:trading
>>> [java] 3. BooleanQuery - body:japan body:sony
>>> [java] 4. PhraseQuery - body:"sony japan"
>>> [java] 5. PhraseQuery - body:"food needs"~3
>>> [java] 6. BooleanQuery - +body:"world bank"^2.0 +body:nigeria
>>> [java] 7. BooleanQuery - body:"world bank" -body:nigeria
>>> [java] 8. PhraseQuery - body:"ford credit"~5
>>> [java] 9. BooleanQuery - body:airline body:europe body:canada
>>> body:destination
>>> [java] 10. BooleanQuery - body:long body:term body:pressure
>>> body:trade body:ministers body:necessary body:current body:uruguay
>>> body:round body:talks body:general body:agreement body:trade
>>> body:tariffs body:gatt body:succeed
>>> [java] 11. SpanFirstQuery - spanFirst(body:ford, 5)
>>> [java] 12. SpanNearQuery - spanNear([body:night, body:trading], 4,
>>> false)
>>> [java] 13. SpanNearQuery - spanNear([spanFirst(body:ford, 10),
>>> body:credit], 10, false)
>>> [java] 14. WildcardQuery - body:fo*
>>> [java] > algorithm:
>>> [java] Seq {
>>> [java] Rounds_4 {
>>> [java] ResetSystemErase
>>> [java] Populate {
>>> [java] -CreateIndex
>>> [java] MAddDocs_2000 {
>>> [java] AddDoc
>>> [java] > * 2000
>>> [java] -Optimize
>>> [java] -CloseIndex
>>> [java] }
>>> [java] OpenReader
>>> [java] SearchSameRdr_5000 {
>>> [java] Search
>>> [java] > * 5000
>>> [java] CloseReader
>>> [java] WarmNewRdr_50 {
>>> [java] Warm
>>> [java] > * 50
>>> [java] SrchNewRdr_500 {
>>> [java] Search
>>> [java] > * 500
>>> [java] SrchTrvNewRdr_300 {
>>> [java] SearchTrav(1000.0)
>>> [java] > * 300
>>> [java] SrchTrvRetNewRdr_100 {
>>> [java] SearchTravRet(2000.0)
>>> [java] > * 100
>>> [java] NewRound
>>> [java] } * 4
>>> [java] RepSumByName
>>> [java] RepSumByPrefRound MAddDocs
>>> [java] }
>>> [java] > starting task: Seq
>>> [java] > starting task: Rounds_4
>>> [java] > starting task: Populate
>>> [java] 1.5 sec --> main added 500 docs
>>> [java] 2.12 sec --> main added 1000 docs
>>> [java] 2.51 sec --> main added 1500 docs
>>> [java] 2.88 sec --> main added 2000 docs
>>> [java] > starting task: OpenReader
>>> [java] > starting task: SearchSameRdr_5000
>>> [java] 3.3 sec --> main processed 500 records
>>> [java] 3.39 sec --> main processed 1000 records
>>> [java] 3.45 sec --> main processed 1500 records
>>> [java] 3.5 sec --> main processed 2000 records
>>> [java] 3.54 sec --> main processed 2500 records
>>> [java] 3.58 sec --> main processed 3000 records
>>> [java] 3.62 sec --> main processed 3500 records
>>> [java] 3.66 sec --> main processed 4000 records
>>> [java] 3.69 sec --> main processed 4500 records
>>> [java] 3.72 sec --> main processed 5000 records
>>> [java] > starting task: CloseReader
>>> [java] > starting task: WarmNewRdr_50
>>> [java] > starting task: SrchNewRdr_500
>>> [java] 
>>> [java] ###  D O N E !!! ###
>>> [java] 
>>> [java

[jira] Created: (LUCENE-2107) Add contrib/fast-vector-highlighter to Maven central repo

2009-12-03 Thread Chas Emerick (JIRA)
Add contrib/fast-vector-highlighter to Maven central repo
-

 Key: LUCENE-2107
 URL: https://issues.apache.org/jira/browse/LUCENE-2107
 Project: Lucene - Java
  Issue Type: Task
  Components: contrib/*
Affects Versions: 3.0, 2.9.1
Reporter: Chas Emerick


I'm not at all familiar with the Lucene build/deployment process, but it would 
be very nice if releases of the fast vector highlighter were pushed to the 
maven central repository, as is done with other contrib modules.

(Issue filed at the request of Grant Ingersoll.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785365#action_12785365
 ] 

Uwe Schindler edited comment on LUCENE-1458 at 12/3/09 3:53 PM:


bq. Sweet! Wait, using AllDocsEnum you mean?

Yes, but this class is package private and unused! AllTermDocs is used by 
SegmentReader to support termDocs(null), but not AllDocsEnum. There is no 
method in IndexReader that returns all docs?

The matchAllDocs was just an example; there are more use cases, e.g. a 
TermsFilter (that is, the non-scoring TermQuery variant): just use the DocsEnum 
of this term as the DocIdSetIterator.

  was (Author: thetaphi):
bq. Sweet! Wait, using AllDocsEnum you mean?

Yes, but this class is package private and unused! AllTermDocs is used by 
SegmentReader to support termDocs(null), but not AllDocsEnum. There is no 
method in IndexReader that returns all docs?

The matchAllDocs was just an example, there are more use cases.
  
> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-DocIdSetIterator.patch, 
> LUCENE-1458-DocIdSetIterator.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>  

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785365#action_12785365
 ] 

Uwe Schindler commented on LUCENE-1458:
---

bq. Sweet! Wait, using AllDocsEnum you mean?

Yes, but this class is package private and unused! AllTermDocs is used by 
SegmentReader to support termDocs(null), but not AllDocsEnum. There is no 
method in IndexReader that returns all docs?

The matchAllDocs was just an example, there are more use cases.

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-DocIdSetIterator.patch, 
> LUCENE-1458-DocIdSetIterator.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas delta.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


--

Re: "too many open files" on micro benchmark

2009-12-03 Thread Michael McCandless
Is this due to LUCENE-1206?

Mike

On Thu, Dec 3, 2009 at 8:34 AM, Mark Miller  wrote:
> Anyone else seeing this?
>
> Now when I try and run the micro-benchmark on trunk or flex branch, a
> few seconds in, I get :
>
>     [java] Running algorithm from:
> /home/mark/workspace/lucene/contrib/benchmark/conf/micro-standard.alg
>     [java] > config properties:
>     [java] analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
>     [java] compound = true
>     [java] content.source =
> org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
>     [java] directory = FSDirectory
>     [java] doc.stored = true
>     [java] doc.term.vector = false
>     [java] doc.tokenized = true
>     [java] docs.dir = reuters-out
>     [java] log.queries = true
>     [java] log.step = 500
>     [java] max.buffered = buf:10:10:100:100
>     [java] merge.factor = mrg:10:100:10:100
>     [java] query.maker =
> org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
>     [java] task.max.depth.log = 2
>     [java] work.dir = work
>     [java] ---
>     [java] > queries:
>     [java] 0. TermQuery - body:salomon
>     [java] 1. TermQuery - body:comex
>     [java] 2. BooleanQuery - body:night body:trading
>     [java] 3. BooleanQuery - body:japan body:sony
>     [java] 4. PhraseQuery - body:"sony japan"
>     [java] 5. PhraseQuery - body:"food needs"~3
>     [java] 6. BooleanQuery - +body:"world bank"^2.0 +body:nigeria
>     [java] 7. BooleanQuery - body:"world bank" -body:nigeria
>     [java] 8. PhraseQuery - body:"ford credit"~5
>     [java] 9. BooleanQuery - body:airline body:europe body:canada
> body:destination
>     [java] 10. BooleanQuery - body:long body:term body:pressure
> body:trade body:ministers body:necessary body:current body:uruguay
> body:round body:talks body:general body:agreement body:trade
> body:tariffs body:gatt body:succeed
>     [java] 11. SpanFirstQuery - spanFirst(body:ford, 5)
>     [java] 12. SpanNearQuery - spanNear([body:night, body:trading], 4,
> false)
>     [java] 13. SpanNearQuery - spanNear([spanFirst(body:ford, 10),
> body:credit], 10, false)
>     [java] 14. WildcardQuery - body:fo*
>     [java] > algorithm:
>     [java] Seq {
>     [java]     Rounds_4 {
>     [java]         ResetSystemErase
>     [java]         Populate {
>     [java]             -CreateIndex
>     [java]             MAddDocs_2000 {
>     [java]                 AddDoc
>     [java]             > * 2000
>     [java]             -Optimize
>     [java]             -CloseIndex
>     [java]         }
>     [java]         OpenReader
>     [java]         SearchSameRdr_5000 {
>     [java]             Search
>     [java]         > * 5000
>     [java]         CloseReader
>     [java]         WarmNewRdr_50 {
>     [java]             Warm
>     [java]         > * 50
>     [java]         SrchNewRdr_500 {
>     [java]             Search
>     [java]         > * 500
>     [java]         SrchTrvNewRdr_300 {
>     [java]             SearchTrav(1000.0)
>     [java]         > * 300
>     [java]         SrchTrvRetNewRdr_100 {
>     [java]             SearchTravRet(2000.0)
>     [java]         > * 100
>     [java]         NewRound
>     [java]     } * 4
>     [java]     RepSumByName
>     [java]     RepSumByPrefRound MAddDocs
>     [java] }
>     [java] > starting task: Seq
>     [java] > starting task: Rounds_4
>     [java] > starting task: Populate
>     [java] 1.5 sec --> main added 500 docs
>     [java] 2.12 sec --> main added 1000 docs
>     [java] 2.51 sec --> main added 1500 docs
>     [java] 2.88 sec --> main added 2000 docs
>     [java] > starting task: OpenReader
>     [java] > starting task: SearchSameRdr_5000
>     [java] 3.3 sec --> main processed 500 records
>     [java] 3.39 sec --> main processed 1000 records
>     [java] 3.45 sec --> main processed 1500 records
>     [java] 3.5 sec --> main processed 2000 records
>     [java] 3.54 sec --> main processed 2500 records
>     [java] 3.58 sec --> main processed 3000 records
>     [java] 3.62 sec --> main processed 3500 records
>     [java] 3.66 sec --> main processed 4000 records
>     [java] 3.69 sec --> main processed 4500 records
>     [java] 3.72 sec --> main processed 5000 records
>     [java] > starting task: CloseReader
>     [java] > starting task: WarmNewRdr_50
>     [java] > starting task: SrchNewRdr_500
>     [java] 
>     [java] ###  D O N E !!! ###
>     [java] 
>     [java] Error: cannot execute the algorithm!
> /home/mark/workspace/lucene/contrib/benchmark/work/index/_0.cfx (Too
> many open files)
>     [java] java.io.FileNotFoundException:
> /home/mark/workspace/lucene/contrib/benchmark/work/index/_0.cfx (Too
> many open files)
>     [java]     at java.io.RandomAccessFile.open(Native Method)
>     [java]     at
> java.io.RandomAccessFile.(RandomAccessFile.jav

Re: "too many open files" on micro benchmark

2009-12-03 Thread Michael McCandless
Make that LUCENE-2106 :)

Mike

On Thu, Dec 3, 2009 at 10:41 AM, Michael McCandless
 wrote:
> Is this due to LUCENE-1206?
>
> Mike
>
> On Thu, Dec 3, 2009 at 8:34 AM, Mark Miller  wrote:
>> Anyone else seeing this?
>>
>> Now when I try and run the micro-benchmark on trunk or flex branch, a
>> few seconds in, I get :
>>
>>     [java] Running algorithm from:
>> /home/mark/workspace/lucene/contrib/benchmark/conf/micro-standard.alg
>>     [java] > config properties:
>>     [java] analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
>>     [java] compound = true
>>     [java] content.source =
>> org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
>>     [java] directory = FSDirectory
>>     [java] doc.stored = true
>>     [java] doc.term.vector = false
>>     [java] doc.tokenized = true
>>     [java] docs.dir = reuters-out
>>     [java] log.queries = true
>>     [java] log.step = 500
>>     [java] max.buffered = buf:10:10:100:100
>>     [java] merge.factor = mrg:10:100:10:100
>>     [java] query.maker =
>> org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
>>     [java] task.max.depth.log = 2
>>     [java] work.dir = work
>>     [java] ---
>>     [java] > queries:
>>     [java] 0. TermQuery - body:salomon
>>     [java] 1. TermQuery - body:comex
>>     [java] 2. BooleanQuery - body:night body:trading
>>     [java] 3. BooleanQuery - body:japan body:sony
>>     [java] 4. PhraseQuery - body:"sony japan"
>>     [java] 5. PhraseQuery - body:"food needs"~3
>>     [java] 6. BooleanQuery - +body:"world bank"^2.0 +body:nigeria
>>     [java] 7. BooleanQuery - body:"world bank" -body:nigeria
>>     [java] 8. PhraseQuery - body:"ford credit"~5
>>     [java] 9. BooleanQuery - body:airline body:europe body:canada
>> body:destination
>>     [java] 10. BooleanQuery - body:long body:term body:pressure
>> body:trade body:ministers body:necessary body:current body:uruguay
>> body:round body:talks body:general body:agreement body:trade
>> body:tariffs body:gatt body:succeed
>>     [java] 11. SpanFirstQuery - spanFirst(body:ford, 5)
>>     [java] 12. SpanNearQuery - spanNear([body:night, body:trading], 4,
>> false)
>>     [java] 13. SpanNearQuery - spanNear([spanFirst(body:ford, 10),
>> body:credit], 10, false)
>>     [java] 14. WildcardQuery - body:fo*
>>     [java] > algorithm:
>>     [java] Seq {
>>     [java]     Rounds_4 {
>>     [java]         ResetSystemErase
>>     [java]         Populate {
>>     [java]             -CreateIndex
>>     [java]             MAddDocs_2000 {
>>     [java]                 AddDoc
>>     [java]             > * 2000
>>     [java]             -Optimize
>>     [java]             -CloseIndex
>>     [java]         }
>>     [java]         OpenReader
>>     [java]         SearchSameRdr_5000 {
>>     [java]             Search
>>     [java]         > * 5000
>>     [java]         CloseReader
>>     [java]         WarmNewRdr_50 {
>>     [java]             Warm
>>     [java]         > * 50
>>     [java]         SrchNewRdr_500 {
>>     [java]             Search
>>     [java]         > * 500
>>     [java]         SrchTrvNewRdr_300 {
>>     [java]             SearchTrav(1000.0)
>>     [java]         > * 300
>>     [java]         SrchTrvRetNewRdr_100 {
>>     [java]             SearchTravRet(2000.0)
>>     [java]         > * 100
>>     [java]         NewRound
>>     [java]     } * 4
>>     [java]     RepSumByName
>>     [java]     RepSumByPrefRound MAddDocs
>>     [java] }
>>     [java] > starting task: Seq
>>     [java] > starting task: Rounds_4
>>     [java] > starting task: Populate
>>     [java] 1.5 sec --> main added 500 docs
>>     [java] 2.12 sec --> main added 1000 docs
>>     [java] 2.51 sec --> main added 1500 docs
>>     [java] 2.88 sec --> main added 2000 docs
>>     [java] > starting task: OpenReader
>>     [java] > starting task: SearchSameRdr_5000
>>     [java] 3.3 sec --> main processed 500 records
>>     [java] 3.39 sec --> main processed 1000 records
>>     [java] 3.45 sec --> main processed 1500 records
>>     [java] 3.5 sec --> main processed 2000 records
>>     [java] 3.54 sec --> main processed 2500 records
>>     [java] 3.58 sec --> main processed 3000 records
>>     [java] 3.62 sec --> main processed 3500 records
>>     [java] 3.66 sec --> main processed 4000 records
>>     [java] 3.69 sec --> main processed 4500 records
>>     [java] 3.72 sec --> main processed 5000 records
>>     [java] > starting task: CloseReader
>>     [java] > starting task: WarmNewRdr_50
>>     [java] > starting task: SrchNewRdr_500
>>     [java] 
>>     [java] ###  D O N E !!! ###
>>     [java] 
>>     [java] Error: cannot execute the algorithm!
>> /home/mark/workspace/lucene/contrib/benchmark/work/index/_0.cfx (Too
>> many open files)
>>     [java] java.io.FileNotFoundException:
>> /home/mar

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785362#action_12785362
 ] 

Michael McCandless commented on LUCENE-1458:


bq. How about starting w/o reuse but leave a TODO saying we could/should 
investigate?

Actually, scratch that -- reuse is too hard in DBLRU -- I would say just no 
reuse now.  Trunk doesn't reuse either...

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-DocIdSetIterator.patch, 
> LUCENE-1458-DocIdSetIterator.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785360#action_12785360
 ] 

Michael McCandless commented on LUCENE-1458:


Patch looks good Uwe!

bq. MatchAllDocsQuery is now very simple to implement as a ConstantScoreQuery 
on top of a Filter that returns the DocsEnum of the supplied IndexReader as its 
iterator. Really cool.

Sweet!  Wait, using AllDocsEnum you mean?

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-DocIdSetIterator.patch, 
> LUCENE-1458-DocIdSetIterator.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To uns

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785356#action_12785356
 ] 

Michael McCandless commented on LUCENE-1458:


bq. Should we still try and do the reuse stuff, or should we just drop it and 
use the cache as it is now?

How about starting w/o reuse but leaving a TODO saying we could/should 
investigate?
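
For illustration, a minimal sketch of what "no reuse, plus a TODO" could look 
like, using a plain LinkedHashMap in access order as the LRU map; the key and 
value types here are stand-ins, not the patch's actual classes:

{code}
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in LRU terms cache: evicted entries are simply dropped, not reused.
class TermsCache<V> extends LinkedHashMap<String, V> {
  private final int maxSize;

  TermsCache(int maxSize) {
    super(16, 0.75f, true);  // accessOrder=true gives LRU eviction order
    this.maxSize = maxSize;
  }

  protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
    // TODO: investigate reusing the evicted entry instead of dropping it
    return size() > maxSize;
  }
}
{code}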

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-DocIdSetIterator.patch, 
> LUCENE-1458-DocIdSetIterator.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.or

[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785348#action_12785348
 ] 

Uwe Schindler commented on LUCENE-2074:
---

Thanks Steve.

Do you see a problem with just requiring JFlex 1.5 for Lucene trunk at the 
moment? It would help us to not require a Java 1.4 JRE just to convert the 
jflex files.

The new parsers (see patch) are pre-generated in SVN, so somebody compiling 
Lucene from source does not need to use jflex. And the parsers for 
StandardTokenizer are verified to work correctly and are even identical 
(DFA-wise) for the old Java 1.4 / Unicode 3.0 case.

> Use a separate JFlex generated Unicode 4 by Java 5 compatible 
> StandardTokenizer
> ---
>
> Key: LUCENE-2074
> URL: https://issues.apache.org/jira/browse/LUCENE-2074
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
> LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
> LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
> (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
> regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. 
> Because of that we should only use the new TokenizerImpl when 
> Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2106) Benchmark does not close its Reader when OpenReader/CloseReader are not used

2009-12-03 Thread Mark Miller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller updated LUCENE-2106:


Attachment: LUCENE-2106.patch

> Benchmark does not close its Reader when OpenReader/CloseReader are not used
> 
>
> Key: LUCENE-2106
> URL: https://issues.apache.org/jira/browse/LUCENE-2106
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/benchmark
>Reporter: Mark Miller
>Assignee: Mark Miller
> Attachments: LUCENE-2106.patch
>
>
> Only the Searcher is closed, but because the reader is passed to the 
> Searcher, the Searcher does not close the Reader, causing a resource leak.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2106) Benchmark does not close its Reader when OpenReader/CloseReader are not used

2009-12-03 Thread Mark Miller (JIRA)
Benchmark does not close its Reader when OpenReader/CloseReader are not used


 Key: LUCENE-2106
 URL: https://issues.apache.org/jira/browse/LUCENE-2106
 Project: Lucene - Java
  Issue Type: Bug
  Components: contrib/benchmark
Reporter: Mark Miller
Assignee: Mark Miller


Only the Searcher is closed, but because the reader is passed to the Searcher, 
the Searcher does not close the Reader, causing a resource leak.
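
A minimal illustration of the contract involved (hypothetical snippet, not the 
benchmark code itself; assumes some Directory dir): closing an IndexSearcher 
that was constructed from an IndexReader does not close that reader, so the 
reader must be closed explicitly.

{code}
IndexReader reader = IndexReader.open(dir, true);
IndexSearcher searcher = new IndexSearcher(reader);
try {
  // ... run search tasks ...
} finally {
  searcher.close();  // does NOT close the reader it was given
  reader.close();    // must be closed explicitly to release file handles
}
{code}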

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2074) Use a separate JFlex generated Unicode 4 by Java 5 compatible StandardTokenizer

2009-12-03 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785344#action_12785344
 ] 

Steven Rowe commented on LUCENE-2074:
-

bq. Will the old jflex fail on %unicode {x.y} syntax ???

I haven't tested it, but JFlex <1.5 likely will fail on this syntax, since 
nothing is expected after the %unicode directive.
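
Concretely, the two forms under discussion would look something like this in 
the .jflex options section (illustrative only; exact syntax per the JFlex 1.5 
docs):

{code}
%unicode        /* JFlex 1.4.x form: no argument allowed after the directive */
%unicode 5.0    /* JFlex 1.5 form: pins the Unicode version; <1.5 rejects it */
{code}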

bq. Hopefully JFlex 1.5 comes out before we release 3.1; I would be happy.

I think the JFlex 1.5 release will happen before March of next year, since 
we're down to just a few blocking issues.


> Use a separate JFlex generated Unicode 4 by Java 5 compatible 
> StandardTokenizer
> ---
>
> Key: LUCENE-2074
> URL: https://issues.apache.org/jira/browse/LUCENE-2074
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
> Fix For: 3.1
>
> Attachments: jflex-1.4.1-vs-1.5-snapshot.diff, jflexwarning.patch, 
> LUCENE-2074-lucene30.patch, LUCENE-2074.patch, LUCENE-2074.patch, 
> LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch, LUCENE-2074.patch
>
>
> The current trunk version of StandardTokenizerImpl was generated by Java 1.4 
> (according to the warning). In Lucene 3.0 we switch to Java 1.5, so we should 
> regenerate the file.
> After regeneration the Tokenizer behaves differently for some characters. 
> Because of that we should only use the new TokenizerImpl when 
> Version.LUCENE_30 or LUCENE_31 is used as matchVersion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1458:
--

Attachment: LUCENE-1458-DocIdSetIterator.patch

Updated patch: 

I did a search on "AttributeSource" in index package. I now also replaced the 
"extends AttributeSource" by a lazy init in in FieldsEnum and PositionsEnum. So 
all enums have an attributes() method that lazy inits an AttributeSource. When 
attributes get interesting a custom DocsEnum could just use 
attributes().addAttribute(XYZ.class) in its ctor and store the reference 
locally. attributes() is final (to be safe, when called by ctor).

Perhaps add an interface AttributeAble *g* that is implemented by all these 
enums and by anything else using AttributeSource that may need to be lazily 
initialized.
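
A self-contained sketch of the lazy-init idiom described above (class and 
attribute names are illustrative, not this patch's actual API):

{code}
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.AttributeSource;

abstract class SomeEnum {
  private AttributeSource atts;  // not allocated until first use

  // final, so it is safe to call from a subclass ctor
  public final AttributeSource attributes() {
    if (atts == null) atts = new AttributeSource();
    return atts;
  }
}

class PayloadDocsEnum extends SomeEnum {
  private final PayloadAttribute payloadAtt;

  PayloadDocsEnum() {
    // addAttribute() in the ctor, reference stored locally
    payloadAtt = attributes().addAttribute(PayloadAttribute.class);
  }
}
{code}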

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-DocIdSetIterator.patch, 
> LUCENE-1458-DocIdSetIterator.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
>

[jira] Resolved: (LUCENE-2105) Lucene does not support Unicode Normalization Forms

2009-12-03 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-2105.
-

Resolution: Duplicate

Duplicate of LUCENE-1215 (JDK 6 Impl) and LUCENE-1488 (ICU Impl)

> Lucene does not support Unicode Normalization Forms
> ---
>
> Key: LUCENE-2105
> URL: https://issues.apache.org/jira/browse/LUCENE-2105
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 3.0
>Reporter: Alexander Veit
>
> Lucene should bring terms into their Unicode normalization form 
> (http://unicode.org/reports/tr15/), probably NFKC.
> E.g., currently words that contain ligatures such as "fi", "fl", "ff", or 
> "ffl" cannot be found in certain documents (try to find "undefined" in 
> http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2105) Lucene does not support Unicode Normalization Forms

2009-12-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785341#action_12785341
 ] 

Robert Muir commented on LUCENE-2105:
-

Right, there is a Filter in LUCENE-1488 for efficient Unicode normalization. It 
implements .quickCheck() and works on char[].

The only other alternative is the JDK6 impl, which would be a lot less 
efficient: String-based, with only .isNormalized() and no .quickCheck().

If people want me to break up LUCENE-1488 into smaller pieces and do them one 
piece at a time, we could go this route because the NormalizationFilter there 
IMHO is very clear, efficient, and will not change.

On the other hand I like the idea of consistency in solving that issue as a 
whole, as Normalization interacts with other processes such as Case Folding.


> Lucene does not support Unicode Normalization Forms
> ---
>
> Key: LUCENE-2105
> URL: https://issues.apache.org/jira/browse/LUCENE-2105
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Analysis
>Affects Versions: 3.0
>Reporter: Alexander Veit
>
> Lucene should bring terms into their Unicode normalization form 
> (http://unicode.org/reports/tr15/), probably NFKC.
> E.g., currently words that contain ligatures such as "fi", "fl", "ff", or 
> "ffl" cannot be found in certain documents (try to find "undefined" in 
> http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf). 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Uwe Schindler (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-1458:
--

Attachment: LUCENE-1458-DocIdSetIterator.patch

Here is the patch with the DocsEnum refactoring.

With this patch, MatchAllDocsQuery is now very simple to implement as a 
ConstantScoreQuery on top of a Filter that returns the DocsEnum of the supplied 
IndexReader as its iterator. Really cool.
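
A rough sketch of the idea, written against the stable 3.0 API for 
illustration; with the patch, the hand-rolled iterator below would simply be 
the supplied reader's own DocsEnum:

{code}
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Filter;

// Filter matching every non-deleted doc; ConstantScoreQuery does the rest:
// new ConstantScoreQuery(new MatchAllFilter())
class MatchAllFilter extends Filter {
  public DocIdSet getDocIdSet(final IndexReader reader) {
    return new DocIdSet() {
      public DocIdSetIterator iterator() {
        return new DocIdSetIterator() {
          private int doc = -1;
          public int docID() { return doc; }
          public int nextDoc() {
            do {
              doc++;
            } while (doc < reader.maxDoc() && reader.isDeleted(doc));
            return doc < reader.maxDoc() ? doc : NO_MORE_DOCS;
          }
          public int advance(int target) {
            doc = target - 1;
            return nextDoc();
          }
        };
      }
    };
  }
}
{code}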

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-DocIdSetIterator.patch, 
> LUCENE-1458-MTQ-BW.patch, LUCENE-1458-NRQ.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458_rotate.patch, 
> LUCENE-1458_sortorder_bwcompat.patch, LUCENE-1458_termenum_bwcompat.patch, 
> UnicodeTestCase.patch, UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For addi

[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785314#action_12785314
 ] 

Uwe Schindler edited comment on LUCENE-1458 at 12/3/09 1:36 PM:


bq. It'd be great if we could find a way to do this without a big hairball of 
back compat code

DocsEnum is a new class, why not fit it from the beginning as DocIdSetIterator? 
In my opinion, as pointed out above, the AttributeSource stuff should go in as 
a lazy-init member behind getAttributes() / attributes().

So I would define it as:

{code}
public abstract class DocsEnum extends DocIdSetIterator {
  private AttributeSource atts = null;
  public int freq()
  public DontKnowClassName positions()
  public final AttributeSource attributes() {
   if (atts==null) atts=new AttributeSource();
   return atts;
  }
  ...default impl of the bulk access using the abstract methods from 
DocIdSetIterator
}
{code}


  was (Author: thetaphi):
bq. It'd be great if we could find a way to do this without a big hairball 
of back compat code

DocsEnum is a new class, why not fit it from the beginning as DocIdSetIterator? 
In my opinion, as pointed out above, the AttributeSource stuff should go in as 
a lazy-init member behind getAttributes() / attributes().

So I would define it as:

{code}
public abstract class DocsEnum extends DocIdSetIterator {
  private AttributeSource atts = null;
  public int freq()
  public DontKnowClassName positions()
  public AttributeSource attributes() {
   if (atts==null) atts=new AttributeSource();
   return atts;
  }
  ...default impl of the bulk access using the abstract methods from 
DocIdSetIterator
}
{code}

The same stra
  
> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code

[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785314#action_12785314
 ] 

Uwe Schindler edited comment on LUCENE-1458 at 12/3/09 1:34 PM:


bq. It'd be great if we could find a way to do this without a big hairball of 
back compat code

DocsEnum is a new class, why not fit it from the beginning as DocIdSetIterator? 
In my opinion, as pointed out above, the AttributeSource stuff should go in as 
a lazy-init member behind getAttributes() / attributes().

So I would define it as:

{code}
public abstract class DocsEnum extends DocIdSetIterator {
  private AttributeSource atts = null;
  public int freq()
  public DontKnowClassName positions()
  public AttributeSource attributes() {
   if (atts==null) atts=new AttributeSource();
   return atts;
  }
  ...default impl of the bulk access using the abstract methods from 
DocIdSetIterator
}
{code}

The same stra

  was (Author: thetaphi):
bq. It'd be great if we could find a way to do this without a big hairball 
of back compat code

DocsEnum is a new class, why not fit it from the beginning as DocIdSetIterator? 
In my opinion, as pointed out above, the AttributeSource stuff should go in as 
a lazy-init member behind getAttributes() / attributes().
  
> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, p

"too many open files" on micro benchmark

2009-12-03 Thread Mark Miller
Anyone else seeing this?

Now when I try and run the micro-benchmark on trunk or flex branch, a
few seconds in, I get :

 [java] Running algorithm from:
/home/mark/workspace/lucene/contrib/benchmark/conf/micro-standard.alg
 [java] > config properties:
 [java] analyzer = org.apache.lucene.analysis.standard.StandardAnalyzer
 [java] compound = true
 [java] content.source =
org.apache.lucene.benchmark.byTask.feeds.ReutersContentSource
 [java] directory = FSDirectory
 [java] doc.stored = true
 [java] doc.term.vector = false
 [java] doc.tokenized = true
 [java] docs.dir = reuters-out
 [java] log.queries = true
 [java] log.step = 500
 [java] max.buffered = buf:10:10:100:100
 [java] merge.factor = mrg:10:100:10:100
 [java] query.maker =
org.apache.lucene.benchmark.byTask.feeds.ReutersQueryMaker
 [java] task.max.depth.log = 2
 [java] work.dir = work
 [java] ---
 [java] > queries:
 [java] 0. TermQuery - body:salomon
 [java] 1. TermQuery - body:comex
 [java] 2. BooleanQuery - body:night body:trading
 [java] 3. BooleanQuery - body:japan body:sony
 [java] 4. PhraseQuery - body:"sony japan"
 [java] 5. PhraseQuery - body:"food needs"~3
 [java] 6. BooleanQuery - +body:"world bank"^2.0 +body:nigeria
 [java] 7. BooleanQuery - body:"world bank" -body:nigeria
 [java] 8. PhraseQuery - body:"ford credit"~5
 [java] 9. BooleanQuery - body:airline body:europe body:canada
body:destination
 [java] 10. BooleanQuery - body:long body:term body:pressure
body:trade body:ministers body:necessary body:current body:uruguay
body:round body:talks body:general body:agreement body:trade
body:tariffs body:gatt body:succeed
 [java] 11. SpanFirstQuery - spanFirst(body:ford, 5)
 [java] 12. SpanNearQuery - spanNear([body:night, body:trading], 4,
false)
 [java] 13. SpanNearQuery - spanNear([spanFirst(body:ford, 10),
body:credit], 10, false)
 [java] 14. WildcardQuery - body:fo*
 [java] > algorithm:
 [java] Seq {
 [java] Rounds_4 {
 [java] ResetSystemErase
 [java] Populate {
 [java] -CreateIndex
 [java] MAddDocs_2000 {
 [java] AddDoc
 [java] > * 2000
 [java] -Optimize
 [java] -CloseIndex
 [java] }
 [java] OpenReader
 [java] SearchSameRdr_5000 {
 [java] Search
 [java] > * 5000
 [java] CloseReader
 [java] WarmNewRdr_50 {
 [java] Warm
 [java] > * 50
 [java] SrchNewRdr_500 {
 [java] Search
 [java] > * 500
 [java] SrchTrvNewRdr_300 {
 [java] SearchTrav(1000.0)
 [java] > * 300
 [java] SrchTrvRetNewRdr_100 {
 [java] SearchTravRet(2000.0)
 [java] > * 100
 [java] NewRound
 [java] } * 4
 [java] RepSumByName
 [java] RepSumByPrefRound MAddDocs
 [java] }
 [java] > starting task: Seq
 [java] > starting task: Rounds_4
 [java] > starting task: Populate
 [java] 1.5 sec --> main added 500 docs
 [java] 2.12 sec --> main added 1000 docs
 [java] 2.51 sec --> main added 1500 docs
 [java] 2.88 sec --> main added 2000 docs
 [java] > starting task: OpenReader
 [java] > starting task: SearchSameRdr_5000
 [java] 3.3 sec --> main processed 500 records
 [java] 3.39 sec --> main processed 1000 records
 [java] 3.45 sec --> main processed 1500 records
 [java] 3.5 sec --> main processed 2000 records
 [java] 3.54 sec --> main processed 2500 records
 [java] 3.58 sec --> main processed 3000 records
 [java] 3.62 sec --> main processed 3500 records
 [java] 3.66 sec --> main processed 4000 records
 [java] 3.69 sec --> main processed 4500 records
 [java] 3.72 sec --> main processed 5000 records
 [java] > starting task: CloseReader
 [java] > starting task: WarmNewRdr_50
 [java] > starting task: SrchNewRdr_500
 [java] 
 [java] ###  D O N E !!! ###
 [java] 
 [java] Error: cannot execute the algorithm!
/home/mark/workspace/lucene/contrib/benchmark/work/index/_0.cfx (Too
many open files)
 [java] java.io.FileNotFoundException:
/home/mark/workspace/lucene/contrib/benchmark/work/index/_0.cfx (Too
many open files)
 [java] at java.io.RandomAccessFile.open(Native Method)
 [java] at
java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
 [java] at
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput$Descriptor.<init>(SimpleFSDirectory.java:76)
 [java] at
org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirec

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785314#action_12785314
 ] 

Uwe Schindler commented on LUCENE-1458:
---

bq. It'd be great if we could find a way to do this without a big hairball of 
back compat code

DocsEnum is a new class, why not fit it from the beginning as DocIdSetIterator? 
In my opinion, as pointed out above, the AttributeSource stuff should go in as 
a lazy-init member behind getAttributes() / attributes().

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mai

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785312#action_12785312
 ] 

Mark Miller commented on LUCENE-1458:
-

RE: the terms cache

Should we still try and do the reuse stuff, or should we just drop it and use 
the cache as it is now? (eg reusing the object that is removed, if one is 
removed) Looks like that would be harder to get done now.

> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h.

[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785312#action_12785312
 ] 

Mark Miller edited comment on LUCENE-1458 at 12/3/09 1:29 PM:
--

RE: the terms cache

Should we still try and do the reuse stuff, or should we just drop it and use 
the cache as it is now? (eg reusing the object that is removed, if one is 
removed) Looks like that would be harder to get done now.

  was (Author: markrmil...@gmail.com):
RE: the terms cache

Should and still try and do the reuse stuff, or should we just drop it and use 
the cache as it is now? (eg reusing the object that is removed, if one is 
removed) Looks like that would be harder to get done now.
  
> Further steps towards flexible indexing
> ---
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-MTQ-BW.patch, 
> LUCENE-1458-NRQ.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458_rotate.patch, LUCENE-1458_sortorder_bwcompat.patch, 
> LUCENE-1458_termenum_bwcompat.patch, UnicodeTestCase.patch, 
> UnicodeTestCase.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo).  At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas.  Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array).  It should be faster to init too.
> .
> This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers.  EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> old API on top of the new API to keep back-compat.
> 
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility).  EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
>  

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785310#action_12785310
 ] 

Michael McCandless commented on LUCENE-1458:


bq. getAttributes() returning it and dynamically instantiating would be an 
idea. The same applies for TermsEnum, it should be separated for lazy init.

That's a good point (avoid cost of creating the AttributeSource) -- that makes 
complete sense.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785308#action_12785308
 ] 

Michael McCandless commented on LUCENE-1458:


bq. DocsEnum should extend DocIdSetIterator

It'd be great if we could find a way to do this without a big hairball of back 
compat code ;)  They are basically the same, except DocsEnum lets you get 
freq() for each doc, get the PositionsEnum positions(), and also provides a 
bulk read API (w/ default impl).
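
As a rough sketch of that shape, using the freq(), positions(), and bulk-read
methods named above (the read() signature and the default implementation are
my assumptions; none of this is committed API):

{code}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch: DocsEnum as an abstract class on top of DocIdSetIterator,
// adding freq(), positions(), and a default bulk-read implementation.
public abstract class DocsEnum extends DocIdSetIterator {

  /** Term frequency in the current document. */
  public abstract int freq() throws IOException;

  /** Positions of the term in the current document (hypothetical type). */
  public abstract PositionsEnum positions() throws IOException;

  /** Default bulk read: fills the arrays, returns how many docs were read. */
  public int read(int[] docs, int[] freqs) throws IOException {
    int count = 0;
    while (count < docs.length && nextDoc() != NO_MORE_DOCS) {
      docs[count] = docID();
      freqs[count] = freq();
      count++;
    }
    return count;
  }
}
{code}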


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-1458) Further steps towards flexible indexing

2009-12-03 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785303#action_12785303
 ] 

Uwe Schindler edited comment on LUCENE-1458 at 12/3/09 1:24 PM:


One thing I came across a long time ago, but now with the new API it gets 
interesting again:

DocsEnum should extend DocIdSetIterator; that would make it simpler to use and 
implement, e.g. in MatchAllDocsQuery's Scorer, FieldCacheRangeFilter, and so 
on. You could, for example, write a filter for all documents that simply 
returns the docs enumeration from IndexReader.

So it should be an abstract class that extends DocIdSetIterator. It has the 
same methods; only some of them would need to be renamed slightly. The problem 
is that, because Java does not support multiple inheritance, we cannot also 
extend AttributeSource :-( Were DocIdSetIterator an interface, it would work 
(this is one of the cases where interfaces can be used for really simple 
patterns, like iterators).

*EDIT*

Maybe an idea would be to provide a method asDocIdSetIterator(), if the 
multiple-inheritance problem cannot be fixed. Or have the AttributeSource as a 
member field, which would be good, as it then only needs to be created on 
first access (constructing an AttributeSource is costly). getAttributes() 
returning it and instantiating it dynamically would be an idea. The same 
applies to TermsEnum; it should be separated out for lazy init.
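
Concretely, that adapter idea could look roughly like the following sketch,
under the assumption that DocsEnum exposes docID()/nextDoc()/advance()
counterparts (the wrapper class and all names are illustrative):

{code}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch: expose a DocsEnum as a DocIdSetIterator view rather than extending
// DocIdSetIterator, leaving DocsEnum free to extend AttributeSource instead.
public final class DocsEnumAdapter {
  private DocsEnumAdapter() {}

  /** Views a DocsEnum as a DocIdSetIterator without inheritance. */
  public static DocIdSetIterator asDocIdSetIterator(final DocsEnum docs) {
    return new DocIdSetIterator() {
      @Override
      public int docID() { return docs.docID(); }

      @Override
      public int nextDoc() throws IOException { return docs.nextDoc(); }

      @Override
      public int advance(int target) throws IOException {
        return docs.advance(target);
      }
    };
  }
}
{code}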

