[jira] Commented: (LUCENE-1487) FieldCacheTermsFilter

2008-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655942#action_12655942
 ] 

Michael McCandless commented on LUCENE-1487:


I think this is a useful filter impl, and a nice companion to FCRF.
I'd like to see it committed; formatting & test case are good next
steps.

TermsFilter (in contrib/queries) does the same thing, but creates a
bitset by docID up front by walking the TermDocs for each term.  An OR
query, wrapped in QueryWrapperFilter, is another way.

This impl uses FieldCache to create a bitset by term number and then
does a scan by docID, so it has different performance tradeoffs: for
"enum" fields (far more docs than unique terms -- like country, state,
etc.) it's fast to create this filter, and then applying the filter is
O(maxDocs) with a small constant factor.

I think for many apps it means you do not have to cache the filter
because creating & using it "on the fly" is plenty fast.
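
For illustration, a minimal sketch of the approach described above, written against
the 2.4-era Filter.bits/FieldCache APIs (the class below is made up and is not the
attached FieldCacheTermsFilter.java):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.BitSet;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Filter;

    public class SimpleFieldCacheTermsFilter extends Filter {
      private final String field;
      private final Set terms;

      public SimpleFieldCacheTermsFilter(String field, String[] terms) {
        this.field = field;
        this.terms = new HashSet(Arrays.asList(terms));
      }

      // bits() is the 2.4-era (deprecated) Filter entry point; getDocIdSet would be the newer one.
      public BitSet bits(IndexReader reader) throws IOException {
        FieldCache.StringIndex index = FieldCache.DEFAULT.getStringIndex(reader, field);

        // Mark the ords (term numbers) of the wanted terms; lookup[0] is the "no term" slot.
        BitSet allowedOrds = new BitSet(index.lookup.length);
        for (int ord = 1; ord < index.lookup.length; ord++) {
          if (terms.contains(index.lookup[ord])) {
            allowedOrds.set(ord);
          }
        }

        // One O(maxDoc) scan: a doc matches if its term's ord is in the allowed set.
        BitSet result = new BitSet(reader.maxDoc());
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
          if (allowedOrds.get(index.order[doc])) {
            result.set(doc);
          }
        }
        return result;
      }
    }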



> FieldCacheTermsFilter
> -
>
> Key: LUCENE-1487
> URL: https://issues.apache.org/jira/browse/LUCENE-1487
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.4
>Reporter: Tim Sturge
> Fix For: 2.9
>
> Attachments: FieldCacheTermsFilter.java
>
>
> This is a companion to FieldCacheRangeFilter except it operates on a set of 
> terms rather than a range. It works best when the set is comparatively large 
> or the terms are comparatively common.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1487) FieldCacheTermsFilter

2008-12-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1487:
--

Assignee: Michael McCandless

> FieldCacheTermsFilter
> -
>
> Key: LUCENE-1487
> URL: https://issues.apache.org/jira/browse/LUCENE-1487
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.4
>Reporter: Tim Sturge
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: FieldCacheTermsFilter.java
>
>
> This is a companion to FieldCacheRangeFilter except it operates on a set of 
> terms rather than a range. It works best when the set is comparatively large 
> or the terms are comparatively common.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



2.9/3.0 plan & Java 1.5

2008-12-12 Thread Michael McCandless
Taking this to java-dev (off Jira)...

Mark Miller (Jira) wrote:

> I thought there were some that wanted to change some of the API to java
> 5 for the 3.0 release, cause I thought back compat was less restricted
> 2-3. I guess maybe that won't end up happening; if it was going to, it
> seems we'd want to deprecate what will be changed in 2.9.

I could easily be confused on this... but I thought 3.0 is the first
release that's allowed to include Java 1.5 only APIs (eg generics).

Meaning, we could in theory intro APIs with generics with 3.0,
deprecating the non-generics versions, and then 4.0 (sounds insanely
far away!) would be the first release that could remove the deprecated
non-generics versions?

That said, I think the "plan" is to release 2.9 soonish (early next
year?), and then fairly quickly turnaround a 3.0 that doesn't have too
many changes except the removal of the deprecated (in 2.9) APIs.  Ie
in practice it won't be until 3.1 when we would intro new
(generics-based) APIs.

Mike


[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655955#action_12655955
 ] 

Michael McCandless commented on LUCENE-1483:


Mark, one hunk (HitCollector) failed when applying the patch -- looks like 
it's the $Id$ issue again (your area doesn't expand $Id$ tags).  No problem -- 
I just applied it manually.

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1486:
---

Fix Version/s: 2.9

(Added 2.9 fix version in addition to 2.4.1).

> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655957#action_12655957
 ] 

Michael McCandless commented on LUCENE-1483:


bq. adding a return type to the collect/hit method? ... ie: an enum style 
result indicating "OK" or "ABORT" (with the potential of adding additional 
constants later ala FieldSelectorResult)

I think we should consider this, though it then implies an if stmt checking 
the return result & doing something on each hit, so we should test the cost of 
doing so vs the cost of throwing an exception instead (eg we could define a 
typed exception in this new interface which means "abort the search now", and 
maybe another to mean "stop searching & return the results you got so far", 
etc.).
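
For illustration only, a rough sketch of what the "enum style" return idea could look
like (every name below is hypothetical; nothing here is from the patch):

    public interface ResultCollector {

      // "Enum style" result constants; more could be added later, ala FieldSelectorResult.
      public static final int OK = 0;
      public static final int ABORT = 1;

      // Called once per hit; the return value tells the search loop whether to continue.
      int collect(int doc, float score);
    }

The scoring loop would then pay an if per hit, roughly:

    while (scorer.next()) {
      if (collector.collect(scorer.doc(), scorer.score()) == ResultCollector.ABORT) {
        break;   // abort the search now
      }
    }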

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-12 Thread Wolf Siberski (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655944#action_12655944
 ] 

Wolf Siberski commented on LUCENE-1473:
---

Thanks to Doug and Jason for your constructive feedback. Let me first clarify 
the purpose and scope of the patch. IMHO, the discussion about Serialization in 
Lucene is not clear-cut at all. My opinion is that moving all 
distribution-related code out of the core leads to a cleaner separation of 
concerns and thus is better design. On the other hand with removing 
Serializable we limit the Lucene application space at least a bit (e.g., no 
support for dynamic class loading), and abandon the advantages default Java 
serialization offers. Therefore the patch is to be taken as a contribution to 
explore the design space (as Michael's patch on custom readers explored the 
Serializable option), and not as a full-fledged solution proposal.

> [Doug] The removal of Serializable will break compatibility, so must be 
> well-advertised.
Sure. I removed Serializable to catch all related errors; this was not meant as 
a proposal for a final patch.

>  [Doug] The Searchable API was designed for remote use and does not include 
> HitCollector-based access.
Currently Searchable does include a HitCollector-based search method, although 
the comment says that 'HitCollector-based access to remote indexes is 
discouraged'. The only reason to provide an implementation is that I wanted to 
keep the Searchable contract. Is remote access the only purpose of 
Searchable/MultiSearcher? Is it ok to break compatibility with respect to these 
classes? IMHO a significant fraction of the current clumsiness in the remote 
package stems from my attempt to fully preserve the Searchable API.
 
>  [Doug] Weighting, and hence ranking, does not appear to be implemented 
> correctly by this patch. 
True, I was a bit too fast here. We could either solve it along the lines you 
propose, or revert to passing the Weight again instead of the Query. The issue 
IMHO is orthogonal to the Serializable discussion and more related to the 
question of what a good remote search interface and protocol should look like.

> [Jason] Restricting people to XML will probably not be suitable though.
The patch does not limit serialization to XML. It just requires that encoding 
to and decoding from String is implemented, no matter how. I used XML/XStream 
as a proof-of-concept implementation, but don't propose to make XML mandatory. 
The main reason for introducing the Serializer interface was to emphasize 
that XML/XStream is just one implementation option. Actually, the current 
approach feels like at least one indirection more than required; for a final 
solution I would try to come up with a better design.
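
As a rough guess at the shape of such a String-based abstraction (the names below are
invented for illustration and are not taken from the attached patch):

    public interface Serializer {
      // Encode a query (or other search object) into a String in some wire format.
      String toWire(Object object);

      // Decode the String produced by toWire() back into an object.
      Object fromWire(String data);
    }

An XStream-backed implementation would produce XML, but JSON, a hand-rolled format, or
anything else that round-trips through a String would satisfy the same contract.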

> [Jason] It seems the alternative solutions to serialization simply shift the 
> problem around but do not really solve 
> the underlying issues (speed, versioning, writing custom serialization code, 
> and perhaps dynamic classloading).
In a sense, the problem is indeed 'only' shifted around and not yet solved. The 
good thing about this shift is that Lucene core becomes decoupled from these 
issues. The only real limitation I see is that dynamic classloading can't be 
realized anymore. 

With respect to speed, I don't think that encoding/decoding is a significant 
performance factor in distributed search, but this would need to be 
benchmarked. With respect to versioning, my patch still keeps all options open. 
More importantly, Lucene users can now decide whether they need compatibility 
between different versions, and roll their own encoding/decoding if they do. 
Of course, if they are willing to contribute and maintain custom 
serializers which preserve back compatibility, they can do it in contrib as 
well as they could have done it in the core. Custom serialization is still 
possible although the standard Java serialization framework can't be used 
anymore for that purpose, and I admit that this is a disadvantage.

> Implement standard Serialization across Lucene versions
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: custom-externalizable-reader.patch, LUCENE-1473.patch, 
> LUCENE-1473.patch, LUCENE-1473.patch, LUCENE-1473.patch, 
> lucene-contrib-remote.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions, 
> serialVersionUID needs to be added to classes that implement 
> java.io.Serializable.  java.io.Externalizable may be implemented in classes 
> for faster performance.


[jira] Updated: (LUCENE-1490) CJKTokenizer convert HALFWIDTH_AND_FULLWIDTH_FORMS wrong

2008-12-12 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1490:
---

Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
Fix Version/s: 2.9

(Adding trunk release (2.9) to fix version too)

> CJKTokenizer convert   HALFWIDTH_AND_FULLWIDTH_FORMS wrong
> --
>
> Key: LUCENE-1490
> URL: https://issues.apache.org/jira/browse/LUCENE-1490
> Project: Lucene - Java
>  Issue Type: Bug
>Reporter: Daniel Cheng
> Fix For: 2.4, 2.9
>
>
> CJKTokenizer has these lines:
>     if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS) {
>         /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
>         int i = (int) c;
>         i = i - 65248;
>         c = (char) i;
>     }
> This is wrong. Some characters in the block (e.g. U+FF68) have no BASIC_LATIN 
> counterparts.
> Only 65281-65374 can be converted this way.
> The fix is:
>     if (ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
>             && i <= 65474 && i > 65281) {
>         /** convert HALFWIDTH_AND_FULLWIDTH_FORMS to BASIC_LATIN */
>         int i = (int) c;
>         i = i - 65248;
>         c = (char) i;
>     }
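
For illustration, the range check below follows the "only 65281-65374" note in the
description (the 65474 in the posted fix looks like a typo); this is just a sketch,
not the actual CJKTokenizer patch:

    final class FullwidthNormalizer {
      // U+FF01..U+FF5E (65281..65374) are the fullwidth forms of U+0021..U+007E;
      // subtracting 0xFEE0 (65248) maps them onto BASIC_LATIN.  Other
      // HALFWIDTH_AND_FULLWIDTH_FORMS characters (e.g. U+FF68) have no such
      // counterpart and must be left unchanged.
      static char normalize(char c) {
        if (c >= 0xFF01 && c <= 0xFF5E) {
          return (char) (c - 0xFEE0);
        }
        return c;
      }
    }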

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: 2.9/3.0 plan & Java 1.5

2008-12-12 Thread Mark Miller

Michael McCandless wrote:


Taking this to java-dev (off Jira)...

Mark Miller (Jira) wrote:

> I thought there were some that wanted to change some of the API to java
> 5 for the 3.0 release, cause I thought back compat was less restricted
> 2-3. I guess maybe that won't end up happening; if it was going to, it
> seems we'd want to deprecate what will be changed in 2.9.

I could easily be confused on this... but I thought 3.0 is the first
release that's allowed to include Java 1.5 only APIs (eg generics).

Meaning, we could in theory intro APIs with generics with 3.0,
deprecating the non-generics versions, and then 4.0 (sounds insanely
far away!) would be the first release that could remove the deprecated
non-generics versions?

That said, I think the "plan" is to release 2.9 soonish (early next
year?), and then fairly quickly turnaround a 3.0 that doesn't have too
many changes except the removal of the deprecated (in 2.9) APIs.  Ie
in practice it won't be until 3.1 when we would intro new
(generics-based) APIs.

Mike
Okay, that makes sense. I guess we have to give something to move to if 
we deprecate ;)


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: 2.9/3.0 plan & Java 1.5

2008-12-12 Thread Ryan McKinley


On Dec 12, 2008, at 5:18 AM, Michael McCandless wrote:



Taking this to java-dev (off Jira)...

Mark Miller (Jira) wrote:

> I thought there were some that wanted to change some of the API to java
> 5 for the 3.0 release, cause I thought back compat was less restricted
> 2-3. I guess maybe that won't end up happening; if it was going to, it
> seems we'd want to deprecate what will be changed in 2.9.

I could easily be confused on this... but I thought 3.0 is the first
release that's allowed to include Java 1.5 only APIs (eg generics).

Meaning, we could in theory intro APIs with generics with 3.0,
deprecating the non-generics versions, and then 4.0 (sounds insanely
far away!) would be the first release that could remove the deprecated
non-generics versions?

That said, I think the "plan" is to release 2.9 soonish (early next
year?), and then fairly quickly turnaround a 3.0 that doesn't have too
many changes except the removal of the deprecated (in 2.9) APIs.  Ie
in practice it won't be until 3.1 when we would intro new
(generics-based) APIs.




What are examples of the deprecated non-generic APIs?

My understanding would be that in 2.9 we have:
 public void function( List list );
and in 3.0
 public void function( List<...> list );

How do you keep both functions around?

ryan

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655976#action_12655976
 ] 

Michael McCandless commented on LUCENE-1483:


Duh: I just realized that when we switched back to a single pqueue for
gathering results across the N subreaders, we lost the original
intended "benefit" for this issue.  Hard to keep the forrest in mind
when looking at all the trees

Ie, we are now (again) creating a single FieldSortedHitQueue, which
pulls the FieldCache for the entire MultiReader, not per-segment.  So
warming time is still slow, when sorting by fields.

Really we've "stumbled" on 2 rather different optimizations:

  # Run Scorer at the "sub reader" level: this gains performance
because you save the cost of going through a MultiReader.  This
requires the new DocCollector class, so we can setDocBase(...).
  # Do collection (sort comparison w/ pqueue) at the "sub reader"
level: this gains warming performance because we only ask for
FieldCache for each subreader.  But, it seems to hurt search
performance (pqueue comparison & insertion cost went up), so it's
no longer a no-brainer tradeoff (by default at least).

Given that #1 has emerged as a tentatively fairly compelling gain, I
now think we should decouple it from #2.  Even though #2 was the
original intent here, let's now morph this issue into addressing #1
(since that's what current patch does), and I'll open a new issue for
#2?
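
(For illustration, a rough sketch of what the per-segment collection in #1 implies; the
names below are guesses at the idea, not the patch's actual DocCollector API: the
searcher announces each sub-reader's doc base, and segment-relative docIDs are rebased
before being recorded.)

    public abstract class PerSegmentCollector {
      protected int docBase;

      // Called by the searcher before it starts scoring the next sub-reader.
      public void setDocBase(int docBase) {
        this.docBase = docBase;
      }

      // Called once per hit, with a docID relative to the current sub-reader.
      public abstract void collect(int doc, float score);
    }

A concrete collector would record docBase + doc to get back into the top-level
reader's doc space.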


> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: 2.9/3.0 plan & Java 1.5

2008-12-12 Thread Michael McCandless


Ryan McKinley wrote:



On Dec 12, 2008, at 5:18 AM, Michael McCandless wrote:



Taking this to java-dev (off Jira)...

Mark Miller (Jira) wrote:

> I thought there were some that wanted to change some of the API to java
> 5 for the 3.0 release, cause I thought back compat was less restricted
> 2-3. I guess maybe that won't end up happening; if it was going to, it
> seems we'd want to deprecate what will be changed in 2.9.

I could easily be confused on this... but I thought 3.0 is the first
release that's allowed to include Java 1.5 only APIs (eg generics).

Meaning, we could in theory intro APIs with generics with 3.0,
deprecating the non-generics versions, and then 4.0 (sounds insanely
far away!) would be the first release that could remove the deprecated
non-generics versions?

That said, I think the "plan" is to release 2.9 soonish (early next
year?), and then fairly quickly turnaround a 3.0 that doesn't have too
many changes except the removal of the deprecated (in 2.9) APIs.  Ie
in practice it won't be until 3.1 when we would intro new
(generics-based) APIs.




What are examples of the deprecated non-generic APIs?

My understanding would be that in 2.9 we have:
public void function( List list );
and in 3.0
 public void function( List<...> list );

How do you keep both functions around?


We'd have to change the name?  Or deprecate the whole class containing  
these methods (if there are lots of methods to deprecate)?  Definitely  
something of a hassle.


Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: 2.9/3.0 plan & Java 1.5

2008-12-12 Thread Shai Erera
I wonder why we even have to deprecate ...
A method like public void function( List<...> list ) changes nothing in
terms of API. When people move to 3.0, they'll have to change their JDK
anyway to 5 (if they haven't already done so). Which means they had code
like function(List), where List was not defined with generics. But they'll get
a warning from the compiler anyway, when they define the List, that it's not safe
to create a list w/o defining its type.
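
(A tiny illustration of the erasure point, with an invented class and method: a raw and
a generified overload of the same method cannot coexist, which is why generifying in
place -- rather than keeping both -- is the only practical option.)

    import java.util.List;

    public class Example {
      // public void function(List list) { }         // raw 2.x-style signature
      public void function(List<String> list) { }    // generified 3.x-style signature

      // Declaring both is a compile error ("...have the same erasure"), while raw
      // callers of the generified method still compile, just with an "unchecked" warning.
    }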

I think that when you move to 5 you have to change a lot of your code
anyway, so simply changing the Lucene API will not create too much of a
hassle for existing applications.

Personally I'd hate to find out I have to change my entire application
because method/classes names were changed.

Shai

On Fri, Dec 12, 2008 at 1:44 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

>
> Ryan McKinley wrote:
>
>
>> On Dec 12, 2008, at 5:18 AM, Michael McCandless wrote:
>>
>>
>>> Taking this to java-dev (off Jira)...
>>>
>>> Mark Miller (Jira) wrote:
>>>
>>> > I thought there were some that wanted to change some of the API to java
>>> > 5 for the 3.0 release, cause I thought back compat was less restricted
>>> > 2-3. I guess maybe that won't end up happening; if it was going to, it
>>> > seems we'd want to deprecate what will be changed in 2.9.
>>>
>>> I could easily be confused on this... but I thought 3.0 is the first
>>> release that's allowed to include Java 1.5 only APIs (eg generics).
>>>
>>> Meaning, we could in theory intro APIs with generics with 3.0,
>>> deprecating the non-generics versions, and then 4.0 (sounds insanely
>>> far away!) would be the first release that could remove the deprecated
>>> non-generics versions?
>>>
>>> That said, I think the "plan" is to release 2.9 soonish (early next
>>> year?), and then fairly quickly turnaround a 3.0 that doesn't have too
>>> many changes except the removal of the deprecated (in 2.9) APIs.  Ie
>>> in practice it won't be until 3.1 when we would intro new
>>> (generics-based) APIs.
>>>
>>>
>>
>> What are examples of the deprecated non-generic APIs?
>>
>> My understanding would be that in 2.9 we have:
>> public void function( List list );
>> and in 3.0
>> public void function( List<...> list );
>>
>> How do you keep both functions around?
>>
>
> We'd have to change the name?  Or deprecate the whole class containing
> these methods (if there are lots of methods to deprecate)?  Definitely
> something of a hassle.
>
> Mike
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


Re: 2.9/3.0 plan & Java 1.5

2008-12-12 Thread Michael McCandless


I can certainly see the benefit/temptation of the alternative "big  
bang" approach.


It's just not clear to me (yet) which way (big bang or not) we're  
planning to go, with 3.x.


Mike

Shai Erera wrote:


I wonder why do we even have to deprecate ...
A method like public void function( List<...> list ) changes  
nothing in terms of API. When people will move to 3.0, they'll have  
to change their JDK anyway to 5 (if they haven't already done so).  
Which means they had code like:
function(List), and where List was not defined as generics. But  
they'll get a warning anyway by the compiler, when they define List,  
that it's not safe to create a list w/o defining its type.


I think that when you move to 5 you have to change a lot of your  
code anyway, so simply changing the Lucene API will not create too  
much of a hassle for existing applications.


Personally I'd hate to find out I have to change my entire  
application because method/classes names were changed.


Shai

On Fri, Dec 12, 2008 at 1:44 PM, Michael McCandless wrote:


Ryan McKinley wrote:


On Dec 12, 2008, at 5:18 AM, Michael McCandless wrote:


Taking this to java-dev (off Jira)...

Mark Miller (Jira) wrote:

> I thought there were some that wanted to change some of the API to java
> 5 for the 3.0 release, cause I thought back compat was less restricted
> 2-3. I guess maybe that won't end up happening; if it was going to, it
> seems we'd want to deprecate what will be changed in 2.9.

I could easily be confused on this... but I thought 3.0 is the first
release that's allowed to include Java 1.5 only APIs (eg generics).

Meaning, we could in theory intro APIs with generics with 3.0,
deprecating the non-generics versions, and then 4.0 (sounds insanely
far away!) would be the first release that could remove the deprecated
non-generics versions?

That said, I think the "plan" is to release 2.9 soonish (early next
year?), and then fairly quickly turnaround a 3.0 that doesn't have too
many changes except the removal of the deprecated (in 2.9) APIs.  Ie
in practice it won't be until 3.1 when we would intro new
(generics-based) APIs.



What are examples of the deprecated non-generic APIs?

My understanding would be that in 2.9 we have:
public void function( List list );
and in 3.0
public void function( List<...> list );

How do you keep both functions around?

We'd have to change the name?  Or deprecate the whole class  
containing these methods (if there are lots of methods to  
deprecate)?  Definitely something of a hassle.


Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org





-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655984#action_12655984
 ] 

Mark Miller commented on LUCENE-1483:
-

Ugg... you know I was afraid of that when I was making the change, but I easily 
convinced myself that FieldSortedHitQueue was just taking that Reader for AUTO 
detect and didn't really relook. It also makes the comparators. Bummer. I guess 
let's open a new issue if we can't easily deal with it here (I've got to look at 
it some more).

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655988#action_12655988
 ] 

Michael McCandless commented on LUCENE-831:
---


{quote}
> At present, KS only caches the docID -> ord map as an array. It builds that
> array by iterating over the terms in the sort field's Lexicon and mapping the
> docIDs from each term's posting list.
{quote}

OK, that corresponds to the "order" array in Lucene's
FieldCache.StringIndex class.

{quote}
> Building the docID -> ord array is straightforward for a single-segment
> SegLexicon. The multi-segment case requires that several SegLexicons be
> collated using a priority queue. In KS, there's a MultiLexicon class which
> handles this; I don't believe that Lucene has an analogous class.
{quote}

Lucene achieves the same functionality by using a MultiReader to read
the terms in order (which uses MultiSegmentReader.MultiTermEnum, which
uses a pqueue under the hood) and building up StringIndex from that.
It's very costly.

{quote}
> Relying on the docID -> ord array alone works quite well until you get to the
> MultiSearcher case. As you know, at that point you need to be able to
> retrieve the actual field values from the ordinal numbers, so that you can
> compare across multiple searchers (since the ordinal values are meaningless).
{quote}

Right, and we are trying to move towards pushing searcher down to the
segment.  Then we can use the per-segment ords for within-segment
collection, and then the real values for merging the separate pqueues
at the end (but, initial results from LUCENE-1483 show that collecting
N queues then merging in the end adds ~20% slowdown for N = 100
segments).

{quote}
> Lex_Seek_By_Num(lexicon, term_num);
> field_val = Lex_Get_Term(lexicon);
> 
> The problem is that seeking by ordinal value on a MultiLexicon iterator
> requires a gnarly implementation and is very expensive. I got it working, but
> I consider it a dead-end design and a failed experiment.
{quote}

OK.

{quote}
> The planned replacement for these iterator-based quasi-FieldCaches involves
> several topics of recent discussion:
> 
> 1) A "keyword" field type, implemented using a format similar to what Nate
> and I came up with for the lexicon index.
> 2) Write per-segment docID -> ord maps at index time for sort fields.
> 3) Memory mapping.
> 4) Segment-centric searching.
> 
> We'd mmap the pre-composed docID -> ord map and use it for intra-segment
> sorting. The keyword field type would be implemented in such a way that we'd
> be able to mmap a few files and get a per-segment field cache, which we'd then
> use to sort hits from multiple segments.
{quote}

OK so your "keyword" field type would expose random-access to field
values by docID, to be used to merge the N segments' pqueues into a
single final pqueue?

The alternative is to use an iterator but pull the values into your
pqueues when they are inserted.  The benefit is iterator-only
exposure, but the downside is likely a higher net cost of insertion.
And if the "assumption" is that these fields can generally be RAM resident
(explicitly or via mmap), then the net benefit of an iterator-only API is
not high.
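
(To pin down the two comparison modes, a bare-bones illustration -- the helper class
below is invented, though FieldCache.StringIndex is the real Lucene class: within one
segment, hits can be ranked by ord, while merging queues from different segments has to
fall back to the actual String values, because ords are not comparable across segments.)

    import org.apache.lucene.search.FieldCache;

    final class SegmentSortExample {

      // Cheap within-segment comparison: ords are consistent inside one segment.
      static int compareWithinSegment(FieldCache.StringIndex index, int docA, int docB) {
        return index.order[docA] - index.order[docB];
      }

      // Cross-segment comparison: resolve each doc's ord back to its String value.
      static int compareAcrossSegments(FieldCache.StringIndex a, int docA,
                                       FieldCache.StringIndex b, int docB) {
        String valueA = a.lookup[a.order[docA]];
        String valueB = b.lookup[b.order[docB]];
        if (valueA == null) return valueB == null ? 0 : -1;   // lookup[0] is the "no term" slot
        if (valueB == null) return 1;
        return valueA.compareTo(valueB);
      }
    }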


> Complete overhaul of FieldCache API/Implementation
> --
>
> Key: LUCENE-831
> URL: https://issues.apache.org/jira/browse/LUCENE-831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Hoss Man
> Fix For: 3.0
>
> Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
> fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
> LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, 
> LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, 
> LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
>
> Motivation:
> 1) Complete overhaul the API/implementation of "FieldCache" type things...
> a) eliminate global static map keyed on IndexReader (thus
> eliminating synch block between completely independent IndexReaders)
> b) allow more customization of cache management (ie: use 
> expiration/replacement strategies, disk backed caches, etc)
> c) allow people to define custom cache data logic (ie: custom
> parsers, complex datatypes, etc... anything tied to a reader)
> d) allow people to inspect what's in a cache (list of CacheKeys) for
> an IndexReader so a new IndexReader can be likewise warmed. 
> e) Lend support for smarter cache management if/when
> IndexReader.reopen is added (merging of cached data from subReaders).
> 2) Provide backwards compatibility to support existing FieldCache API with
> the new implementation, so there is no redundant caching as client code 
> migrates to the new API.


[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655996#action_12655996
 ] 

Mark Miller commented on LUCENE-1483:
-

I've got a quick idea I want to try to fix it.

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: 2.9/3.0 plan & Java 1.5

2008-12-12 Thread Grant Ingersoll
We agreed in the vote that we would allow generics, etc. in 3.0  
including the removal of non-generic versions of the same methods.  In  
other words, we aren't strictly following the way we went from 1.9 to  
2.0.


I sent a thread on 2.9/3.0 planning a while ago, but got no responses...


On Dec 12, 2008, at 5:18 AM, Michael McCandless wrote:



Taking this to java-dev (off Jira)...

Mark Miller (Jira) wrote:

> I thought there were some that wanted to change some of the API to java
> 5 for the 3.0 release, cause I thought back compat was less restricted
> 2-3. I guess maybe that won't end up happening; if it was going to, it
> seems we'd want to deprecate what will be changed in 2.9.

I could easily be confused on this... but I thought 3.0 is the first
release that's allowed to include Java 1.5 only APIs (eg generics).

Meaning, we could in theory intro APIs with generics with 3.0,
deprecating the non-generics versions, and then 4.0 (sounds insanely
far away!) would be the first release that could remove the deprecated
non-generics versions?

That said, I think the "plan" is to release 2.9 soonish (early next
year?), and then fairly quickly turnaround a 3.0 that doesn't have too
many changes except the removal of the deprecated (in 2.9) APIs.  Ie
in practice it won't be until 3.1 when we would intro new
(generics-based) APIs.

Mike


--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: 2.9/3.0 plan & Java 1.5

2008-12-12 Thread Grant Ingersoll
IIRC, we also agreed that we didn't feel any compelling reason to make  
a sweeping change to generics, but would likely just add them as we  
see 'em, unless of course someone wants to do a wholesale patch.  In  
the case of generics, I see no reason why we can't intro them over  
time, people using the non-generic forms will still work.


On Dec 12, 2008, at 7:44 AM, Grant Ingersoll wrote:

We agreed in the vote that we would allow generics, etc. in 3.0  
including the removal of non-generic versions of the same methods.   
In other words, we aren't strictly following the way we went from  
1.9 to 2.0.


I sent a thread on 2.9/3.0 planning a while ago, but got no  
responses...



On Dec 12, 2008, at 5:18 AM, Michael McCandless wrote:



Taking this to java-dev (off Jira)...

Mark Miller (Jira) wrote:

> I thought there were some that wanted to change some of the API to java
> 5 for the 3.0 release, cause I thought back compat was less restricted
> 2-3. I guess maybe that won't end up happening; if it was going to, it
> seems we'd want to deprecate what will be changed in 2.9.

I could easily be confused on this... but I thought 3.0 is the first
release that's allowed to include Java 1.5 only APIs (eg generics).

Meaning, we could in theory intro APIs with generics with 3.0,
deprecating the non-generics versions, and then 4.0 (sounds insanely
far away!) would be the first release that could remove the deprecated
non-generics versions?

That said, I think the "plan" is to release 2.9 soonish (early next
year?), and then fairly quickly turnaround a 3.0 that doesn't have too
many changes except the removal of the deprecated (in 2.9) APIs.  Ie
in practice it won't be until 3.1 when we would intro new
(generics-based) APIs.

Mike




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: 2.9/3.0 plan & Java 1.5

2008-12-12 Thread Grant Ingersoll

See also http://wiki.apache.org/lucene-java/Java_1.5_Migration

On Dec 12, 2008, at 7:44 AM, Grant Ingersoll wrote:

We agreed in the vote that we would allow generics, etc. in 3.0  
including the removal of non-generic versions of the same methods.   
In other words, we aren't strictly following the way we went from  
1.9 to 2.0.


I sent a thread on 2.9/3.0 planning a while ago, but got no  
responses...



On Dec 12, 2008, at 5:18 AM, Michael McCandless wrote:



Taking this to java-dev (off Jira)...

Mark Miller (Jira) wrote:

> I thought there were some that wanted to change some of the API to java
> 5 for the 3.0 release, cause I thought back compat was less restricted
> 2-3. I guess maybe that won't end up happening; if it was going to, it
> seems we'd want to deprecate what will be changed in 2.9.

I could easily be confused on this... but I thought 3.0 is the first
release that's allowed to include Java 1.5 only APIs (eg generics).

Meaning, we could in theory intro APIs with generics with 3.0,
deprecating the non-generics versions, and then 4.0 (sounds insanely
far away!) would be the first release that could remove the deprecated
non-generics versions?

That said, I think the "plan" is to release 2.9 soonish (early next
year?), and then fairly quickly turnaround a 3.0 that doesn't have too
many changes except the removal of the deprecated (in 2.9) APIs.  Ie
in practice it won't be until 3.1 when we would intro new
(generics-based) APIs.

Mike



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655963#action_12655963
 ] 

Uwe Schindler commented on LUCENE-1483:
---

{quote}bq. adding a return type to the collect/hit method? ... ie: an enum 
style result indicating "OK" or "ABORT" (with the potential of adding 
additional constants later ala FieldSelectorResult)

I think we should consider this, though this then implies an if stmt checking 
the return result & doing something, on each hit, so we should test the cost of 
doing so vs the cost of throwing an exception instead (eg we could define a 
typed exception in this new interface which means "abort the search now" and 
maybe another to mean "stop searching & return the results you got so far", 
etc.).{quote}

This looks like a really good idea. Currently, to stop an iterator, I use an 
exception class that extends RuntimeException (to have it unchecked) to cancel 
a search. Very nice if you support it directly.
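
(For illustration, a minimal sketch of that workaround -- all names below are invented:
an unchecked exception thrown from collect() and caught around the search call.)

    import java.io.IOException;

    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    final class CancellableSearch {

      // Unchecked so it can escape HitCollector.collect(), which declares no exceptions.
      static class StopSearchException extends RuntimeException {}

      static void search(IndexSearcher searcher, Query query, final long deadline)
          throws IOException {
        try {
          searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
              if (System.currentTimeMillis() > deadline) {
                throw new StopSearchException();   // cancel the search mid-collection
              }
              // ... record the hit ...
            }
          });
        } catch (StopSearchException cancelled) {
          // Search stopped early; whatever was collected so far can still be used.
        }
      }
    }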

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656007#action_12656007
 ] 

Mark Miller commented on LUCENE-1483:
-

Bah. You can't share that queue and get the reopen benefit without jumping through too 
many hoops. All of a sudden you can't use ordinals, comparators need to know 
how to compare across comparators, and it just breaks down fast. How 
disappointing.

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656014#action_12656014
 ] 

Michael McCandless commented on LUCENE-1483:



Yeah, ugg.  This is the nature of "progress"!  It's not exactly a
straight line from point A to B :) Lots of fits & starts, dead ends,
jumps, etc.

We could simply offer both ("collect into single pqueue but pay high
warming cost" or "collect into separate pqueues, then merge, and pay
low warming cost"), but that sure is an annoying choice to have to
make.

Oh, here's another idea: do separate pqueues (again!), but after the
first segment is done, grab the values for the worst scoring doc in
the pqueue (assuming the queue filled up to its numHits) and use this
as the "cutoff" before inserting into the next segment's pqueue.

In grabbing that cutoff we'd have to 1) map ord->value for segment 1,
then 2) map value->ord for segment 2, then 3) use that cutoff for
segment 2.  (And likewise for all segment N -> N+1).

I think this'd greatly reduce the number of inserts & comparisons done
in subsequent queues because it mimics how a single pqueue behaves:
you don't bother re-considering hits that won't be globally
competitive.

We could also maybe merge after each segment is processed; that way
the cutoff we carry to the next segment is "true" so we'd reduce
comparisons even further.

Would this work?  Let's try to think hard before writing code :)
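
(To make the mechanics concrete, a rough sketch -- every name below is hypothetical and
this is not meant as the implementation: after finishing a segment, the bottom of its
queue is translated to a String value and then to an ord in the next segment, which is
used to reject uncompetitive hits cheaply.)

    import org.apache.lucene.search.FieldCache;

    final class CutoffCarryExample {

      // Map the previous segment's worst ord to its value, then find the largest ord in
      // the next segment whose value is <= that cutoff (assumes worstOrdInPrev >= 1).
      static int translateCutoff(FieldCache.StringIndex prev, int worstOrdInPrev,
                                 FieldCache.StringIndex next) {
        String worstValue = prev.lookup[worstOrdInPrev];
        int low = 1, high = next.lookup.length - 1, cutoff = 0;   // lookup[0] is "no term"
        while (low <= high) {
          int mid = (low + high) >>> 1;
          if (next.lookup[mid].compareTo(worstValue) <= 0) {
            cutoff = mid;
            low = mid + 1;
          } else {
            high = mid - 1;
          }
        }
        return cutoff;   // 0 means nothing in the next segment is <= the cutoff
      }

      // Inside the next segment's collect(): skip hits that can't be globally competitive
      // (roughly -- ties and descending sorts need a bit more care).
      static boolean competitive(FieldCache.StringIndex next, int doc, int cutoffOrd) {
        return next.order[doc] <= cutoffOrd;
      }
    }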


> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656016#action_12656016
 ] 

Mark Miller commented on LUCENE-1483:
-

bq. We could simply offer both ("collect into single pqueue but pay high 
warming cost" or "collect into separate pqueues, then merge, and pay low 
warming cost"), but that sure is an annoying choice to have to make.

Agreed. I really hope we don't have to settle for this.

bq. Oh, here's another idea:

Good one! Keep those ideas coming.

bq. Would this work?

It sounds like you've nailed it to me, but I'll let it float around in my head 
for a bit while I work on some other things.

bq. Let's try to think hard before writing code :)

Now there's a new concept for me. My brain will work itself to death trying to 
avoid real work :)

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1466) CharFilter - normalize characters before tokenizer

2008-12-12 Thread Koji Sekiguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-1466:
---

  Description: 
This proposes to import CharFilter that has been introduced in Solr 1.4.

Please see for the details:
- SOLR-822
- http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

  was:
This proposes to import CharFilter that has been introduced in Solr 1.4.

Please see for the details:
SOLR-822
http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

Lucene Fields: [New, Patch Available]  (was: [New])

> CharFilter - normalize characters before tokenizer
> --
>
> Key: LUCENE-1466
> URL: https://issues.apache.org/jira/browse/LUCENE-1466
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Analysis
>Affects Versions: 2.4
>Reporter: Koji Sekiguchi
>Priority: Minor
> Attachments: LUCENE-1466.patch
>
>
> This proposes to import CharFilter that has been introduced in Solr 1.4.
> Please see for the details:
> - SOLR-822
> - http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: SVN karma problem?

2008-12-12 Thread Karl Wettin
Everything worked great when I switched from svn.eu.apache.org to  
svn.apache.org. I suppose I should report that to someone. Infra?


12 dec 2008 kl. 00.13 skrev Grant Ingersoll:


http://www.nabble.com/Committing-new-files-to-(write-through-proxy)-slave-repo-fails---400-Bad-Request-td20083914.html

Any of that ring a bell?

On Dec 11, 2008, at 5:49 PM, Karl Wettin wrote:

I tried clean checkout, upgraded my SVN client and a bunch of other  
things. I could try to add and remove an alternative dummy file.


11 dec 2008 kl. 23.35 skrev Grant Ingersoll:


Does an svn cleanup help?  What about on a clean checkout?

On Dec 11, 2008, at 5:13 PM, Karl Wettin wrote:

I can't seem to commit new files in contrib, only update  
existing. Or am I misinterpreting the error?


svn: Commit failed (details follow):
svn: Server sent unexpected return value (400 Bad Request) in response to PROPFIND request for '/repos/asf/!svn/wrk/d81a2cce-e749-4cd0-a609-6e2a3763b81d/lucene/java/trunk/contrib/instantiated/src/test/org/apache/lucene/store/instantiated/TestSerialization.java'

svn: Your commit message was left in a temporary file:
svn:'/Users/kalle/projekt/apache/lucene/trunk/svn-commit.tmp'



  karl

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-1462) Instantiated/IndexWriter discrepancies

2008-12-12 Thread Karl Wettin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wettin closed LUCENE-1462.
---

Resolution: Fixed

Committed in r726030 and r725837.

> Instantiated/IndexWriter discrepancies
> -
>
> Key: LUCENE-1462
> URL: https://issues.apache.org/jira/browse/LUCENE-1462
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/*
>Affects Versions: 2.4
>Reporter: Karl Wettin
>Assignee: Karl Wettin
>Priority: Critical
> Fix For: 2.9
>
> Attachments: LUCENE-1462.txt
>
>
>  * RAMDirectory seems to do a reset on tokenStreams the first time; this 
> permits initialising some objects before streaming starts. InstantiatedIndex 
> does not.
>  * I can serialize a RAMDirectory but I cannot serialize an InstantiatedIndex because 
> of: java.io.NotSerializableException: 
> org.apache.lucene.index.TermVectorOffsetInfo
> http://www.nabble.com/InstatiatedIndex-questions-to20576722.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: TestComplexPhraseQuery.java

More tests for Nots

> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: ComplexPhraseQueryParser.java

Added support for "Nots" in phrase queries e.g. "-not interested"

> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: (was: ComplexPhraseQueryParser.java)

> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries

2008-12-12 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-1486:
-

Attachment: (was: TestComplexPhraseQuery.java)

> Wildcards, ORs etc inside Phrase queries
> 
>
> Key: LUCENE-1486
> URL: https://issues.apache.org/jira/browse/LUCENE-1486
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: QueryParser
>Affects Versions: 2.4
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 2.4.1, 2.9
>
> Attachments: ComplexPhraseQueryParser.java, 
> TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of 
> PhraseQueries to allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in 
> QueryParser itself. This works as a proof of concept  for much of the query 
> parser syntax. Examples from the Junit test include:
>   checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies 
> are OK in phrases
>   checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic 
> works
>   checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic 
> works.
>   
>   checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a 
> phrase is bad
>   checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases 
> is bad
>   checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries 
> inside phrases not supported
> Code plus Junit test to follow...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1488) issues with standardanalyzer on multilingual text

2008-12-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656040#action_12656040
 ] 

Robert Muir commented on LUCENE-1488:
-

As soon as I figure out how to invoke the ICU RBBI compiler I'll see if I can 
update the patch with compiled rules so instantiation of this thing is cheap...

> issues with standardanalyzer on multilingual text
> -
>
> Key: LUCENE-1488
> URL: https://issues.apache.org/jira/browse/LUCENE-1488
> Project: Lucene - Java
>  Issue Type: Wish
>  Components: contrib/analyzers
>Reporter: Robert Muir
>Priority: Minor
> Attachments: ICUAnalyzer.patch
>
>
> The standard analyzer in lucene is not exactly unicode-friendly with regards 
> to breaking text into words, especially with respect to non-alphabetic 
> scripts.  This is because it is unaware of unicode bounds properties.
> I actually couldn't figure out how the Thai analyzer could possibly be 
> working until i looked at the jflex rules and saw that codepoint range for 
> most of the Thai block was added to the alphanum specification. defining the 
> exact codepoint ranges like this for every language could help with the 
> problem but you'd basically be reimplementing the bounds properties already 
> stated in the unicode standard. 
> in general it looks like this kind of behavior is bad in lucene for even 
> latin, for instance, the analyzer will break words around accent marks in 
> decomposed form. While most latin letter + accent combinations have composed 
> forms in unicode, some do not. (this is also an issue for asciifoldingfilter 
> i suppose). 
> I've got a partially tested standardanalyzer that uses icu Rule-based 
> BreakIterator instead of jflex. Using this method you can define word 
> boundaries according to the unicode bounds properties. After getting it into 
> some good shape i'd be happy to contribute it for contrib but I wonder if 
> theres a better solution so that out of box lucene will be more friendly to 
> non-ASCII text. Unfortunately it seems jflex does not support use of these 
> properties such as [\p{Word_Break = Extend}] so this is probably the major 
> barrier.
> Thanks,
> Robert
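
To make that concrete: the point is that word spans come from the text's own
boundary properties rather than from hard-coded codepoint ranges in a grammar.
The patch uses ICU's RuleBasedBreakIterator; the sketch below leans on the JDK's
java.text.BreakIterator instead, which only approximates the ICU rules, and the
class name is made up for illustration.

{code}
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class BreakIteratorWords {

  // Collect the word spans reported by a BreakIterator, dropping spans that
  // contain no letter or digit (whitespace and punctuation runs).
  public static List<String> words(String text) {
    BreakIterator bi = BreakIterator.getWordInstance();
    bi.setText(text);
    List<String> out = new ArrayList<String>();
    int start = bi.first();
    for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
      if (containsLetterOrDigit(text, start, end)) {
        out.add(text.substring(start, end));
      }
    }
    return out;
  }

  private static boolean containsLetterOrDigit(String text, int start, int end) {
    for (int i = start; i < end; i++) {
      if (Character.isLetterOrDigit(text.charAt(i))) {
        return true;
      }
    }
    return false;
  }
}
{code}

With the patch, the same loop would be driven by ICU's rule-based iterator, so
boundary behaviour is defined by the Unicode properties (and any custom rules)
rather than by per-language codepoint ranges.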

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1487) FieldCacheTermsFilter

2008-12-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656043#action_12656043
 ] 

Yonik Seeley commented on LUCENE-1487:
--

I think the name should be different since it only works with single-valued 
fields, unlike other TermFilters and TermQueries.

> FieldCacheTermsFilter
> -
>
> Key: LUCENE-1487
> URL: https://issues.apache.org/jira/browse/LUCENE-1487
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.4
>Reporter: Tim Sturge
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: FieldCacheTermsFilter.java
>
>
> This is a companion to FieldCacheRangeFilter except it operates on a set of 
> terms rather than a range. It works best when the set is comparatively large 
> or the terms are comparatively common.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1473) Implement standard Serialization across Lucene versions

2008-12-12 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656071#action_12656071
 ] 

Doug Cutting commented on LUCENE-1473:
--

> Therefore the patch is to be taken as contribution to explore the design 
> space [ ... ]

Yes, and it is much appreciated for that.  Thanks again!

> Currently Searchable does include a HitCollector-based search method [ ... ]

You're right.  I misremembered.  This dates back to the origin of Searchable.

http://svn.apache.org/viewvc?view=rev&revision=149813

Personally, I think it would be reasonable for a distributed implementation to 
throw an exception if one tries to use a HitCollector.

> We could either solve it along the line you propose, or revert to pass the 
> Weight again instead of the Query.

Without using an introspection-based serialization like Java serialization it 
would be difficult to pass a Weight over the wire using public APIs, since most 
implementations are not public.  But, since Weights are constructed via a 
standard protocol, the method I outlined could work.


> Implement standard Serialization across Lucene versions
> ---
>
> Key: LUCENE-1473
> URL: https://issues.apache.org/jira/browse/LUCENE-1473
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Search
>Affects Versions: 2.4
>Reporter: Jason Rutherglen
>Priority: Minor
> Attachments: custom-externalizable-reader.patch, LUCENE-1473.patch, 
> LUCENE-1473.patch, LUCENE-1473.patch, LUCENE-1473.patch, 
> lucene-contrib-remote.patch
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> To maintain serialization compatibility between Lucene versions, 
> serialVersionUID needs to be added to classes that implement 
> java.io.Serializable.  java.io.Externalizable may be implemented in classes 
> for faster performance.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656073#action_12656073
 ] 

Doug Cutting commented on LUCENE-1483:
--

>   public abstract void setBase(int base);

It occurred to me last night that this really has no place in HitCollector.  
We're forcing applications to handle an implementation detail that they really 
shouldn't have to know about.  It would be better to pass the base down to the 
scorer implementations and have them add it on before they call collect(), no?
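
As a minimal sketch of that, assuming the rebasing lives in something handed to
the scorer rather than in the application's collector (the wrapper and its name
are hypothetical; the real change would sit inside the search loop / Scorer):

{code}
import org.apache.lucene.search.HitCollector;

// Hypothetical wrapper: the per-segment search loop would construct one of these
// with the current sub-reader's docBase, so the application's HitCollector only
// ever sees index-wide docIDs.
public class RebasingHitCollector extends HitCollector {
  private final HitCollector delegate;
  private final int docBase; // first docID of the current sub-reader

  public RebasingHitCollector(HitCollector delegate, int docBase) {
    this.delegate = delegate;
    this.docBase = docBase;
  }

  public void collect(int doc, float score) {
    delegate.collect(docBase + doc, score); // add the base before collect()
  }
}
{code}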


> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656079#action_12656079
 ] 

Michael McCandless commented on LUCENE-1483:


{quote}
> It would be better to pass the base down to the scorer implementations and 
> have them add it on before they call collect(), no?
{quote}

So we'd add Scorer.setDocBase instead?

The only downside I can think of here is that often you will perform the 
addition when it wasn't necessary.

Ie, if the score is not competitive at all, then you wouldn't need to create 
the full docID and so you'd save one add opcode.

Admittedly, this is a very small (tiny) cost, and I do agree that making 
HitCollector know about docBase is really an abstraction violation...

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: SVN karma problem?

2008-12-12 Thread Grant Ingersoll


On Dec 12, 2008, at 10:13 AM, Karl Wettin wrote:

Everything worked great when I switched from svn.eu.apache.org to  
svn.apache.org. I suppose I should report that to someone. Infra?


Yes.  Infra.

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: 2.9/3.0 plan & Java 1.5

2008-12-12 Thread Jason Rutherglen
Decoupling IndexReader for 3.0 would be great.  This includes making
public SegmentReader, MultiSegmentReader.

A constructor like new SegmentReader(TermsDictionary termDictionary,
TermPostings termPostings, ColumnStrideFields csd, DocIdBitSet deletedDocs);

Where each class is abstract and can be implemented in an optional way.

Decouple rollback, commit, IndexDeletionPolicy from DirectoryIndexReader
into a class like SegmentsVersionSystem which could act as the controller
for reopen types of methods.  There could be a SegmentVersionSystem that
manages the versioning of a single segment.

I'd rather figure out these things before worrying too much about generics,
which, although nice for readability, don't matter much if the code changes
dramatically and is deprecated.

> of the alternative "big bang" approach.

Is this the type of thing you mean by the "big bang" approach?

On Fri, Dec 12, 2008 at 4:44 AM, Grant Ingersoll wrote:

> We agreed in the vote that we would allow generics, etc. in 3.0 including
> the removal of non-generic versions of the same methods.  In other words, we
> aren't strictly following the way we went from 1.9 to 2.0.
>
> I sent a thread on 2.9/3.0 planning a while ago, but got no responses...
>
>
> On Dec 12, 2008, at 5:18 AM, Michael McCandless wrote:
>
>
>> Taking this to java-dev (off Jira)...
>>
>> Mark Miller (Jira) wrote:
>>
>> > I thought there were some that wanted to change some of the API to java
>> > 5 for the 3.0 release, cause I thought back compat was less restricted
>> > 2-3. I guess mabye that won't end up happening, if it was going to, it
>> > seems we'd want to deprecate what will be changed in 2.9.
>>
>> I could easily be confused on this... but I thought 3.0 is the first
>> release that's allowed to include Java 1.5 only APIs (eg generics).
>>
>> Meaning, we could in theory intro APIs with generics with 3.0,
>> deprecating the non-generics versions, and then 4.0 (sounds insanely
>> far away!) would be the first release that could remove the deprecated
>> non-generics versions?
>>
>> That said, I think the "plan" is to release 2.9 soonish (early next
>> year?), and then fairly quickly turnaround a 3.0 that doesn't have too
>> many changes except the removal of the deprecated (in 2.9) APIs.  Ie
>> in practice it won't be until 3.1 when we would intro new
>> (generics-based) APIs.
>>
>> Mike
>>
>
> --
> Grant Ingersoll
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


[jira] Updated: (LUCENE-1378) Remove remaining @author references

2008-12-12 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-1378:
-

Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
Affects Version/s: 2.9

Reopened, so fix 2.9 instead of 2.4.
Or should I rather open a new issue?

> Remove remaining @author references
> ---
>
> Key: LUCENE-1378
> URL: https://issues.apache.org/jira/browse/LUCENE-1378
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Otis Gospodnetic
>Assignee: Otis Gospodnetic
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1378.patch, LUCENE-1378.patch, 
> LUCENE-1378b.patch, LUCENE-1378c.patch
>
>
> $ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl 
> -pi -e 's/ \...@author.*//'

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1378) Remove remaining @author references

2008-12-12 Thread Paul Elschot (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Elschot updated LUCENE-1378:
-

Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
Affects Version/s: (was: 2.9)
Fix Version/s: (was: 2.4)
   2.9

> Remove remaining @author references
> ---
>
> Key: LUCENE-1378
> URL: https://issues.apache.org/jira/browse/LUCENE-1378
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Otis Gospodnetic
>Assignee: Otis Gospodnetic
>Priority: Trivial
> Fix For: 2.9
>
> Attachments: LUCENE-1378.patch, LUCENE-1378.patch, 
> LUCENE-1378b.patch, LUCENE-1378c.patch
>
>
> $ find . -name \*.java | xargs grep '@author' | cut -d':' -f1 | xargs perl 
> -pi -e 's/ \...@author.*//'

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656089#action_12656089
 ] 

Mark Miller commented on LUCENE-1483:
-

bq. Oh, here's another idea: do separate pqueues (again!), but after the first 
segment is done, grab the values for the worst scoring doc in the pqueue 
(assuming the queue filled up to its numHits) and use this as the "cutoff" 
before inserting into the next segment's pqueue.

We've got to try it. What's the hard part in this? Converting a value to an ord?
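
To show the shape of the cutoff idea in isolation, here is a toy sketch over plain
float values (names made up): one queue per segment, with the worst value kept by
the last full queue carried forward so clearly non-competitive hits never pay for
an insert. The hard part in the real case is exactly the part this sidesteps: field
sort entries hold per-segment ords, not globally comparable values.

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class PerSegmentCutoff {

  // Keep the n largest values across several "segments", carrying the worst value
  // kept by an earlier (full) queue forward as a cheap cutoff, then merge.
  public static List<Float> topN(float[][] segments, int n) {
    List<PriorityQueue<Float>> queues = new ArrayList<PriorityQueue<Float>>();
    Float cutoff = null; // worst value kept by the last queue that filled up
    for (float[] segment : segments) {
      PriorityQueue<Float> pq = new PriorityQueue<Float>(n); // min-heap: head = worst kept
      for (float value : segment) {
        if (cutoff != null && value <= cutoff) {
          continue; // cannot beat what an earlier segment already kept
        }
        pq.offer(Float.valueOf(value));
        if (pq.size() > n) {
          pq.poll();
        }
      }
      if (pq.size() == n) {
        cutoff = pq.peek(); // only a valid cutoff once this queue filled to numHits
      }
      queues.add(pq);
    }
    // Merge the per-segment queues and keep the n largest values overall.
    List<Float> all = new ArrayList<Float>();
    for (PriorityQueue<Float> pq : queues) {
      all.addAll(pq);
    }
    Collections.sort(all, Collections.<Float>reverseOrder());
    return all.subList(0, Math.min(n, all.size()));
  }
}
{code}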

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656089#action_12656089
 ] 

markrmil...@gmail.com edited comment on LUCENE-1483 at 12/12/08 10:24 AM:


bq. Oh, here's another idea: do separate pqueues (again!), but after the first 
segment is done, grab the values for the worst scoring doc in the pqueue 
(assuming the queue filled up to its numHits) and use this as the "cutoff" 
before inserting into the next segment's pqueue.

We've got to try it. What's the hard part in this? Converting a value to an ord?

*EDIT*

Okay, I see, we can just find our place by running through new value 
Comparables.

An added cost for going back to per reader is that all doc id values (not ords) 
also need to be adjusted (for the multisearcher).

  was (Author: markrmil...@gmail.com):
bq. Oh, here's another idea: do separate pqueues (again!), but after the 
first segment is done, grab the values for the worst scoring doc in the pqueue 
(assuming the queue filled up to its numHits) and use this as the "cutoff" 
before inserting into the next segment's pqueue.

We've got to try it. What's the hard part in this? Converting a value to an ord?
  
> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: 2.9/3.0 plan & Java 1.5

2008-12-12 Thread Doug Cutting

Jason Rutherglen wrote:
Decoupling IndexReader for 3.0 would be great.  This includes 
making public SegmentReader, MultiSegmentReader. 

A constructor like new SegmentReader(TermsDictionary termDictionary, 
TermPostings termPostings, ColumnStrideFields csd, DocIdBitSet deletedDocs);


Where each class is abstract and can be implemented in an optional way. 

Decouple rollback, commit, IndexDeletionPolicy from DirectoryIndexReader 
into a class like SegmentsVersionSystem which could act as the 
controller for reopen types of methods.  There could be a 
SegmentVersionSystem that manages the versioning of a single segment. 


Can't this stuff be rolled out as new features in 3.0?  The important 
thing to do now is figure out what can be dropped when we go to 3.0, not 
what might be added after.


I'd rather figure out these things before worrying too much about 
generics which although nice for being able to read the code, doesn't 
matter if the code changes dramatically and is deprecated. 


Folks are discussing whether generics are a special case for 
back-compatibility.  This is an important discussion, since major 
releases are defined by their back-compatibility.  This discussion thus 
should have priority over the discussion of new 3.0 features.


Doug

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1314) IndexReader.clone

2008-12-12 Thread Jason Rutherglen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Rutherglen updated LUCENE-1314:
-

Attachment: LUCENE-1314.patch

LUCENE-1314.patch

- Added TestIndexReaderCloneNorms because cloning the norms is really hard to 
implement as copy on write.  There seem to be many caveats, such as whether 
or not the norms stream is still open.  testNormsRefCounting fails with a 
CorruptIndexException which I'm investigating.  

I now remember implementing a copy of just the bytes, and only editing them in 
the Norm object to get around these issues.  Basically on clone, new Norm 
objects and a Map is created but the byte array of the cloned norm is shared.  
On a write, the bytes are cloned.  This gets around needing to deal with the 
reader norm reference counting used by reopen, though is this a good idea?

I'll try that and see if it works.  Otherwise, suggestions besides hard cloning 
the norms for each clone?
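
As a toy illustration of that byte-sharing scheme, outside of any Lucene class
(names made up): clones share the byte[] until the first write, at which point the
writer takes a private copy. Thread safety is ignored here.

{code}
public class CopyOnWriteBytes {
  private byte[] bytes;
  private boolean shared;

  public CopyOnWriteBytes(byte[] bytes) {
    this.bytes = bytes;
  }

  private CopyOnWriteBytes(byte[] bytes, boolean shared) {
    this.bytes = bytes;
    this.shared = shared;
  }

  // Cloning is cheap: both instances now point at the same array, flagged as shared.
  public CopyOnWriteBytes copyOnWriteClone() {
    this.shared = true;
    return new CopyOnWriteBytes(bytes, true);
  }

  public byte get(int i) {
    return bytes[i]; // reads never copy
  }

  public void set(int i, byte value) {
    if (shared) {
      bytes = bytes.clone(); // first write after a clone: take a private copy
      shared = false;
    }
    bytes[i] = value;
  }
}
{code}

The same shape would apply to deletedDocs if clone() goes the copy-on-write route
there as well.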

> IndexReader.clone
> -
>
> Key: LUCENE-1314
> URL: https://issues.apache.org/jira/browse/LUCENE-1314
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Affects Versions: 2.3.1
>Reporter: Jason Rutherglen
>Assignee: Michael McCandless
>Priority: Minor
> Attachments: LUCENE-1314.patch, LUCENE-1314.patch, LUCENE-1314.patch, 
> lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, 
> lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, 
> lucene-1314.patch, lucene-1314.patch, lucene-1314.patch, lucene-1314.patch
>
>
> Based on discussion 
> http://www.nabble.com/IndexReader.reopen-issue-td18070256.html.  The problem 
> is reopen returns the same reader if there are no changes, so if docs are 
> deleted from the new reader, they are also reflected in the previous reader 
> which is not always desired behavior.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656132#action_12656132
 ] 

Michael McCandless commented on LUCENE-1483:


OK, here's another tweak on the last proposal: maybe we could,
instead, take the pqueue produced by segment 1 and "convert" it into
the ords matching segment 2, and then do normal searching for segment
2 using that single pqueue (and the same for all seg N -> N+1
transitions)?

For all numeric fields, the conversion is a no-op (their ord is
currently the actual numeric byte, short, int, etc. value, though
conceivably that could change in the future); only String fields, and
custom (hmm) would need to do something.

This should be more efficient than the cutoff approach because it'd
result in fewer comparisons/inserts.  Ie, it's exactly a single pqueue
again, just with some "conversion" between segments.  The conversion
cost is near zero for numeric fields, and for string fields it'd be
O(numHits*log2(numValue)), where numValue is the number of unique string
values in the next segment for that sort field.  I think for most cases
(many more docs than numHits requested) this would be faster than the
cutoff approach.

Would that work?

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1487) FieldCacheTermsFilter

2008-12-12 Thread Tim Sturge (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Sturge updated LUCENE-1487:
---

Attachment: FieldCacheTermsFilter.java

Reformatted version. I'm happy to change the name if that's the consensus but I 
can't think of any better alternatives right now.

> FieldCacheTermsFilter
> -
>
> Key: LUCENE-1487
> URL: https://issues.apache.org/jira/browse/LUCENE-1487
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Search
>Affects Versions: 2.4
>Reporter: Tim Sturge
>Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: FieldCacheTermsFilter.java, FieldCacheTermsFilter.java
>
>
> This is a companion to FieldCacheRangeFilter except it operates on a set of 
> terms rather than a range. It works best when the set is comparatively large 
> or the terms are comparatively common.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656136#action_12656136
 ] 

Yonik Seeley commented on LUCENE-1483:
--

segment 1 has terms:  apple, banana, orange
segment 2 has terms: apple, orange

What is the ord of banana in segment2?

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656144#action_12656144
 ] 

Michael McCandless commented on LUCENE-1483:


{quote}
> What is the ord of banana in segment2?
{quote}

How about 0.5?

Ie, we just need an ord that means it's in-between two ords for the current 
segment.

On encountering that, we'd also need to record its real value so that 
subsequent segments could look it up properly (or, if it survives until the 
end, to return the correct value "banana").
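
A minimal sketch of that conversion for String fields, assuming the next segment's
sorted unique values are available as an array (as a StringIndex-style lookup would
provide); class and method names are made up. Values present in the segment get
their real ord, missing values get an in-between ord, and the original String has
to be kept alongside for later segments and the final result:

{code}
import java.util.Arrays;

public class OrdConversion {

  // Map a queued String value onto the ord space of the next segment, given that
  // segment's sorted array of unique values.
  public static float convertOrd(String value, String[] nextSegmentValues) {
    int pos = Arrays.binarySearch(nextSegmentValues, value);
    if (pos >= 0) {
      return pos;                    // value exists in this segment: real ord
    }
    int insertionPoint = -pos - 1;   // index of the first value greater than ours
    return insertionPoint - 0.5f;    // sits between insertionPoint-1 and insertionPoint
  }

  public static void main(String[] args) {
    String[] segment2 = { "apple", "orange" };
    System.out.println(convertOrd("banana", segment2)); // 0.5, between apple and orange
    System.out.println(convertOrd("orange", segment2)); // 1.0
  }
}
{code}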

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-831) Complete overhaul of FieldCache API/Implementation

2008-12-12 Thread Marvin Humphrey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656150#action_12656150
 ] 

Marvin Humphrey commented on LUCENE-831:


>> Building the docID -> ord array is straightforward for a single-segment
>> SegLexicon. The multi-segment case requires that several SegLexicons be
>> collated using a priority queue. In KS, there's a MultiLexicon class which
>> handles this; I don't believe that Lucene has an analogous class.
> 
> Lucene achieves the same functionality by using a MultiReader to read
> the terms in order (which uses MultiSegmentReader.MultiTermEnum, which
> uses a pqueue under the hood) and building up StringIndex from that.
> It's very costly.

Ah, you're right, that class is analogous.  The difference is that
MultiTermEnum doesn't implement seek(), let alone seekByNum().  I was pretty
sure you wouldn't have bothered, since by loading the actual term values into
an array you eliminate the need for seeking the iterator.

> OK so your "keyword" field type would expose random-access to field
> values by docID, 

Yes.  There would be three files for each keyword field in a segment.

  * docID -> ord map.  A stack of i32_t, one per doc.
  * Character data.  Each unique field value would be stored as uncompressed
UTF-8, sorted lexically (by default).
  * Term offsets.  A stack of i64_t, one per term plus one, demarcating the 
term text boundaries in the character data file.

Assuming that we've mmap'd those files -- or slurped them -- here's the
function to find the keyword value associated with a doc num:

{code}
void
KWField_Look_Up(KeyWordField *self, i32_t doc_num, ViewCharBuf *target)
{
if (doc_num > self->max_doc) {
CONFESS("Doc num out of range: %u32 %u32", doc_num, self->max_doc);
}
else {
i64_t offset  = self->offsets[doc_num];
i64_t next_offset = self->offsets[doc_num + 1];
i64_t len = next_offset - offset;
ViewCB_Assign_Str(target, self->chardata + offset, len);
}
}
{code}

I'm not sure whether IndexReader.fetchDoc() should retrieve the values for
keyword fields by default, but I lean towards yes.  The locality isn't ideal,
but I don't think it'll be bad enough to contemplate storing keyword values
redundantly alongside the other stored field values.

> to be used to merge the N segments' pqueues into a
> single final pqueue?

Yes, although I think you only need two priority queues total: one
dedicated to iterating intra-segment, which gets emptied out after each
seg into the other, final queue.

> The alternative is to use iterator but pull the values into your
> pqueues when they are inserted. The benefit is iterator-only
> exposure, but the downside is likely higher net cost of insertion.
> And if the "assumption" is these fields can generally be ram resident
> (explicitly or via mmap), then the net benefit of iterator-only API is
> not high.

If I understand where you're going, you'd like to apply the design of the
deletions iterator to this problem?

For that to work, we'd need to store values for each document, rather than
only unique values... right?  And they couldn't be stored in sorted order,
because we aren't pre-sorting the docs in the segment according to the value
of a keyword field -- which means string diffs don't help.  You'd have a
single file, with each doc's values encoded as a vbyte byte-count followed by
UTF-8 character data.
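
Just to make that byte layout concrete, a small standalone sketch (not the KS
writer; names made up) that encodes one value per doc as a vbyte length followed
by the UTF-8 bytes:

{code}
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class VByteValues {

  public static byte[] encode(String[] perDocValues) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (String value : perDocValues) {
      byte[] utf8 = value.getBytes("UTF-8");
      writeVByte(out, utf8.length);
      out.write(utf8);
    }
    return out.toByteArray();
  }

  // Classic vbyte: low 7 bits per byte, high bit set on every byte except the last.
  private static void writeVByte(ByteArrayOutputStream out, int i) {
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80);
      i >>>= 7;
    }
    out.write(i);
  }
}
{code}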

> Complete overhaul of FieldCache API/Implementation
> --
>
> Key: LUCENE-831
> URL: https://issues.apache.org/jira/browse/LUCENE-831
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Hoss Man
> Fix For: 3.0
>
> Attachments: ExtendedDocument.java, fieldcache-overhaul.032208.diff, 
> fieldcache-overhaul.diff, fieldcache-overhaul.diff, 
> LUCENE-831.03.28.2008.diff, LUCENE-831.03.30.2008.diff, 
> LUCENE-831.03.31.2008.diff, LUCENE-831.patch, LUCENE-831.patch, 
> LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch, LUCENE-831.patch
>
>
> Motivation:
> 1) Complete overhaul the API/implementation of "FieldCache" type things...
> a) eliminate global static map keyed on IndexReader (thus
> eliminating synch block between completley independent IndexReaders)
> b) allow more customization of cache management (ie: use 
> expiration/replacement strategies, disk backed caches, etc)
> c) allow people to define custom cache data logic (ie: custom
> parsers, complex datatypes, etc... anything tied to a reader)
> d) allow people to inspect what's in a cache (list of CacheKeys) for
> an IndexReader so a new IndexReader can be likewise warmed. 
> e) Lend support for smarter cache management if/when
> IndexReader.reopen is added (merging of cached dat

[jira] Created: (LUCENE-1491) EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.

2008-12-12 Thread Todd Feak (JIRA)
EdgeNGramTokenFilter stops on tokens smaller than minimum gram size.


 Key: LUCENE-1491
 URL: https://issues.apache.org/jira/browse/LUCENE-1491
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.4
Reporter: Todd Feak


If a token is encountered in the stream that is shorter in length than the min 
gram size, the filter will stop processing the token stream.

Working up a unit test now, but may be a few days before I can provide it. 
Wanted to get it in the system.
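
For reference, a standalone sketch of the intended behavior: tokens shorter than
the minimum gram size are skipped, not treated as the end of the stream. This is
not the actual EdgeNGramTokenFilter fix (that belongs in the filter's own next()
loop); the class below is a made-up name using the 2.4 reusable-token API.

{code}
import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class SkipShortTokensFilter extends TokenFilter {
  private final int minGram;

  public SkipShortTokensFilter(TokenStream input, int minGram) {
    super(input);
    this.minGram = minGram;
  }

  public Token next(final Token reusableToken) throws IOException {
    for (Token nextToken = input.next(reusableToken); nextToken != null;
         nextToken = input.next(reusableToken)) {
      if (nextToken.termLength() >= minGram) {
        return nextToken; // long enough: pass it through
      }
      // too short: keep consuming instead of returning null
    }
    return null; // true end of stream
  }
}
{code}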

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Pluggable IndexReader (was 2.9/3.0 plan & Java 1.5)

2008-12-12 Thread Marvin Humphrey

Doug Cutting:

> Folks are discussing whether generics are a special case for 
> back-compatibility.  This is an important discussion, since major 
> releases are defined by their back-compatibility.  This discussion thus 
> should have priority over the discussion of new 3.0 features.

Okeedoke.  Since I'm working on this right now for KS, though, I'd like to
continue the conversation under a new thread heading.

I have a bunch of file format changes to push through, and I'm hoping to
implement them using pluggable modules.  For instance, I'd like to be able to
swap out bit-vector-based deletions for tombstone-based deletions, just by
overriding a method or two.

Jason Rutherglen:

> Decoupling IndexReader for 3.0 would be great.  This includes making
> public SegmentReader, MultiSegmentReader.

I definitely think that IndexReader can and should be made more pluggable.  Is
exposing per-segment sub-readers a definite win, though?  Does it make sense
to leave open the door to index components which don't operate on segments?
Or even to eliminate SegmentReader entirely and have sub-components of
IndexReader manage collation?

I've been thinking about this with regard to tombstone-based deletions, where
you can't know everything about a segment unless you've opened up other
segments.

> A constructor like new SegmentReader(TermsDictionary termDictionary,
> TermPostings termPostings, ColumnStrideFields csd, DocIdBitSet deletedDocs);

You end up with a proliferation of constructors that way.  Term vectors?
Arbitrary auxiliary components such as an R-tree component supporting
geographic search?

My original proposal to clean this up involved an "IndexComponent" class.
However, when I started implementing it, I ended up with a slew of new classes
with only two factory methods each.

We could possibly move those factory methods up into Schema, but I'm reluctant 
to
dirty it up, since it's a major public class in KS (as I anticipate it will be
in Lucy) and major public classes should be as simple as possible.

So, how about an IndexArchitecture or IndexPlan class?

  class MyArchitecture extends IndexArchitecture {
public PostingsWriter PostingsWriter() {
  return new PForDeltaPostingsWriter();
}
public PostingsReader PostingsReader() {
  return new PForDeltaPostingsReader();
}
public DeletionsWriter DeletionsWriter() {
  return new TombstoneWriter();
}
public DeletionsReader DeletionsReader() {
  return new TombstoneReader();
}
  }

Lucene:

  IndexWriter writer = new IndexWriter("/path/to/index", 
new StandardAnalyzer(), new MyArchitecture());

Lucy with Java bindings:

  class MySchema extends Schema {
public MySchema() {
  initField("title", "text");
  initField("content", "text");
}
public IndexArchitecture indexArchitecture() { 
  return new MyArchitecture(); 
}
public Analyzer analyzer() { 
  return new PolyAnalyzer("en"); 
}
  }

  IndexWriter writer = new IndexWriter(MySchema.open("/path/to/index"));

> Decouple rollback, commit, IndexDeletionPolicy from DirectoryIndexReader
> into a class like SegmentsVersionSystem which could act as the controller
> for reopen types of methods.  There could be a SegmentVersionSystem that
> manages the versioning of a single segment.

I like it. :)

Sometimes you want to change up the merge policy for different writers against
the same index.  How does that fit into your plan?

My thought is that merge-policies would be application-specific rather than
index-specific.

Marvin Humphrey


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656206#action_12656206
 ] 

Mark Miller commented on LUCENE-1483:
-

Okay, but how am I going to squeeze between two customs? I guess you'd have to 
store as a compare against either side?

> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656206#action_12656206
 ] 

markrmil...@gmail.com edited comment on LUCENE-1483 at 12/12/08 4:32 PM:
---

Okay, but how am I going to squeeze between two customs? I guess you'd have to 
store as a compare against either side?

*EDIT*

There is also the problem that all compares are done based on ScoreDocs that 
index into a single ord array by doc. The previous pq's ScoreDocs will not 
compare right - they won't index into the ord array for the current Reader - 
they are indexes into the array for the previous Reader. This is what made me 
give up on single pq earlier.

  was (Author: markrmil...@gmail.com):
Okay, but how am I going to squeeze between two customs? I guess you'd have 
to store as a compare against either side?
  
> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



To clone or have a pluggable docidbitset for IndexReader

2008-12-12 Thread Jason Rutherglen
Hello,

In trying to figure out the best way to have a realtime system in which
the deletedDocs do not need to be saved, there are two possible methods:
1) setting the DocIdBitSet manually (which breaks the saving machinery,
but does not require norms cloning), or 2) implementing IndexReader.clone,
which requires "copy on write" for deletedDocs and norms.

The discussion about reopen (
https://issues.apache.org/jira/browse/LUCENE-743)
was lengthy and I can see from the code and the discussion why no one wants
to
revisit IndexReader.reopen in the form of IndexReader.clone and possibly
mess things up.

Is some alternative easier API possible that I'm missing?

-J


[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656206#action_12656206
 ] 

markrmil...@gmail.com edited comment on LUCENE-1483 at 12/12/08 6:03 PM:
---

Okay, but how am I going to squeeze between two customs? I guess you'd have to 
store as a compare against either side?

*EDIT*

There is also the problem that all compares are done based on ScoreDocs that 
index into a single ord array by doc. The previous pq's ScoreDocs will not 
compare right - they won't index into the ord array for the current Reader - 
they are indexes into the array for the previous Reader. This is what made me 
give up on single pq earlier.

*EDIT*

I guess we put them on the ScoreDoc like we do the values for multisearcher? 
Then we could use a PQ like FieldDocPQ that used ords rather than vals?

  was (Author: markrmil...@gmail.com):
Okay, but how am I going to squeeze between two customs? I guess you'd have 
to store as a compare against either side?

*EDIT*

There is also the problem that all compares are done based on ScoreDocs that 
index into a single ord array by doc. The previous pq's ScoreDocs will not 
compare right - they won't index into the ord array for the current Reader - 
they are indexes into the array for the previous Reader. This is what made me 
give up on single pq earlier.
  
> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-1483) Change IndexSearcher to use MultiSearcher semantics for multiple subreaders

2008-12-12 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12656206#action_12656206
 ] 

markrmil...@gmail.com edited comment on LUCENE-1483 at 12/12/08 6:35 PM:
---

Okay, but how am I going to squeeze between two customs? I guess you'd have to 
store as a compare against either side?

*EDIT*

There is also the problem that all compares are done based on ScoreDocs that 
index into a single ord array by doc. The previous pq's ScoreDocs will not 
compare right - they won't index into the ord array for the current Reader - 
they are indexes into the array for the previous Reader. This is what made me 
give up on single pq earlier.

*EDIT*

I guess we put them on the ScoreDoc like we do the values for multisearcher? 
Then we could use a PQ like FieldDocPQ that used ords rather than vals?

*EDIT*

Hmmm...How do I get at the ordinals though? The value is exposed, but the 
ordinals are hidden behind a compare method...

  was (Author: markrmil...@gmail.com):
Okay, but how am I going to squeeze between two customs? I guess you'd have 
to store as a compare against either side?

*EDIT*

There is also the problem that all compares are done based on ScoreDocs that 
index into a single ord array by doc. The previous pq's ScoreDocs will not 
compare right - they won't index into the ord array for the current Reader - 
they are indexes into the array for the previous Reader. This is what made me 
give up on single pq earlier.

*EDIT*

I guess we put them on the ScoreDoc like we do the values for multisearcher? 
Then we could use a PQ like FieldDocPQ that used ords rather than vals?
  
> Change IndexSearcher to use MultiSearcher semantics for multiple subreaders
> ---
>
> Key: LUCENE-1483
> URL: https://issues.apache.org/jira/browse/LUCENE-1483
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.9
>Reporter: Mark Miller
>Priority: Minor
> Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch
>
>
> FieldCache and Filters are forced down to a single segment reader, allowing 
> for individual segment reloading on reopen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org