Re: First cut at web-based Luke for contrib
The 17 MB bundle I provided is essentially the source plus dependencies, the bulk of which is jars, mainly the compile-time dependency gwt-dev-windows.jar weighing in at 10 MB. The built WAR file is only 1.5 MB. The WAR file bundled with Jetty (as a convenience) is 8 MB. It may be possible to use ProGuard or something like that to try to slim down gwt-dev-windows.jar.

Cheers,
Mark

Doug Cutting wrote:
> Mark Miller wrote:
>> My only concern is with the size increase this will give to the Lucene jar. Another 17 meg - yikes!
>
> You mean the release tar file, not the jar, right? Is the size of the release really an issue for folks?
>
> Doug

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Potential bug in StandardTokenizerImpl
I agree that being backward compatible is important. But ... I also work at a company that delivers search solutions to many customers. Sometimes customers are told that a specific fix will require them to rebuild their indexes. Customers can then choose whether or not to install the fix.

However, from your statement below I gather that once Lucene 3.0 is out, we won't have to be backward compatible, and that fix can go into that release ... if I'm right, then someone can mark that issue for 3.0 and not 2.3 (I'm not sure I have the permissions to do so).

Isn't there a way to include a fix that you can choose whether or not to install? For example, I may want to download 2.3 (when it's out) and apply this patch only. I'm sure there's a way to do it. If there is, we could publish this as official in 3.0 and as an available patch for 2.3 (I fixed it only in jflex, but can easily produce a patch for the .jj file, so it will fix the 2.2 version as well). My only concern is that this patch will get lost if we don't mark it for any release ...

Shai

On Nov 28, 2007 9:18 PM, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : Thanks to Shai Erera for translating the discussion into the developers'
> : list. I am surprised about Chris Hostetter's response, as this issue was
>
> to clarify: i'm not saying that the current behavior is ideal, or even
> correct -- i'm saying the current behavior is the current behavior, and
> changing it could easily break existing indexes -- something that the
> Lucene upgrade contract does not allow...
>
> http://wiki.apache.org/lucene-java/BackwardsCompatibility
>
> specifically: if someone built an index with 2.2, that index needs to work
> when queried by an app running 2.3 .. if we change the StandardTokenizer
> to treat this differently, that won't work.
>
> In some cases, being backwards compatible is more important than being
> "correct" ... i'm not 100% certain that this is one of those cases, i'm
> just pointing out that there is more to this issue than just a one-line
> patch to some code.
>
> -Hoss

--
Regards,
Shai Erera
[jira] Resolved: (LUCENE-1071) SegmentMerger doesn't set payload bit in new optimized code
[ https://issues.apache.org/jira/browse/LUCENE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch resolved LUCENE-1071.
-----------------------------------
    Resolution: Fixed

Committed.

> SegmentMerger doesn't set payload bit in new optimized code
> -----------------------------------------------------------
>
>                 Key: LUCENE-1071
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1071
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>             Fix For: 2.3
>
>         Attachments: lucene-1071.patch
>
> In the new optimized code in SegmentMerger the payload bit is not set correctly
> in the merged segment. This means that we lose all payloads during a merge!
> The Payloads unit test doesn't catch this. Now that we have the new
> DocumentsWriter we buffer many more docs by default than before. This means
> that the test cases can't assume anymore that the DocsWriter flushes after 10
> docs by default. TestPayloads however falsely assumed this, which means that no
> merges happen anymore in TestPayloads. We should check whether there are
> other test cases that rely on this.
> The fixes for TestPayloads and SegmentMerger are very simple; I'll attach a patch
> soon.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1071) SegmentMerger doesn't set payload bit in new optimized code
[ https://issues.apache.org/jira/browse/LUCENE-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1071:
----------------------------------
    Attachment: lucene-1071.patch

I'm going to commit this very soon.

> SegmentMerger doesn't set payload bit in new optimized code
>
>                 Key: LUCENE-1071
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1071
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>             Fix For: 2.3
>
>         Attachments: lucene-1071.patch
[jira] Created: (LUCENE-1071) SegmentMerger doesn't set payload bit in new optimized code
SegmentMerger doesn't set payload bit in new optimized code
-----------------------------------------------------------

                Key: LUCENE-1071
                URL: https://issues.apache.org/jira/browse/LUCENE-1071
            Project: Lucene - Java
         Issue Type: Bug
   Affects Versions: 2.3
           Reporter: Michael Busch
           Assignee: Michael Busch
            Fix For: 2.3

In the new optimized code in SegmentMerger the payload bit is not set correctly in the merged segment. This means that we lose all payloads during a merge! The Payloads unit test doesn't catch this. Now that we have the new DocumentsWriter we buffer many more docs by default than before. This means that the test cases can't assume anymore that the DocsWriter flushes after 10 docs by default. TestPayloads however falsely assumed this, which means that no merges happen anymore in TestPayloads. We should check whether there are other test cases that rely on this. The fixes for TestPayloads and SegmentMerger are very simple; I'll attach a patch soon.
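The invariant the fix has to restore can be shown with a small self-contained sketch (the names below are illustrative, not the patch itself): after a merge, the merged field must store payloads if any input segment stored payloads for that field.

```java
// Hedged sketch of the payload-bit merge invariant described in LUCENE-1071.
// This is not Lucene's SegmentMerger code; it only illustrates the rule that
// the merged segment's payload bit is the OR of the input segments' bits.
public class MergePayloadBit {
    static boolean mergedStorePayloads(boolean[] segmentPayloadBits) {
        boolean store = false;
        for (boolean b : segmentPayloadBits) {
            store |= b;  // forgetting this OR is exactly how payloads get lost
        }
        return store;
    }
}
```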
Payload Loading and Reloading
In working on LUCENE-1001, things are getting a bit complicated with loading payloads in overlapping spans (which causes the dreaded "Can't load payload more than once" error). This got me thinking about why we need the rule that payloads can only be loaded once. I forget the reasoning behind this.

Can we just store the current position before we load the payload and then seek back to that point if we need to load the payload again? I suppose in the case of really large payloads the seek on the IndexInput could be expensive, but in reality most payloads aren't likely to be more than a few bytes, right? There also seem to be some interactions with the lazy skipping that I haven't quite pinned down yet. What else am I forgetting?

The other alternative I can think of is that I could cache the payloads, but that seems unwieldy too.

-Grant
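The caching alternative mentioned above could look roughly like the following. This is a hedged sketch with a hypothetical PayloadSource abstraction, not Lucene's actual TermPositions API: read the payload from the underlying input once, keep the bytes, and serve later calls from the copy instead of re-seeking the IndexInput.

```java
// Illustrative sketch of caching a payload so it can be handed out more than
// once. PayloadSource is a made-up stand-in for whatever reads the bytes off
// disk; Lucene's real API differs.
public class CachedPayload {
    interface PayloadSource {
        byte[] readPayloadOnce();  // hypothetical: may only be called once
    }

    private final PayloadSource source;
    private byte[] cached;  // null until first load

    CachedPayload(PayloadSource source) {
        this.source = source;
    }

    // Safe to call any number of times; the underlying read happens only once.
    byte[] getPayload() {
        if (cached == null) {
            cached = source.readPayloadOnce();
        }
        return cached.clone();  // defensive copy so callers can't mutate the cache
    }
}
```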
Re: First cut at web-based Luke for contrib
Yes, I didn't mean jar, I meant zip ... or tar. I guess this may not be a sticking point for most people. For me, I just get a knee-jerk reaction seeing the dist size quadruple for a feature that is, in reality, a very small part of the dist. I am not arguing against adding it, just noting my stomach drop. Take it for what you will. It wouldn't stop me from downloading Lucene.

- Mark

Doug Cutting wrote:
> Mark Miller wrote:
>> My only concern is with the size increase this will give to the Lucene jar. Another 17 meg - yikes!
>
> You mean the release tar file, not the jar, right? Is the size of the release really an issue for folks?
>
> Doug
Re: First cut at web-based Luke for contrib
Mark Miller wrote:
> My only concern is with the size increase this will give to the Lucene jar. Another 17 meg - yikes!

You mean the release tar file, not the jar, right? Is the size of the release really an issue for folks?

Doug
Re: First cut at web-based Luke for contrib
Compiled and ran it on Vista. Very cool. I am also a huge GWT fan. This is a great start. The only issue I ran into was also the scrolling issue when selecting the drive ... but based on the TODO: comments, it appears you have seen that.

Grant: It doesn't ask permission to read your hard drive because a webapp running in Jetty is reading the server hard drive and sending the info to GWT with Ajax RPC calls.

My only concern is with the size increase this will give to the Lucene jar. Another 17 meg - yikes!

- Mark

Grant Ingersoll wrote:
> Seriously cool!
>
> On Nov 28, 2007, at 6:13 PM, markharw00d wrote:
>> Any takers to test this contrib layout before I commit it?
>> http://www.inperspective.com/lucene/webluke.zip
>> This is a (17MB) zip file which you can unzip to a new "webluke" directory under your copy of lucene/contrib and then run the usual Lucene Ant build (or at least "ant build-contrib"). You should then find build/contrib/webluke/WebLuke.war plus a Jetty-based server in build/contrib/webluke/dist which can be started using java -jar start.jar. A Luke webapp should then be available on http://localhost:8080/WebLuke
>
> The zip doesn't have a parent dir, so I would put it under contrib/webluke
>
>> I've tested building here on XP but want to check that the GWT compile task works OK for others as the Google compiler is packaged in a platform-specific windows jar. I've tested this ant build with the windows jar on a Linux box and all was OK so I'm guessing the platform-specific bits of the google dev tools are related to the browser choices used in their "hosted" dev mode rather than the Java-to-JavaScript compiler.
>
> Works for me on OS X 10.5 using Java 1.5. The only minor annoyance is that the expansion of directories when choosing the index directory always scrolls back to the top, but heh, that is just a nit. Also, doesn't it have to ask me for permission to access my hard drive? Don't know about GWT, so maybe just my ignorance.
>
> At any rate, I had it up and running and browsing my index in minutes. And the visualization of Zipf's law is really cool too, as is the vocab growth graph! Very nice work and I imagine it will only get better. +1 for adding it to contrib!
>
> -Grant
Re: First cut at web-based Luke for contrib
Seriously cool!

On Nov 28, 2007, at 6:13 PM, markharw00d wrote:
> Any takers to test this contrib layout before I commit it?
> http://www.inperspective.com/lucene/webluke.zip
> This is a (17MB) zip file which you can unzip to a new "webluke" directory under your copy of lucene/contrib and then run the usual Lucene Ant build (or at least "ant build-contrib"). You should then find build/contrib/webluke/WebLuke.war plus a Jetty-based server in build/contrib/webluke/dist which can be started using java -jar start.jar. A Luke webapp should then be available on http://localhost:8080/WebLuke

The zip doesn't have a parent dir, so I would put it under contrib/webluke

> I've tested building here on XP but want to check that the GWT compile task works OK for others as the Google compiler is packaged in a platform-specific windows jar. I've tested this ant build with the windows jar on a Linux box and all was OK so I'm guessing the platform-specific bits of the google dev tools are related to the browser choices used in their "hosted" dev mode rather than the Java-to-JavaScript compiler.

Works for me on OS X 10.5 using Java 1.5. The only minor annoyance is that the expansion of directories when choosing the index directory always scrolls back to the top, but heh, that is just a nit. Also, doesn't it have to ask me for permission to access my hard drive? Don't know about GWT, so maybe just my ignorance.

At any rate, I had it up and running and browsing my index in minutes. And the visualization of Zipf's law is really cool too, as is the vocab growth graph! Very nice work and I imagine it will only get better. +1 for adding it to contrib!

-Grant
[jira] Commented: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly
[ https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546485 ]

Michael Busch commented on LUCENE-588:
--------------------------------------

True... a solution might be to have the query parser map escaped chars to some unused unicode codepoints. Then the WildcardQuery could distinguish escaped chars. I'd guess that other classes, like FuzzyQuery, might have the same problem? The advantage of such a char mapping is that we can keep the String API and don't have to add special APIs to the Query objects for the query parser.

> Escaped wildcard character in wildcard term not handled correctly
> -----------------------------------------------------------------
>
>                 Key: LUCENE-588
>                 URL: https://issues.apache.org/jira/browse/LUCENE-588
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 2.0.0
>         Environment: Windows XP SP2
>            Reporter: Sunil Kamath
>
> If an escaped wildcard character is specified in a wildcard query, it is
> treated as a wildcard instead of a literal.
> e.g., t\??t is converted by the QueryParser to t??t - the escape character is
> discarded.
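The codepoint-mapping idea can be sketched in plain Java. The specific private-use codepoints and method names below are illustrative assumptions, not anything from an actual patch: the parser would replace an escaped '?' or '*' with a private-use character so WildcardQuery can tell literals from wildcards, then map it back before comparing against index terms.

```java
// Illustrative sketch of mapping escaped wildcard chars to unused (private-use
// area) codepoints, so escape information survives the parser/query boundary.
public class EscapedWildcards {
    // Assumed stand-ins for escaped wildcard chars (U+E000/U+E001 are in the
    // Unicode private-use area, so they never occur in ordinary indexed text).
    static final char ESCAPED_STAR = '\uE000';
    static final char ESCAPED_QMARK = '\uE001';

    // What the query parser would do: keep real wildcards, encode escaped ones.
    static String encode(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\\' && i + 1 < raw.length()) {
                char next = raw.charAt(++i);
                if (next == '*') out.append(ESCAPED_STAR);
                else if (next == '?') out.append(ESCAPED_QMARK);
                else out.append(next);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // What WildcardQuery would do just before matching terms as literals.
    static String decode(String encoded) {
        return encoded.replace(ESCAPED_STAR, '*').replace(ESCAPED_QMARK, '?');
    }
}
```

With this, "t\??t" encodes to "t" + ESCAPED_QMARK + "?t", so the second question mark is still recognizably a wildcard while the first is a literal, and the String-based API is unchanged.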
[jira] Issue Comment Edited: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly
[ https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546478 ]

[EMAIL PROTECTED] edited comment on LUCENE-588 at 11/28/07 3:27 PM:
-------------------------------------------------------------------

The problem is that the WildcardQuery itself doesn't have a concept of escaped characters. The escape characters are removed in QueryParser. This means "t?\?t" will arrive as "t??t" in WildcardQuery and the second question mark is also interpreted as a wildcard.

was (Author: [EMAIL PROTECTED]):
The problem is that the WildcardQuery itself doesn't have a concept of escaped characters. The escape characters are removed in QueryParser. This means "t?\\?t" will arrive as "t??t" in WildcardQuery and the second question mark is also interpreted as a wildcard.
[jira] Commented: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly
[ https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546479 ]

Daniel Naber commented on LUCENE-588:
-------------------------------------

Also, the original report and my comment look confusing because Jira removes the backslash. Imagine a backslash in front of *one* of the question marks.
[jira] Issue Comment Edited: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly
[ https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546478 ]

[EMAIL PROTECTED] edited comment on LUCENE-588 at 11/28/07 3:27 PM:
-------------------------------------------------------------------

The problem is that the WildcardQuery itself doesn't have a concept of escaped characters. The escape characters are removed in QueryParser. This means "t?\\?t" will arrive as "t??t" in WildcardQuery and the second question mark is also interpreted as a wildcard.

was (Author: [EMAIL PROTECTED]):
The problem is that the WildcardQuery itself doesn't have a concept of escaped characters. The escape characters are removed in QueryParser. This means "t?\?t" will arrive as "t??t" in WildcardQuery and the second question mark is also interpreted as a wildcard.
[jira] Commented: (LUCENE-588) Escaped wildcard character in wildcard term not handled correctly
[ https://issues.apache.org/jira/browse/LUCENE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546478 ]

Daniel Naber commented on LUCENE-588:
-------------------------------------

The problem is that the WildcardQuery itself doesn't have a concept of escaped characters. The escape characters are removed in QueryParser. This means "t?\?t" will arrive as "t??t" in WildcardQuery and the second question mark is also interpreted as a wildcard.
First cut at web-based Luke for contrib
Any takers to test this contrib layout before I commit it?

http://www.inperspective.com/lucene/webluke.zip

This is a (17MB) zip file which you can unzip to a new "webluke" directory under your copy of lucene/contrib and then run the usual Lucene Ant build (or at least "ant build-contrib"). You should then find build/contrib/webluke/WebLuke.war plus a Jetty-based server in build/contrib/webluke/dist which can be started using java -jar start.jar. A Luke webapp should then be available on http://localhost:8080/WebLuke

I've tested building here on XP but want to check that the GWT compile task works OK for others, as the Google compiler is packaged in a platform-specific windows jar. I've tested this ant build with the windows jar on a Linux box and all was OK, so I'm guessing the platform-specific bits of the google dev tools are related to the browser choices used in their "hosted" dev mode rather than the Java-to-JavaScript compiler.

I still need to add Apache licenses to all the source and tidy some superfluous files, but otherwise it feels just about ready to contribute.

Cheers
Mark
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546420 ]

Michael Busch commented on LUCENE-584:
--------------------------------------

Yes you're right, I ran the tests w/ code coverage analysis enabled, and the BitSetMatcher is fully covered. Good!

> Decouple Filter from BitSet
> ---------------------------
>
>                 Key: LUCENE-584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-584
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.0.1
>            Reporter: Peter Schäfer
>            Assignee: Michael Busch
>            Priority: Minor
>         Attachments: bench-diff.txt, bench-diff.txt, lucene-584.patch, Matcher-20070905-2default.patch, Matcher-20070905-3core.patch, Matcher-20071122-1ground.patch, Some Matchers.zip
>
> {code}
> package org.apache.lucene.search;
>
> public abstract class Filter implements java.io.Serializable {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
>
> public interface AbstractBitSet {
>   public boolean get(int index);
> }
> {code}
>
> It would be useful if the method =Filter.bits()= returned an abstract
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of
> memory. It would be desirable to have an alternative BitSet implementation
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was
> obviously not designed for that purpose. That's why I propose to use an
> interface instead. The default implementation could still delegate to
> =java.util.BitSet=.
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546393 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

With the full patch applied, the following test cases use a BitSetMatcher: TestQueryParser, TestComplexExplanations, TestComplexExplanationsOfNonMatches, TestConstantScoreRangeQuery, TestDateFilter, TestFilteredQuery, TestMultiSearcherRanking, TestPrefixFilter, TestRangeFilter, TestRemoteCachingWrapperFilter, TestRemoteSearchable, TestScorerPerf, TestSimpleExplanations, TestSimpleExplanationsOfNonMatches, TestSort. So I don't think it is necessary to provide separate test cases.
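To illustrate the use case in the issue description, here is a minimal sparse implementation of the proposed AbstractBitSet interface. The HashBitSet name and its backing structure are assumptions for illustration, not part of any attached patch:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of a sparse bit set behind the proposed AbstractBitSet interface.
// Memory scales with the number of set bits (visible documents) rather than
// with the total number of documents in the index.
public class SparseBitSet {
    public interface AbstractBitSet {
        boolean get(int index);
    }

    public static class HashBitSet implements AbstractBitSet {
        private final Set<Integer> bits = new HashSet<Integer>();

        public void set(int index) {
            bits.add(index);
        }

        public boolean get(int index) {
            return bits.contains(index);
        }
    }
}
```

A hash-backed set is only one possible choice; sorted int arrays or compressed bitmaps would serve the same interface, which is exactly the point of decoupling Filter from java.util.BitSet.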
Re: Potential bug in StandardTokenizerImpl
: Thanks to Shai Erera for translating the discussion into the developers'
: list. I am surprised about Chris Hostetter's response, as this issue was

to clarify: i'm not saying that the current behavior is ideal, or even correct -- i'm saying the current behavior is the current behavior, and changing it could easily break existing indexes -- something that the Lucene upgrade contract does not allow...

http://wiki.apache.org/lucene-java/BackwardsCompatibility

specifically: if someone built an index with 2.2, that index needs to work when queried by an app running 2.3 .. if we change the StandardTokenizer to treat this differently, that won't work.

In some cases, being backwards compatible is more important than being "correct" ... i'm not 100% certain that this is one of those cases, i'm just pointing out that there is more to this issue than just a one-line patch to some code.

-Hoss
[jira] Commented: (LUCENE-1061) Adding a factory to QueryParser to instantiate query instances
[ https://issues.apache.org/jira/browse/LUCENE-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546353 ]

Michael Busch commented on LUCENE-1061:
---------------------------------------

Yonik, I remember that we talked briefly about a QueryFactory in Atlanta and you had some cool ideas. Maybe you could mention them here?

> Adding a factory to QueryParser to instantiate query instances
> --------------------------------------------------------------
>
>                 Key: LUCENE-1061
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1061
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: QueryParser
>    Affects Versions: 2.3
>            Reporter: John Wang
>             Fix For: 2.3
>         Attachments: lucene_patch.txt
>
> With the new efforts with Payload and scoring functions, it would be nice to
> plug in custom query implementations while using the same QueryParser.
> Included is a patch with some refactoring of the QueryParser to take a
> factory that produces query instances.
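The factory idea can be sketched with stand-in types. All names here are illustrative, not the API from lucene_patch.txt: the parser delegates query construction to a factory, so an application can substitute payload-aware or otherwise custom Query subclasses without forking the parser.

```java
// Hedged sketch of a QueryFactory plugged into a parser. Query and TermQuery
// are local stand-ins, not Lucene's classes.
public class QueryFactoryExample {
    interface Query { }

    static class TermQuery implements Query {
        final String field, text;
        TermQuery(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    interface QueryFactory {
        Query newTermQuery(String field, String text);
    }

    // Default behavior: plain TermQuery, as a stock parser would build.
    static final QueryFactory DEFAULT = new QueryFactory() {
        public Query newTermQuery(String field, String text) {
            return new TermQuery(field, text);
        }
    };

    // The parser calls the factory instead of instantiating queries directly,
    // so swapping the factory swaps the concrete Query types produced.
    static Query parseTerm(QueryFactory factory, String field, String text) {
        return factory.newTermQuery(field, text);
    }
}
```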
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546352 ]

Doug Cutting commented on LUCENE-1044:
--------------------------------------

> I think deprecating flush(), renaming it to commit()

+1

That's clearer, since flushes are internal optimizations, while commits are important events to clients.

> Behavior on hard power shutdown
> -------------------------------
>
>                 Key: LUCENE-1044
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1044
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>         Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java 1.5
>            Reporter: venkat rangan
>            Assignee: Michael McCandless
>             Fix For: 2.3
>         Attachments: FSyncPerfTest.java, LUCENE-1044.patch, LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch
>
> When indexing a large number of documents, upon a hard power failure (e.g.
> pull the power cord), the index seems to get corrupted. We start a Java
> application as a Windows Service, and feed it documents. In some cases
> (after an index size of 1.7GB, with 30-40 index segment .cfs files), the
> following is observed.
> The 'segments' file contains only zeros. Its size is 265 bytes - all bytes are zeros.
> The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes are zeros.
> Before corruption, the segments file and deleted file appear to be correct.
> After this corruption, the index is corrupted and lost.
> This is a problem observed in Lucene 1.4.3. We are not able to upgrade our
> customer deployments to 1.9 or later version, but would be happy to back-port
> a patch, if the patch is small enough and if this problem is already solved.
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546312 ] Michael McCandless commented on LUCENE-1044: {quote} When autoCommit is true, then we should periodically commit automatically. When autoCommit is false, then nothing should be committed until the IndexWriter is closed. The ambiguous case is flush(). I think the reason for exposing flush() was to permit folks to commit without closing, so I think flush() should commit too, but we could add a separate commit() method that flushes and commits. {quote} I think deprecating flush(), renaming it to commit(), and clarifying the semantics to mean that commit() flushes pending docs/deletes, commits a new segments_N, syncs all files referenced by this commit, and blocks until the sync is complete, would make sense? And, commit() would in fact commit even when autoCommit is false (flush() doesn't commit now when autoCommit=false, which is indeed confusing). {quote} Perhaps the semantics of autoCommit=true should be altered so that it commits less than every flush. Is that what you were proposing? If so, then I think it's a good solution. Prior to 2.2 the commit semantics were poorly defined. Folks were encouraged to close() their IndexWriter to persist changes, and that's about all we said. 2.2's docs say that things are committed at every flush, but there was no sync, so I don't think changing this could break any applications. So I'm +1 for changing autoCommit=true to sync less than every flush, e.g., only after merges. I'd also argue that we should be vague in the documentation about precisely when autoCommit=true commits. If someone needs to know exactly when things are committed then they should be encouraged to explicitly flush(), not to rely on autoCommit. {quote} OK, I will test the "sync only when committing a merge" approach for performance. 
Hopefully a foreground sync() is fine given that with ConcurrentMergePolicy that's already in a background thread. This would be a nice simplification. And I agree we should be vague about, and users should never rely on, precisely when Lucene has really committed (sync'd) the changes to disk. I'll fix the javadocs. > Behavior on hard power shutdown > --- > > Key: LUCENE-1044 > URL: https://issues.apache.org/jira/browse/LUCENE-1044 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java > 1.5 >Reporter: venkat rangan >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: FSyncPerfTest.java, LUCENE-1044.patch, > LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch > > > When indexing a large number of documents, upon a hard power failure (e.g. > pull the power cord), the index seems to get corrupted. We start a Java > application as an Windows Service, and feed it documents. In some cases > (after an index size of 1.7GB, with 30-40 index segment .cfs files) , the > following is observed. > The 'segments' file contains only zeros. Its size is 265 bytes - all bytes > are zeros. > The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes > are zeros. > Before corruption, the segments file and deleted file appear to be correct. > After this corruption, the index is corrupted and lost. > This is a problem observed in Lucene 1.4.3. We are not able to upgrade our > customer deployments to 1.9 or later version, but would be happy to back-port > a patch, if the patch is small enough and if this problem is already solved. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
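The durability question being debated in this thread comes down to one system call: write() and close() only hand bytes to the OS cache, and until they are fsync'd a hard power cut can leave the zero-filled files the reporter saw. A minimal Java sketch of that distinction (illustration only, not Lucene's actual commit code; class and method names are mine):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Illustration only: the sync step, not the write, is what makes a
// commit "permanent". Lucene's real commit logic is more involved.
public class SyncWrite {
    public static void writeDurably(File f, byte[] data) throws IOException {
        FileOutputStream out = new FileOutputStream(f);
        try {
            out.write(data);       // bytes now in the OS cache, not yet safe
            out.getFD().sync();    // block until the device actually has them
        } finally {
            out.close();
        }
    }
}
```

The benchmark numbers later in this thread measure exactly the cost of that extra sync() call on different filesystems.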
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546309 ] Michael McCandless commented on LUCENE-1044: I modified the CFS sync case to NOT bother syncing the files that go into the CFS. I also turned off syncing of segments.gen. I also tested on a Windows Server 2003 box. New patch attached (still a hack, just to test performance!) and new results. All tests are with the "sync every commit" policy: ||IO System||CFS sync||CFS nosync||CFS % slower||non-CFS sync||non-CFS nosync||non-CFS % slower|| |2-drive RAID0 Windows 2003 Server R2 Enterprise x64|250|244|2.6%|241|241|0.1%| |ReiserFS 6-drive RAID5 array Linux (2.6.22.1)|186|166|11.9%|145|142|2.0%| |EXT3 single internal drive Linux (2.6.22.1)|160|158|0.9%|142|135|4.8%| |4-drive RAID0 array Mac Pro (10.4 Tiger)|152|155|-2.4%|149|147|1.3%| |Win XP Pro laptop, single drive|408|398|2.6%|343|346|-1.1%| |Mac Pro single external drive|211|209|1.0%|167|149|12.4%| > Behavior on hard power shutdown > --- > > Key: LUCENE-1044 > URL: https://issues.apache.org/jira/browse/LUCENE-1044 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java > 1.5 >Reporter: venkat rangan >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: FSyncPerfTest.java, LUCENE-1044.patch, > LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch > > > When indexing a large number of documents, upon a hard power failure (e.g. > pulling the power cord), the index seems to get corrupted. We start a Java > application as a Windows Service and feed it documents. In some cases > (after an index size of 1.7GB, with 30-40 index segment .cfs files), the > following is observed. > The 'segments' file contains only zeros. Its size is 265 bytes - all bytes > are zeros. > The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes > are zeros. 
> Before corruption, the segments file and deleted file appear to be correct. > After this corruption, the index is corrupted and lost. > This is a problem observed in Lucene 1.4.3. We are not able to upgrade our > customer deployments to 1.9 or later version, but would be happy to back-port > a patch, if the patch is small enough and if this problem is already solved. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546306 ] Doug Cutting commented on LUCENE-1044: -- > But must every "automatic buffer flush" by IndexWriter really be a "permanent commit"? When autoCommit is true, then we should periodically commit automatically. When autoCommit is false, then nothing should be committed until the IndexWriter is closed. The ambiguous case is flush(). I think the reason for exposing flush() was to permit folks to commit without closing, so I think flush() should commit too, but we could add a separate commit() method that flushes and commits. > People who upgrade will suddenly get much worse performance. Yes, that would be bad. Perhaps the semantics of autoCommit=true should be altered so that it commits less than every flush. Is that what you were proposing? If so, then I think it's a good solution. Prior to 2.2 the commit semantics were poorly defined. Folks were encouraged to close() their IndexWriter to persist changes, and that's about all we said. 2.2's docs say that things are committed at every flush, but there was no sync, so I don't think changing this could break any applications. So I'm +1 for changing autoCommit=true to sync less than every flush, e.g., only after merges. I'd also argue that we should be vague in the documentation about precisely when autoCommit=true commits. If someone needs to know exactly when things are committed then they should be encouraged to explicitly flush(), not to rely on autoCommit. 
> Behavior on hard power shutdown > --- > > Key: LUCENE-1044 > URL: https://issues.apache.org/jira/browse/LUCENE-1044 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java > 1.5 >Reporter: venkat rangan >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: FSyncPerfTest.java, LUCENE-1044.patch, > LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch > > > When indexing a large number of documents, upon a hard power failure (e.g. > pull the power cord), the index seems to get corrupted. We start a Java > application as an Windows Service, and feed it documents. In some cases > (after an index size of 1.7GB, with 30-40 index segment .cfs files) , the > following is observed. > The 'segments' file contains only zeros. Its size is 265 bytes - all bytes > are zeros. > The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes > are zeros. > Before corruption, the segments file and deleted file appear to be correct. > After this corruption, the index is corrupted and lost. > This is a problem observed in Lucene 1.4.3. We are not able to upgrade our > customer deployments to 1.9 or later version, but would be happy to back-port > a patch, if the patch is small enough and if this problem is already solved. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1026) Provide a simple way to concurrently access a Lucene index from multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546304 ] Mark Miller commented on LUCENE-1026: - Hey Shai, These fixes are great and I will incorporate them all. I worked this up very quickly based on other, less general code I am using. While I have not yet used this code for a project, it will be the framework that I migrate to for future projects. It should see much more development then. I am very eager to add some Searcher warming, for one. Also, the tests were whipped together quite quickly. I appreciate your efforts at cleaning them up. Buffing up the SearchServer code to production level will also be on my list. Thanks for your improvements -- if you do any more work, keep me posted. - Mark > Provide a simple way to concurrently access a Lucene index from multiple > threads > > > Key: LUCENE-1026 > URL: https://issues.apache.org/jira/browse/LUCENE-1026 > Project: Lucene - Java > Issue Type: New Feature > Components: Index, Search >Reporter: Mark Miller >Priority: Minor > Attachments: DefaultIndexAccessor.java, > DefaultMultiIndexAccessor.java, IndexAccessor.java, > IndexAccessorFactory.java, MultiIndexAccessor.java, shai-IndexAccessor.zip, > SimpleSearchServer.java, StopWatch.java, TestIndexAccessor.java > > > For building interactive indexes accessed through a network/internet > (multiple threads). > This builds upon the LuceneIndexAccessor patch. That patch was not very > newbie-friendly and did not properly handle MultiSearchers (or at the least > made it easy to get into trouble). > This patch simplifies things and provides out-of-the-box support for sharing > the IndexAccessors across threads. There is also a simple test class and an > example SearchServer to get you started. > Future revisions will be zipped. > Works pretty solidly as is, but could use the ability to warm new Searchers. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1070) DateTools with DAY resolution doesn't work depending on your timezone
[ https://issues.apache.org/jira/browse/LUCENE-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546301 ] Mike Baroukh commented on LUCENE-1070: -- I agree that nobody is forced to use DateTools. I used my own version, of course. But the report is not for *me*; it's just because I thought it was a bug. I also know that for two identical dates, DateTools will return the same string. My case is this: I have Dates to index. When indexing, my Date objects contain hours and minutes. When searching, dates are typed by users without a time; they are parsed with a dd/MM/yyyy pattern. Because of the round() documentation, I thought there would be no problem since I use DAY resolution. RTFM is not always a good option. Finally, maybe it's not a bug but an architectural issue: when a long is used for the date, the timezone is lost. I continue to think that dateToString() must take a Date as its parameter; this way, there would be no more ambiguity. > DateTools with DAY resolution doesn't work depending on your timezone > --- > > Key: LUCENE-1070 > URL: https://issues.apache.org/jira/browse/LUCENE-1070 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 2.2 >Reporter: Mike Baroukh > > Hi. > There is another issue, closed, that introduced a bug: > https://issues.apache.org/jira/browse/LUCENE-491 > Here is a simple TestCase: > DateFormat df = new SimpleDateFormat("dd/MM/yyyy HH:mm"); > Date d1 = df.parse("10/10/2008 10:00"); > System.err.println(DateTools.dateToString(d1, Resolution.DAY)); > Date d2 = df.parse("10/10/2008 00:00"); > System.err.println(DateTools.dateToString(d2, Resolution.DAY)); > this outputs: > 20081010 > 20081009 > So the days are the same, but with DAY resolution the indexed values don't > refer to the same day. > This is because of DateTools.round(): using a Calendar initialised to GMT > can make the given Date fall on the previous day, depending on my timezone. 
> The part I don't understand is why we take a Date as input, then convert > it to a Calendar, then convert it again before printing. > This operation is supposed to "round" the date, but simply using a DateFormat to > format the date and print only the wanted fields does the same work, doesn't it? > The problem is: I see absolutely no solution at the moment. We could have a > workaround if dateToString() took a Date as its input, but with a long, the > timezone is lost. > I also suppose that the correction made on the other issue > (https://issues.apache.org/jira/browse/LUCENE-491) is worse than the bug, > because it corrects things only for those who use dates with a timezone different from > the local timezone of the JVM. > So, my solution: add a DateTools.dateToString() that takes a Date as a > parameter and deprecate the version that uses a long. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
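The behaviour Mike describes can be reproduced without DateTools at all: truncating an instant to a day using a GMT calendar can land local midnight on the previous GMT day. A self-contained sketch (helper names are mine, and a fixed GMT+2 offset stands in for a local timezone east of UTC):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DayRounding {
    // Format the instant's day in GMT, mimicking a round(date, DAY)
    // implemented with a GMT-initialised Calendar.
    static String dayInGmt(Date d) {
        SimpleDateFormat f = new SimpleDateFormat("yyyyMMdd");
        f.setTimeZone(TimeZone.getTimeZone("GMT"));
        return f.format(d);
    }

    // Format the same instant's day in the user's own timezone.
    static String dayInZone(Date d, TimeZone tz) {
        SimpleDateFormat f = new SimpleDateFormat("yyyyMMdd");
        f.setTimeZone(tz);
        return f.format(d);
    }

    public static void main(String[] args) throws ParseException {
        TimeZone local = TimeZone.getTimeZone("GMT+2"); // east of UTC
        SimpleDateFormat parser = new SimpleDateFormat("dd/MM/yyyy HH:mm");
        parser.setTimeZone(local);
        Date midnight = parser.parse("10/10/2008 00:00"); // local midnight
        // 10/10 00:00 at GMT+2 is 09/10 22:00 in GMT, so the GMT day shifts:
        System.out.println(dayInGmt(midnight));         // 20081009
        System.out.println(dayInZone(midnight, local)); // 20081010
    }
}
```

This is the core of the disagreement in the thread: the indexed day string is computed in GMT, while the user parsed the date in a local zone.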
[jira] Updated: (LUCENE-1026) Provide a simple way to concurrently access a Lucene index from multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1026: --- Attachment: IndexAccessorFactory.java > Provide a simple way to concurrently access a Lucene index from multiple > threads > > > Key: LUCENE-1026 > URL: https://issues.apache.org/jira/browse/LUCENE-1026 > Project: Lucene - Java > Issue Type: New Feature > Components: Index, Search >Reporter: Mark Miller >Priority: Minor > Attachments: DefaultIndexAccessor.java, > DefaultMultiIndexAccessor.java, IndexAccessor.java, > IndexAccessorFactory.java, MultiIndexAccessor.java, shai-IndexAccessor.zip, > SimpleSearchServer.java, StopWatch.java, TestIndexAccessor.java > > > For building interactive indexes accessed through a network/internet > (multiple threads). > This builds upon the LuceneIndexAccessor patch. That patch was not very > newbie friendly and did not properly handle MultiSearchers (or at the least > made it easy to get into trouble). > This patch simplifies things and provides out of the box support for sharing > the IndexAccessors across threads. There is also a simple test class and > example SearchServer to get you started. > Future revisions will be zipped. > Works pretty solid as is, but could use the ability to warm new Searchers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1026) Provide a simple way to concurrently access a Lucene index from multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1026: --- Attachment: shai-IndexAccessor.zip > Provide a simple way to concurrently access a Lucene index from multiple > threads > > > Key: LUCENE-1026 > URL: https://issues.apache.org/jira/browse/LUCENE-1026 > Project: Lucene - Java > Issue Type: New Feature > Components: Index, Search >Reporter: Mark Miller >Priority: Minor > Attachments: DefaultIndexAccessor.java, > DefaultMultiIndexAccessor.java, IndexAccessor.java, > IndexAccessorFactory.java, MultiIndexAccessor.java, shai-IndexAccessor.zip, > SimpleSearchServer.java, StopWatch.java, TestIndexAccessor.java > > > For building interactive indexes accessed through a network/internet > (multiple threads). > This builds upon the LuceneIndexAccessor patch. That patch was not very > newbie friendly and did not properly handle MultiSearchers (or at the least > made it easy to get into trouble). > This patch simplifies things and provides out of the box support for sharing > the IndexAccessors across threads. There is also a simple test class and > example SearchServer to get you started. > Future revisions will be zipped. > Works pretty solid as is, but could use the ability to warm new Searchers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Updated: (LUCENE-1026) Provide a simple way to concurrently access a Lucene index from multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1026: --- Attachment: (was: IndexAccessorFactory.java) > Provide a simple way to concurrently access a Lucene index from multiple > threads > > > Key: LUCENE-1026 > URL: https://issues.apache.org/jira/browse/LUCENE-1026 > Project: Lucene - Java > Issue Type: New Feature > Components: Index, Search >Reporter: Mark Miller >Priority: Minor > Attachments: DefaultIndexAccessor.java, > DefaultMultiIndexAccessor.java, IndexAccessor.java, > IndexAccessorFactory.java, MultiIndexAccessor.java, shai-IndexAccessor.zip, > SimpleSearchServer.java, StopWatch.java, TestIndexAccessor.java > > > For building interactive indexes accessed through a network/internet > (multiple threads). > This builds upon the LuceneIndexAccessor patch. That patch was not very > newbie friendly and did not properly handle MultiSearchers (or at the least > made it easy to get into trouble). > This patch simplifies things and provides out of the box support for sharing > the IndexAccessors across threads. There is also a simple test class and > example SearchServer to get you started. > Future revisions will be zipped. > Works pretty solid as is, but could use the ability to warm new Searchers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1026) Provide a simple way to concurrently access a Lucene index from multiple threads
[ https://issues.apache.org/jira/browse/LUCENE-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546288 ] Shai Erera commented on LUCENE-1026: Hi, I've downloaded the code and tried to run the tests, but I think there are some problems: 1. The delete() method in the test attempts to delete the directory, and not the underlying files, so in effect it does not do anything. 2. Some of the tests that start new threads don't wait for them (by calling join()). That of course causes some Accessors to be removed (after you call closeAllAccessors()) while those threads are still running. I've fixed those issues in the test; I'd appreciate it if you could take a look. Also, in IndexAccessorFactory I've found some issues: 1. I guess you wanted to have it as a Singleton - so I defined a private default constructor to prevent applications from instantiating it. 2. I modified the code of createAccessor to first look up whether an accessor for that directory already exists; this should save the allocation of a DefaultIndexAccessor. 3. I modified the implementation of the other methods to access the HashMap of accessors more efficiently. I'd appreciate it if you could review my fixes. I'll attach them separately. > Provide a simple way to concurrently access a Lucene index from multiple > threads > > > Key: LUCENE-1026 > URL: https://issues.apache.org/jira/browse/LUCENE-1026 > Project: Lucene - Java > Issue Type: New Feature > Components: Index, Search >Reporter: Mark Miller >Priority: Minor > Attachments: DefaultIndexAccessor.java, > DefaultMultiIndexAccessor.java, IndexAccessor.java, > IndexAccessorFactory.java, MultiIndexAccessor.java, SimpleSearchServer.java, > StopWatch.java, TestIndexAccessor.java > > > For building interactive indexes accessed through a network/internet > (multiple threads). > This builds upon the LuceneIndexAccessor patch. 
That patch was not very > newbie friendly and did not properly handle MultiSearchers (or at the least > made it easy to get into trouble). > This patch simplifies things and provides out of the box support for sharing > the IndexAccessors across threads. There is also a simple test class and > example SearchServer to get you started. > Future revisions will be zipped. > Works pretty solid as is, but could use the ability to warm new Searchers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
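Shai's two factory fixes (a private constructor to enforce the singleton, and a lookup before allocating a new accessor) have a standard shape; the sketch below is a placeholder illustration with invented names, not the IndexAccessorFactory attached to the issue:

```java
import java.util.HashMap;
import java.util.Map;

// Placeholder sketch of the singleton-factory pattern discussed above.
public class AccessorFactory {
    private static final AccessorFactory INSTANCE = new AccessorFactory();
    private final Map<String, Object> accessors = new HashMap<String, Object>();

    private AccessorFactory() {} // private: callers must use getInstance()

    public static AccessorFactory getInstance() {
        return INSTANCE;
    }

    // Check for an existing accessor first, so construction cost is paid
    // only the first time a directory is seen; synchronized so concurrent
    // callers for the same directory share one instance.
    public synchronized Object getAccessor(String dirPath) {
        Object a = accessors.get(dirPath);
        if (a == null) {
            a = new Object(); // stand-in for new DefaultIndexAccessor(dirPath)
            accessors.put(dirPath, a);
        }
        return a;
    }
}
```

The same lookup-before-allocate idea is what Shai's createAccessor change applies to the real factory.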
[jira] Commented: (LUCENE-1070) DateTools with DAY resolution doesn't work depending on your timezone
[ https://issues.apache.org/jira/browse/LUCENE-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546284 ] Alexei Dets commented on LUCENE-1070: - I'm not a Lucene developer but just wanted to comment from a user perspective - I find the current Lucene behavior 100% correct and this bug report mistaken. First of all, AFAIK this doesn't have anything to do with DAY precision - with any higher precision one can also get another day (and another hour); this is just how timezones work: conversion from one timezone to another changes the time. But then during the search one should also use DateTools.dateToString, and the search will work correctly. And after applying DateTools.stringToDate on the search results you'll get the correct dates. Search with DAY precision searches for the given day in the UTC timezone, not in a local one; if that is not sufficient for your purposes then you should use HOUR precision during indexing and search - DAY is simply not precise enough for your purposes. Another alternative that should probably work (I never tried it) is to create the Date (that you pass to DateTools.dateToString) in the UTC timezone; in this case no timezone conversion should be applied, unless there is some bug in DateTools, and you'll get exactly the same day indexed (but then, on retrieving results, DateTools.stringToDate will change your day because it will apply the local timezone). And after all, nobody is forced to use DateTools; anyone can implement their own way to store dates if no timezone conversions are required - it is probably the best way for this _specific_ case. > DateTools with DAY resolution doesn't work depending on your timezone > --- > > Key: LUCENE-1070 > URL: https://issues.apache.org/jira/browse/LUCENE-1070 > Project: Lucene - Java > Issue Type: Bug >Affects Versions: 2.2 >Reporter: Mike Baroukh > > Hi. 
> There is another issue, closed, that introduced a bug: > https://issues.apache.org/jira/browse/LUCENE-491 > Here is a simple TestCase: > DateFormat df = new SimpleDateFormat("dd/MM/yyyy HH:mm"); > Date d1 = df.parse("10/10/2008 10:00"); > System.err.println(DateTools.dateToString(d1, Resolution.DAY)); > Date d2 = df.parse("10/10/2008 00:00"); > System.err.println(DateTools.dateToString(d2, Resolution.DAY)); > this outputs: > 20081010 > 20081009 > So the days are the same, but with DAY resolution the indexed values don't > refer to the same day. > This is because of DateTools.round(): using a Calendar initialised to GMT > can make the given Date fall on the previous day, depending on my timezone. > The part I don't understand is why we take a Date as input, then convert > it to a Calendar, then convert it again before printing. > This operation is supposed to "round" the date, but simply using a DateFormat to > format the date and print only the wanted fields does the same work, doesn't it? > The problem is: I see absolutely no solution at the moment. We could have a > workaround if dateToString() took a Date as its input, but with a long, the > timezone is lost. > I also suppose that the correction made on the other issue > (https://issues.apache.org/jira/browse/LUCENE-491) is worse than the bug, > because it corrects things only for those who use dates with a timezone different from > the local timezone of the JVM. > So, my solution: add a DateTools.dateToString() that takes a Date as a > parameter and deprecate the version that uses a long. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1058: Attachment: LUCENE-1058.patch Tee it is. And here I just thought you liked golf! I guess I have never used the tee command in UNIX. > New Analyzer for buffering tokens > - > > Key: LUCENE-1058 > URL: https://issues.apache.org/jira/browse/LUCENE-1058 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, > LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch > > > In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that > could siphon off certain tokens and store them in a buffer to be used later > in the processing pipeline. > For example, if you want to have two fields, one lowercased and one not, but > all the other analysis is the same, then you could save off the tokens to be > output for a different field. > Patch to follow, but I am still not sure about a couple of things, mostly how > it plays with the new reuse API. > See > http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1044) Behavior on hard power shutdown
[ https://issues.apache.org/jira/browse/LUCENE-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546277 ] Michael McCandless commented on LUCENE-1044: {quote} I'm confused. The semantics of commit should be that all changes prior are made permanent, and no subsequent changes are permanent until the next commit. So syncs, if any, should map 1:1 to commits, no? Folks can make indexing faster by committing/syncing less often. {quote} But must every "automatic buffer flush" by IndexWriter really be a "permanent commit"? I do agree that when you close an IndexWriter, we should do a "permanent commit" (and block until it's done). Even if we use that policy, the BG sync thread can still fall behind such that the last few/many flushes are still in the process of being made permanent (e.g. I see this happening while a merge is running). In fact, I'll have to block further flushes if syncing falls "too far" behind, by some metric. So we already won't have any "guarantee" on when a given flush actually becomes permanent, even if we adopt this policy. I think "merge finished" should be made a "permanent commit" because otherwise we are temporarily tying up potentially a lot of disk space. But for a flush there's only a tiny amount of space (the old segments_N files) being tied up. Maybe we could make some flushes permanent but not all, depending on how far behind the sync thread is. E.g. if you do a flush, but the sync thread is still trying to make the last flush permanent, don't force the new flush to be permanent? In general, the longer we can wait after flushing before forcing the OS to make those writes "permanent", the better the chance that the OS has in fact already sync'd those files anyway, and so the sync cost should be lower. So maybe we could make every flush permanent, but wait a little while before doing so? 
Regardless of what policy we choose here (which commits must be made "permanent", and, when) I think the approach requires that IndexFileDeleter query the Directory so that it's only allowed to delete older commit points once a newer commit point has successfully become permanent. I also worry about those applications that are accidentally flushing too often now. Say your app now sets maxBufferedDocs=100. Right now, that gives you poor performance but not disastrous, but I fear if we do the "every commit is permanent" policy then performance could easily become disastrous. People who upgrade will suddenly get much worse performance. > Behavior on hard power shutdown > --- > > Key: LUCENE-1044 > URL: https://issues.apache.org/jira/browse/LUCENE-1044 > Project: Lucene - Java > Issue Type: Bug > Components: Index > Environment: Windows Server 2003, Standard Edition, Sun Hotspot Java > 1.5 >Reporter: venkat rangan >Assignee: Michael McCandless > Fix For: 2.3 > > Attachments: FSyncPerfTest.java, LUCENE-1044.patch, > LUCENE-1044.take2.patch, LUCENE-1044.take3.patch, LUCENE-1044.take4.patch > > > When indexing a large number of documents, upon a hard power failure (e.g. > pull the power cord), the index seems to get corrupted. We start a Java > application as an Windows Service, and feed it documents. In some cases > (after an index size of 1.7GB, with 30-40 index segment .cfs files) , the > following is observed. > The 'segments' file contains only zeros. Its size is 265 bytes - all bytes > are zeros. > The 'deleted' file also contains only zeros. Its size is 85 bytes - all bytes > are zeros. > Before corruption, the segments file and deleted file appear to be correct. > After this corruption, the index is corrupted and lost. > This is a problem observed in Lucene 1.4.3. We are not able to upgrade our > customer deployments to 1.9 or later version, but would be happy to back-port > a patch, if the patch is small enough and if this problem is already solved. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
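Michael's idea of blocking further flushes once the background sync falls "too far" behind maps naturally onto a bounded queue. This sketch is hypothetical (class name and capacity are my invention, not Lucene code), but it shows the throttling shape:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: pending commit points queue up for a background
// sync thread; once the backlog reaches CAPACITY, enqueueCommit() blocks,
// throttling flushes until the syncer catches up.
public class SyncBacklog {
    static final int CAPACITY = 4;
    private final BlockingQueue<String> pending =
            new ArrayBlockingQueue<String>(CAPACITY);

    public void enqueueCommit(String segmentsFile) throws InterruptedException {
        pending.put(segmentsFile); // blocks when the syncer is too far behind
    }

    public String takeForSync() throws InterruptedException {
        return pending.take(); // the background thread drains and fsyncs each commit
    }

    public int backlog() {
        return pending.size();
    }
}
```

The bounded capacity is the "some metric" mentioned above: producers (flushes) simply cannot outrun the consumer (the sync thread) by more than a fixed number of commit points.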
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546274 ] Yonik Seeley commented on LUCENE-1058: -- The SinkTokenizer name could make sense, but I think TeeTokenFilter makes more sense than SourceTokenFilter (it is a tee, it splits a single token stream into two, just like the UNIX tee command). > New Analyzer for buffering tokens > - > > Key: LUCENE-1058 > URL: https://issues.apache.org/jira/browse/LUCENE-1058 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, > LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch > > > In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that > could siphon off certain tokens and store them in a buffer to be used later > in the processing pipeline. > For example, if you want to have two fields, one lowercased and one not, but > all the other analysis is the same, then you could save off the tokens to be > output for a different field. > Patch to follow, but I am still not sure about a couple of things, mostly how > it plays with the new reuse API. > See > http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
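The tee analogy works exactly like the UNIX command: one incoming token stream, two consumers. A Lucene-free sketch of the idea (class names and the uppercase-token condition are illustrative only, not the patch's API):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative tee: every token flows downstream unchanged, and tokens
// matching a condition are additionally copied into a "sink" buffer for
// later use (e.g. feeding a second field).
public class TeeDemo {
    interface TokenFilter { boolean accept(String token); }

    static List<String> tee(Iterator<String> tokens, TokenFilter siphon,
                            List<String> sink) {
        List<String> downstream = new ArrayList<String>();
        while (tokens.hasNext()) {
            String t = tokens.next();
            if (siphon.accept(t)) {
                sink.add(t);     // copy into the sink buffer...
            }
            downstream.add(t);   // ...but still pass the token along
        }
        return downstream;
    }
}
```

This is the source/sink split the patch names SinkTokenizer and TeeTokenFilter: the tee does the copying, the sink buffers the siphoned tokens for a later pipeline stage.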
[jira] Updated: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1058: Attachment: LUCENE-1058.patch Whew. I think we are there and I like it! I renamed Yonik's suggestions to be SinkTokenizer and SourceTokenFilter to model the whole source/sink notion. Hopefully people won't think the SourceTokenFilter is for processing code. :-) I will commit tomorrow if there are no objections. > New Analyzer for buffering tokens > - > > Key: LUCENE-1058 > URL: https://issues.apache.org/jira/browse/LUCENE-1058 > Project: Lucene - Java > Issue Type: Improvement > Components: Analysis >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Fix For: 2.3 > > Attachments: LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch, > LUCENE-1058.patch, LUCENE-1058.patch, LUCENE-1058.patch > > > In some cases, it would be handy to have Analyzer/Tokenizer/TokenFilters that > could siphon off certain tokens and store them in a buffer to be used later > in the processing pipeline. > For example, if you want to have two fields, one lowercased and one not, but > all the other analysis is the same, then you could save off the tokens to be > output for a different field. > Patch to follow, but I am still not sure about a couple of things, mostly how > it plays with the new reuse API. > See > http://www.gossamer-threads.com/lists/lucene/java-dev/54397?search_string=BufferingAnalyzer;#54397 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
[jira] Commented: (LUCENE-1058) New Analyzer for buffering tokens
[ https://issues.apache.org/jira/browse/LUCENE-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546255 ]

Grant Ingersoll commented on LUCENE-1058:
-----------------------------------------

Will do. Patch to follow shortly.
[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546236 ]

Mark Miller commented on LUCENE-794:
------------------------------------

Michael: I would love to take a look. I've got the code you sent me and I will
go through it soon.

Mark: That is an issue that should probably be cleaned up. A lot of tests are
shared; the new SpanScorer just requires some different, odd setup that made it
easier to copy and change the test file. I will spend some time trying to
combine them into one test file to avoid the overlap.

> Extend contrib Highlighter to properly support phrase queries and span queries
> ------------------------------------------------------------------------------
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Other
> Reporter: Mark Miller
> Priority: Minor
> Attachments: spanhighlighter.patch, spanhighlighter10.patch,
> spanhighlighter11.patch, spanhighlighter12.patch, spanhighlighter2.patch,
> spanhighlighter3.patch, spanhighlighter5.patch, spanhighlighter6.patch,
> spanhighlighter7.patch, spanhighlighter8.patch, spanhighlighter9.patch,
> spanhighlighter_patch_4.zip
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter
> package that scores just like QueryScorer, but scores a 0 for Terms that did
> not cause the Query hit. This gives 'actual' hit highlighting for the range
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.
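The behavior the SpanQueryScorer adds can be illustrated with a toy,
Lucene-free sketch: for a phrase query, only terms sitting inside an actual
occurrence of the phrase are highlighted, not every standalone occurrence of a
query term. This is a hand-rolled illustration of the idea, not the patch's
code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpanHighlightSketch {
    /**
     * Wraps in <b>...</b> only the tokens that fall inside an actual
     * occurrence of the phrase; a term that merely matches a query word
     * gets a "score of 0", i.e. no markup.
     */
    public static List<String> highlightPhrase(List<String> tokens, List<String> phrase) {
        boolean[] hit = new boolean[tokens.size()];
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + phrase.size()).equals(phrase)) {
                for (int j = i; j < i + phrase.size(); j++) hit[j] = true;
            }
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(hit[i] ? "<b>" + tokens.get(i) + "</b>" : tokens.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("lucene", "in", "action", "is", "in", "print");
        // Only the "in action" occurrence is highlighted; the lone "in" later is not.
        System.out.println(highlightPhrase(tokens, Arrays.asList("in", "action")));
        // [lucene, <b>in</b>, <b>action</b>, is, in, print]
    }
}
```

A plain term-level scorer (the old QueryScorer behavior) would also have
marked the second "in", which is exactly what the patch avoids.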
[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546181 ]

Mark Harwood commented on LUCENE-794:
-------------------------------------

Committing it makes sense to me. I want to spend some time reviewing this in
more detail once I'm through with contributing the new web-based version of
Luke. At a quick glance, does the new JUnit test in this patch encompass both
the old and new Highlighter tests? If so, should we remove the old JUnit test
where they overlap?
[jira] Updated: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-584:
---------------------------------

Attachment: lucene-584.patch

OK, here's a patch that compiles cleanly on current trunk, and all tests pass.
It includes:
- all changes from Matcher-20071122-1ground.patch
- util/BitSetMatcher.java from Matcher-20070905-2default.patch
- Hits.java changes from Matcher-20070905-3core.patch
- Filter#getMatcher() returns the BitSetMatcher

Would you be up for providing testcases? As I said, I haven't fully reviewed
the patch, but I'm planning to do that soon. I can vouch that all tests pass
after applying the patch.

> Decouple Filter from BitSet
> ---------------------------
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.0.1
> Reporter: Peter Schäfer
> Assignee: Michael Busch
> Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, lucene-584.patch,
> Matcher-20070905-2default.patch, Matcher-20070905-3core.patch,
> Matcher-20071122-1ground.patch, Some Matchers.zip
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of
> memory. It would be desirable to have an alternative BitSet implementation
> with a smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation
> could still delegate to =java.util.BitSet=.
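The decoupling proposed in the issue description is easy to sketch: hide the
bit test behind an interface, and let a sparse implementation back it when few
documents match. The names below (Bits, SortedIntBits) are hypothetical,
chosen only to show the memory win for a restrictive security filter:

```java
import java.util.Arrays;

public class SparseBitsSketch {
    /** Interface in the spirit of the proposed AbstractBitSet. */
    public interface Bits {
        boolean get(int index);
    }

    /** Sparse implementation: stores only the set document ids. */
    public static class SortedIntBits implements Bits {
        private final int[] docs; // kept sorted for binary search

        public SortedIntBits(int... docs) {
            this.docs = docs.clone();
            Arrays.sort(this.docs);
        }

        public boolean get(int index) {
            return Arrays.binarySearch(docs, index) >= 0;
        }
    }

    public static void main(String[] args) {
        // A filter over a 10M-doc index that admits only three documents
        // costs three ints here, versus ~1.25 MB of words in java.util.BitSet.
        Bits bits = new SortedIntBits(7, 4_000_000, 9_999_999);
        System.out.println(bits.get(7)); // true
        System.out.println(bits.get(8)); // false
    }
}
```

The default implementation could still delegate to java.util.BitSet, exactly
as the description suggests, since callers only ever see the interface.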
[jira] Assigned: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch reassigned LUCENE-584:
------------------------------------

Assignee: Michael Busch
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546166 ]

Paul Elschot commented on LUCENE-584:
-------------------------------------

The patch is backwards compatible, except for current subclasses of Filter
that already have a getMatcher method. The fact that no changes are needed to
contrib confirms the compatibility.

I have made no performance tests on BitSetMatcher, for two reasons. The first
is that OpenBitSet is actually faster than BitSet (have a look at the graph in
the Some Matchers.zip attachment by Eks Dev), so it seems better to go in that
direction. The second is that it is easy to do the skipping in IndexSearcher
on a BitSet directly by using nextSetBit on the BitSet instead of skipTo on
the BitSetMatcher. For this it would only be necessary to check whether the
given MatchFilter is a Filter. Anyway, I prefer to see where the real
performance bottlenecks are before optimizing for performance.

DefaultMatcher should be in the ...2default... patch. The change in Hits to
use MatchFilter should be in the ...3core... patch. So far I have never tried
to use these patches on their own; I have only split them for a better
overview. Splitting the combined patches to iterate would need a different
split, as you found out. It might even be necessary to split within a single
class, but I'll gladly do that.
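Paul's point about skipping on the BitSet directly uses java.util.BitSet's
real nextSetBit method, which jumps from one set bit to the next without
testing every index in between. A minimal sketch of that iteration style:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class NextSetBitSketch {
    /** Collects matching doc ids by skipping from one set bit to the next. */
    public static List<Integer> matchingDocs(BitSet bits) {
        List<Integer> docs = new ArrayList<>();
        // The skipTo-style iteration done on the BitSet itself: nextSetBit
        // jumps over runs of clear bits instead of probing every index.
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(3);
        bits.set(17);
        bits.set(42);
        System.out.println(matchingDocs(bits)); // [3, 17, 42]
    }
}
```

This is why a BitSet-backed filter can be special-cased in IndexSearcher, as
Paul suggests, without going through a Matcher wrapper at all.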