[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-08-09 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518642
 ] 

Mark Harwood commented on LUCENE-584:
-

Some further thought on the roles/responsibilities of the various components:

Given a blank sheet of paper (a luxury we may not have) the minimum 
requirements I would have could be met with the following:
(note that use of the words "Matcher" and "Filter" etc have been removed 
because sets of doc IDs have applications outside of filtering/querying e.g. 
category counts)

interface DocIdSetFactory
{
  DocIdSet getDocIdSet(IndexReader reader);
}
This is more or less equivalent to the purpose of the existing "Filter" - 
different implementations define their own selection criteria and produce a set 
of matching doc IDs e.g. equivalent of RangeFilter. Each implementation must 
implement "hashCode" and "equals" methods based on its criteria so the factory 
can be cached and reused (in the same way Query objects are expected to). The 
existing CachedFilterBuilder in the XMLQueryParser provides one example of a 
strategy for caching Filters using this facility. 


interface DocIdSet
{
DocIdSetIterator getIterator();
}
This interface defines an immutable, threadsafe (and therefore cachable) 
collection of doc IDs. Different implementations provide space-efficient 
alternatives for sparse or heavily populated sets e.g. BitSet, OpenBitSet, 
SortedVIntList. As an example caching strategy - the existing 
CachingWrapperFilter would cache these objects in a WeakHashMap keyed on 
IndexReader.

interface DocIdSetIterator
{
  boolean next();
  int getDoc();
  // etc.
}
A thread-unsafe, single-use object (probably with only one implementation) 
that is used to iterate across any DocIdSet. Not cachable and used by Scorers.

In the existing proposal it feels like DocIdSet and DocIdSetIterator are rolled 
into one in the form of the Matcher which complicates/prevents caching 
strategies.
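
To make the split concrete, here is a minimal sketch of the three roles with a java.util.BitSet backing. It is only an illustration under assumptions: ReaderStub stands in for IndexReader so the code compiles on its own, and BitSetDocIdSet and EveryNthDocFactory are hypothetical names that do not exist in any patch.

```java
import java.util.BitSet;

// Stand-in for Lucene's IndexReader so the sketch compiles on its own.
interface ReaderStub {
  int maxDoc();
}

// Single-use, not thread-safe cursor over a set of doc ids.
interface DocIdSetIterator {
  boolean next();  // advances; returns false once the set is exhausted
  int getDoc();    // current doc id, valid only after next() returned true
}

// Immutable, thread-safe (hence cachable) collection of doc ids.
interface DocIdSet {
  DocIdSetIterator getIterator();
}

// Selection criteria; implementations define equals/hashCode on the criteria.
interface DocIdSetFactory {
  DocIdSet getDocIdSet(ReaderStub reader);
}

// Hypothetical BitSet-backed set: the bits are copied once and never mutated.
final class BitSetDocIdSet implements DocIdSet {
  private final BitSet bits;

  BitSetDocIdSet(BitSet source) {
    this.bits = (BitSet) source.clone();
  }

  public DocIdSetIterator getIterator() {
    // Each caller gets its own cursor; the shared bits are only read.
    return new DocIdSetIterator() {
      private int doc = -1;
      private boolean exhausted = false;

      public boolean next() {
        if (exhausted) return false;
        doc = bits.nextSetBit(doc + 1);
        exhausted = doc < 0;
        return !exhausted;
      }

      public int getDoc() {
        return doc;
      }
    };
  }
}

// Hypothetical criteria: every n-th document of the reader.
final class EveryNthDocFactory implements DocIdSetFactory {
  private final int n;

  EveryNthDocFactory(int n) {
    if (n <= 0) throw new IllegalArgumentException("n must be positive");
    this.n = n;
  }

  public DocIdSet getDocIdSet(ReaderStub reader) {
    BitSet bits = new BitSet(reader.maxDoc());
    for (int doc = 0; doc < reader.maxDoc(); doc += n) bits.set(doc);
    return new BitSetDocIdSet(bits);
  }

  @Override public boolean equals(Object o) {
    return o instanceof EveryNthDocFactory && ((EveryNthDocFactory) o).n == n;
  }

  @Override public int hashCode() {
    return n;
  }
}
```

The set can be shared between threads because only the iterator carries mutable state, and two factories with the same criteria compare equal, which is what a cache needs in order to recognise a repeated request.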

Cheers
Mark




> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.0.1
>Reporter: Peter Schäfer
>Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, 
> Matcher1-ground-20070730.patch, Matcher2-default-20070730.patch, 
> Matcher3-core-20070730.patch, Matcher4-contrib-misc-20070730.patch, 
> Matcher5-contrib-queries-20070730.patch, Matcher6-contrib-xml-20070730.patch, 
> Some Matchers.zip
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-966) A faster JFlex-based replacement for StandardAnalyzer

2007-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518661
 ] 

Michael McCandless commented on LUCENE-966:
---

> > * I removed StandardAnalyzer.html "grammar doc" generation from
> >   build.xml since it was using jjdoc. Stanislaw, is there something
> >   in jflex that can generate a BNF description of the grammar as
> >   HTML?

> I've had a look at the JFlex docs and there doesn't seem to be such
> a tool for JFlex, I'm afraid.

OK, I think we can just live without this.  Thanks!

> A faster JFlex-based replacement for StandardAnalyzer
> -
>
> Key: LUCENE-966
> URL: https://issues.apache.org/jira/browse/LUCENE-966
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Analysis
>Reporter: Stanislaw Osinski
> Fix For: 2.3
>
> Attachments: AnalyzerBenchmark.java, jflex-analyzer-patch.txt, 
> jflex-analyzer-r560135-patch.txt, jflex-analyzer-r561292-patch.txt, 
> jflex-analyzer-r561693-compatibility.txt, 
> jflex-analyzer-r562378-patch-nodup.txt, jflex-analyzer-r562378-patch.txt
>
>
> JFlex (http://www.jflex.de/) can be used to generate a faster (up to several 
> times) replacement for StandardAnalyzer. Will add a patch and a simple 
> benchmark code in a while.




[jira] Commented: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

2007-08-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518663
 ] 

Michael McCandless commented on LUCENE-971:
---

Super, new patch looks good.  I will commit!

> Create enwiki indexable data as line-per-article rather than file-per-article
> -
>
> Key: LUCENE-971
> URL: https://issues.apache.org/jira/browse/LUCENE-971
> Project: Lucene - Java
>  Issue Type: Improvement
>Reporter: Steven Parkes
>Assignee: Steven Parkes
> Attachments: LUCENE-971.patch.txt, LUCENE-971.patch.txt, 
> LUCENE-971.patch.txt
>
>
> Create a line per article rather than a file. Consume with indexLineFile task.




[jira] Resolved: (LUCENE-971) Create enwiki indexable data as line-per-article rather than file-per-article

2007-08-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-971.
---

Resolution: Fixed


OK, committed with these small changes:

  * Replaced conf/wikipedia.alg -> conf/extractWikipedia.alg in the
    comment in that file.

  * Moved doc.maker line up under the "# Where to get documents from:"
    comment.

  * In build.xml, removed the extract-enwiki target so that "ant
    enwiki" does the right thing.

Thanks Steve!





Re: [Lucene-java Wiki] Update of "SandraVerso" by SandraVerso

2007-08-09 Thread Grant Ingersoll
You know, I was wondering the same thing, but then decided to let it  
go.  This person did add to the Powered By page as well with what  
looks to be legitimate sites, and the Wiki does allow you to create a  
personal page, so I don't think they are doing anything out of line.   
I guess it is something to be monitored.


-Grant

On Aug 9, 2007, at 2:49 AM, Doron Cohen wrote:


Is this new page spam or is it just me?

Doron

Apache Wiki <[EMAIL PROTECTED]> wrote on 06/08/2007 00:25:25:


Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-
java Wiki" for change notification.

The following page has been changed by SandraVerso:
http://wiki.apache.org/lucene-java/SandraVerso


-- 




  I am just another web developer. These are some projects of mine

  [http://www.onlinespiele-1.de Spiele] - website  with free games
+ [[BR]]
  [http://www.testcity.de Tests] - website with entertainment tests
+ [[BR]]
  [http://www.verwandt.de Stammbaum] - online family tree software







--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






why the mixed scopes?

2007-08-09 Thread karl wettin
Is there a reason for writers to be created via the constructor and  
the readers to be created via the static scope?


Would it not be much more beautiful and object-oriented to handle
this using factory methods in Directory? Or perhaps even one layer
further down? The patch below will allow for alternative stores
(read: LUCENE-550) without changing more than a single line of code
(given that the code uses the factory methods).


Also, it would make it much simpler to augment Lucene with decorative
layers such as notification schemes, etc.


According to /me/ the static scope is a violation of pretty much  
everything.



Index: src/java/org/apache/lucene/store/Index.java
===
--- src/java/org/apache/lucene/store/Index.java (revision 0)
+++ src/java/org/apache/lucene/store/Index.java (revision 0)
@@ -0,0 +1,35 @@
+package org.apache.lucene.store;
+
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.analysis.Analyzer;
+
+import java.io.IOException;
+
+/**
+ * A top level store class with factory methods for reader and writer.
+ */
+public abstract class Index {
+
+  public abstract IndexReader indexReaderFactory() throws IOException;
+  public IndexWriter indexWriterFactory(Analyzer analyzer) throws IOException {
+    return indexWriterFactory(analyzer, false);
+  }
+  public abstract IndexWriter indexWriterFactory(Analyzer analyzer, boolean create) throws IOException;
+
+}
Index: src/java/org/apache/lucene/store/Directory.java
===
--- src/java/org/apache/lucene/store/Directory.java (revision 564022)
+++ src/java/org/apache/lucene/store/Directory.java (working copy)
@@ -17,6 +17,10 @@
  * limitations under the License.
  */
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.analysis.Analyzer;
+
 import java.io.IOException;
 /** A Directory is a flat list of files.  Files may be written once, when they
@@ -36,8 +40,16 @@
  *
  * @author Doug Cutting
  */
-public abstract class Directory {
+public abstract class Directory extends Index {
+
+  public IndexReader indexReaderFactory() throws IOException {
+    return IndexReader.open(this);
+  }
+  public IndexWriter indexWriterFactory(Analyzer analyzer, boolean create) throws IOException {
+    return new IndexWriter(this, analyzer, create);
+  }
+
   /** Holds the LockFactory instance (implements locking for
* this Directory instance). */
   protected LockFactory lockFactory;






[jira] Commented: (LUCENE-974) Remove Author tags from code

2007-08-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518718
 ] 

Grant Ingersoll commented on LUCENE-974:


I'm going to commit this today.

> Remove Author tags from code
> 
>
> Key: LUCENE-974
> URL: https://issues.apache.org/jira/browse/LUCENE-974
> Project: Lucene - Java
>  Issue Type: Wish
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Trivial
> Attachments: LUCENE-974.patch
>
>
> Remove all author tags from the code.




[jira] Resolved: (LUCENE-974) Remove Author tags from code

2007-08-09 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved LUCENE-974.


Resolution: Fixed

Committed




[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-08-09 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518825
 ] 

Paul Elschot commented on LUCENE-584:
-

Mark,

I said: "there is never a threadsafety problem. (See BitSetMatcher.getMatcher() 
which uses a local class for the resulting Matcher.)"
That was a mistake. BitSetMatcher is a Matcher constructed from a BitSet, and 
SortedVIntList has a getMatcher() method, and I confused the two.

A Matcher is intended to be used in a single thread, so I don't expect thread 
safety problems.

The problem for the XML parser is that with this patch, the implementing data 
structure of a Filter becomes inaccessible from the Filter class, so it cannot 
be cached from there. That means that some cached data structure will have to 
be chosen, and one way to do that is by using class BitSetFilter from the 
patch. This has a bits() method just like the current Filter class.
CachingWrapperFilter could then become a cache for BitSetFilter.

There is indeed no caching of filters in this patch.
The reason for that is that some Filters do not need a cache. For example:
class TermFilter {
  private final Term term;
  TermFilter(Term t) { this.term = t; }
  Matcher getMatcher(IndexReader reader) throws IOException {
    return new TermMatcher(reader.termDocs(this.term));
  }
}
TermMatcher does not exist (yet), but it could easily be introduced by leaving 
all the scoring out of the current TermScorer.

As for DocIdSet, as long as this provides a Matcher as an iterator, it can be 
used to implement a (caching) filter.

I don't think this patch complicates the implementation of caching strategies.
For example one could define:
class CachableFilter extends Filter {
  ... some methods to access the underlying data structure to be cached. ...
}
or write a similar adapter for some subclass of Filter and then write a 
FilterCache that caches these.

I did consider defining Matcher as an interface, but I preferred not to do that 
because of the default explain() method in the Matcher class of the patch.





Re: Best Practices for getting Strings from a position range

2007-08-09 Thread Peter Keegan
Hi Grant,

I'm hoping to check this out soon.

Thanks,
Peter

On 8/7/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> Hi Peter,
>
> Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
> provides a TermVectorMapper that loads by position.
>
> Still not ideally what you want, but I haven't had time to scope
> that one out yet.
>
> -Grant
>
> On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
>
> > Hi Grant,
> >
> > No problem - I know you are very busy.  I just wanted to get a sense
> > for the timing because I'd like to use this for a release this Fall.
> > If I can get a prototype working in the coming weeks AND the
> > performance is great :) , this would be terrific. If not, I'll have
> > to fall back on a more complex design that handles the query outside
> > of Lucene :(
> >
> > In the meantime, I'll try playing with LUCENE-868.
> >
> > Thanks for the update.
> > Peter
> >
> > On 7/24/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >>
> >> Sorry, Peter, I haven't had a chance to work on it.  I don't see it
> >> happening this week, but maybe next.
> >>
> >> I do think the Mapper approach via TermVectors will work.  It will
> >> require implementing a new mapper that orders by position, but I
> >> don't think that is too hard.   I started on one on the LUCENE-868
> >> patch (version 4) but it is not complete.  Maybe you want to pick
> >> it up?
> >>
> >> With this approach, you would iterate your spans, when you come to a
> >> new doc, you would load the term vector using the PositionMapper, and
> >> then you could index into the positions for the matches in the
> >> document.
> >>
> >> I realize this does not cover the just wanting to get the Payload at
> >> the match issue.  Maybe next week...
> >>
> >> Cheers,
> >> Grant
> >>
> >> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
> >>
> >> > Any idea on when this might be available (days, weeks...)?
> >> >
> >> > Peter
> >> >
> >> > On 7/16/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >>
> >> >> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
> >> >>
> >> >> >
> >> >> > : Do we have a best practice for going from, say a SpanQuery doc/
> >> >> > : position information and retrieving the actual range of positions of
> >> >> > : content from the Document?  Is it just to reanalyze the Document
> >> >> > : using the appropriate Analyzer and start recording once you hit the
> >> >> > : positions you are interested in?  Seems like Term Vectors _could_
> >> >> > : help, but even my new Mapper approach patch (LUCENE-868) doesn't
> >> >> > : really help, because they are stored in a term-centric manner.  I
> >> >> > : guess what I am after is a position centric approach.  That is, give
> >> >> >
> >> >> > this is kind of what i was suggesting in the last message i sent
> >> >> > to the java-user thread about payloads and SpanQueries (which i'm
> >> >> > guessing is what prompted this thread as well)...
> >> >> >
> >> >> > http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628
> >> >>
> >> >>
> >> >> This is one use case, the other is related to the new patch I
> >> >> submitted for LUCENE-960.  In this case, I have a SpanQueryFilter
> >> >> that identifies a bunch of docs and positions ahead of time.  Then
> >> >> the user enters a new Span Query and I want to relate the matches
> >> >> from the user query with the positions of matches in the filter and
> >> >> then show that window.
> >> >>
> >> >> >
> >> >> > my point was that currently, to retrieve a payload you need a
> >> >> > TermPositions instance, which is designed for iterating in the
> >> >> > order of...
> >> >> > seek(term)
> >> >> >   skipTo(doc)
> >> >> >     nextPosition()
> >> >> >       getPayload()
> >> >> > ...which is great for getting the payload of every instance
> >> >> > (ie: position) of a specific term in a given document (or in every
> >> >> > document) but without serious changes to the Spans API, the ideal
> >> >> > payload API would let you say...
> >> >> > skipTo(doc)
> >> >> >   advance(startPosition)
> >> >> >     getPayload()
> >> >> >   while (nextPosition() < endPosition)
> >> >> >     getPosition()
> >> >> >
> >> >> > but this seems like a nearly impossible API to implement given the
> >> >> > nature of the inverted index and the fact that terms aren't ever
> >> >> > stored in position order.
> >> >> >
> >> >> > there's a lot i really don't know/understand about the lucene term
> >> >> > position internals ... but as i recall, the datastructure written
> >> >> > to disk isn't actually a tree structure inverted index, it's a long
> >> >> > sequence of tuples correct?  so in theory you could scan along the
> >> >> > tuples until you find the doc you are interested in, ignoring all
> >> >> > of the term info along

[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-08-09 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518845
 ] 

Mark Harwood commented on LUCENE-584:
-

Hi Paul,

Not sure we've reached a common understanding here yet.

You said "That was a mistake. BitSetMatcher is a Matcher constructed from a 
BitSet, and SortedVIntList has a getMatcher() method, and I confused the two. "
Ok, thanks for the clarification. I still feel uncomfortable because the method 
getMatcher() is not abstracted to a common interface. This was the thinking 
behind my "getIterator" method on DocIdSet.

I too made a mistake in my earlier comments. DocIdSetIterator does NOT have 
"probably one implementation". There would be an implementation for each 
different type of DocIdSet (Bitset/OpenBitSet/VIntList).

You said "some Filters do not need a cache. For example: TermFilter".  I'm not 
sure why that has been singled out as not worthy of caching. I have certain 
terms (e.g. gender:male) where the TermDocs is very large (50% of all docs in 
the index!) so multiple calls to TermDocs for term "gender:male" (if that is 
what you are suggesting) is highly undesirable. These are typically handled in 
the XMLQueryParser using syntax like this:
  
male
  

You said: "CachingWrapperFilter could then become a cache for BitSetFilter. "
This means that the only caching strategy is one based on bitsets - does this 
not lose perhaps the main benefit of your whole proposal? - the ability to have 
alternative space efficient storage of sets of document ids e.g. SortedVIntList.

If this is undesirable (my guess is "yes") then the proposal in my previous 
comment is a solution which allows for caching of any/all types of the new sets 
(OpenBitSet, BitSet, SortedVIntList etc). Regardless of my choice of class names 
or decisions over interfaces vs abstract classes, do you not at least agree on 
the need for 3 types of functionality:

1) A factory for instantiating sets of document ids matching a particular set 
of criteria (which can be costly to call). While the factory is not expected to 
implement a caching strategy, it is expected to implement hashCode/equals 
simply to aid any caching services, which would need this help to identify 
previously instantiated sets which share the same criteria as any new requests. 
(This service I identified as my "DocIdSetFactory" and TermsFilter/RangeFilter 
would be example implementations.) 
2) An object representing an instantiated set of document ids which can be 
cached and can create iterators for use in separate threads (identified as my 
DocIdSet - example implementations being called something like BitSetDocSet, 
SortedVIntList). 
3) An iterator for a set of document ids (my DocIdSetIterator - example impls 
being called something like BitSetDocSetIterator, SortedVIntListIterator).

Each type of functionality can have different implementations so the 
functionality must be defined using an interface or abstract class. 
If we can agree on this much as a set of responsibilities then we can begin to 
map these services onto something more concrete.
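
To show why the factory needs hashCode/equals, here is a rough sketch of a caching service built on responsibilities 1 and 2. It is an assumption for illustration only: DocIdSetCache and its nested Loader are hypothetical names, and the reader, factory and set types are left generic so the code compiles without Lucene.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical cache. R = reader type, F = factory type (must implement
// equals/hashCode on its criteria), S = the cached doc id set type.
final class DocIdSetCache<R, F, S> {

  // Builds a set on a cache miss; this is the potentially costly call.
  interface Loader<R, F, S> {
    S load(R reader, F factory);
  }

  // Weak keys, so entries for a reader can go away once the reader is
  // unreachable (cached sets must not hold a strong reference to the reader).
  private final Map<R, Map<F, S>> perReader = new WeakHashMap<R, Map<F, S>>();
  private final Loader<R, F, S> loader;

  DocIdSetCache(Loader<R, F, S> loader) {
    this.loader = loader;
  }

  synchronized S get(R reader, F factory) {
    Map<F, S> sets = perReader.get(reader);
    if (sets == null) {
      sets = new HashMap<F, S>();
      perReader.put(reader, sets);
    }
    // Lookup relies on the factory's equals/hashCode, not on object identity.
    S set = sets.get(factory);
    if (set == null) {
      set = loader.load(reader, factory);
      sets.put(factory, set);
    }
    return set;
  }
}
```

With this shape, two requests that carry equal criteria against the same reader share one cached set whatever its concrete type, so the cache is not tied to bitsets.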


Cheers
Mark









[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-08-09 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518858
 ] 

Paul Elschot commented on LUCENE-584:
-

Mark,

I think we are on the same line, it's just that I don't want to go that far 
now.
Have another look at the title of this issue. It may be in your title bar, but 
otherwise it's quite a bit of scrolling so I'll repeat it here: "Decouple 
Filter from BitSet". 
That is the main thing that this patch tries to do.

And that also makes it a starting point for caching of different data 
structures for Filters.
Caching of Filters is very much needed, but I'd rather see that as another 
issue.

The DefaultMatcher class tries to do some compression by using a SortedVIntList 
when that is smaller than a BitSet, and that is about as far as I'd like to go 
now.
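
A size comparison of that kind could be sketched roughly as follows. This is not the code from the patch: DocSetSizing and its methods are hypothetical names, the sketch assumes the sorted list stores gaps between doc ids as VInts (7 payload bits per byte), and it counts a bit set as roughly one bit per document.

```java
// Hypothetical helper, for illustration only.
final class DocSetSizing {

  // Bytes needed to write a non-negative int as a VInt (7 payload bits per byte).
  static int vIntBytes(int value) {
    int bytes = 1;
    while ((value & ~0x7F) != 0) {
      value >>>= 7;
      bytes++;
    }
    return bytes;
  }

  // Bytes for a sorted array of doc ids stored as VInt-encoded gaps.
  static long deltaEncodedBytes(int[] sortedDocs) {
    long total = 0;
    int previous = 0;
    for (int doc : sortedDocs) {
      total += vIntBytes(doc - previous);
      previous = doc;
    }
    return total;
  }

  // Approximate bytes for a bit set covering maxDoc documents.
  static long bitSetBytes(int maxDoc) {
    return (maxDoc + 7L) / 8;
  }

  // True when the gap-encoded list is expected to be smaller than the bit set.
  static boolean preferVIntList(int[] sortedDocs, int maxDoc) {
    return deltaEncodedBytes(sortedDocs) < bitSetBytes(maxDoc);
  }
}
```

Under these assumptions three matching docs in an index of a million documents cost a handful of bytes as a list against about 125000 bytes as a bit set, while a set in which every document matches is smaller as a bit set.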

Proost,
Paul Elschot





[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-08-09 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518868
 ] 

Mark Harwood commented on LUCENE-584:
-

OK, I appreciate caching may not be a top priority in this proposal but I have 
live systems in production which use XMLQueryParser and the existing core 
facilities for caching. As it stands this proposal breaks this functionality 
(see "FIXME" in contrib's CachedFilterBuilder and my concerns over use of the 
non-thread-safe Matcher in the core class CachingWrapperFilter).

I am obviously concerned by this and keen to help shape a solution which 
preserves the existing capabilities while adding your new functionality. I'm 
not sure I share your view that support for caching can be treated as a 
separate issue to be dealt with at a later date. There are a large number of 
changes proposed in this patch and if the design does not at least consider 
future caching issues now, I suspect much will have to be reworked later. The 
change I can envisage most clearly is expressed in my concern that the DocIdSet 
and DocIdSetIterator services I outlined are being combined in Matcher as it 
stands now and these functions will have to be separated.

Cheers
Mark
