On Jul 18, 2005, at 11:14 AM, Jeff Shell wrote:
> I'm working on a simple application which is the first time I get to
> use the catalog in Zope 3. I'm writing against Zope 3.1b1. I was
> dismayed not to see KeywordIndex in the main catalog set, but then I
> found it in zope.index.keyword. But it seems to be a bit behind.
Hi. Yes, we needed it too. Here's another thing we want to open
source. Look at the attached .txt file; if you want it then tell me
and I'll make it available in a sandbox. We'll move it over into the
Zope repo (probably with a new name, or rearranged into the appropriate
locations: zope.index, zope.app.catalog, etc.) RSN.
Downsides:
- Note that some functionality requires that you use an extent
catalog, another goodie in the package.
- We have some refactoring of this that we want to do. We'll have
legacy issues ourselves, then.
Additional upside:
- This package also includes a replacement for the field index
(called a value index) and customizations of the value and set
indexes specific to timezone-aware datetimes, as well as a few other
things.
Gary
The setindex is an index similar to, but more general than, a traditional
keyword index. The values indexed are expected to be iterables; the index
allows searches for documents that contain any of a set of values, all of a set
of values, or values between a minimum and a maximum.
Additionally, the index supports an interface that allows examination of the
indexed values.
It is as policy-free as possible, and is intended to be the engine for indexes
with more policy, as well as being useful itself.
On creation, the index has no wordCount, no documentCount, and is, as
expected, fairly empty.
>>> from zc.catalog.index import SetIndex
>>> index = SetIndex()
>>> index.documentCount()
0
>>> index.wordCount()
0
>>> index.maxValue() # doctest: +ELLIPSIS
Traceback (most recent call last):
...
ValueError:...
>>> index.minValue() # doctest: +ELLIPSIS
Traceback (most recent call last):
...
ValueError:...
>>> list(index.values())
[]
>>> len(index.apply({'any_of': (5,)}))
0
The index supports indexing any value. All values within a given index must
sort consistently across Python versions. In our example, we hope that strings
and integers will sort consistently; this may not be a reasonable hope.
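On Python 3 the hope does not hold: integers and strings cannot be ordered against each other, so sorting a mixed collection raises TypeError. A minimal sketch of a check one could run over candidate values before indexing them; `sorts_consistently` is a hypothetical helper, not part of the package:

```python
def sorts_consistently(values):
    """Return True if the values can be sorted together on this interpreter."""
    try:
        sorted(values)
    except TypeError:
        # Raised on Python 3 when two values have no defined ordering,
        # for example an int compared with a str.
        return False
    return True


same_type = sorts_consistently([3, 1, 2])
mixed_types = sorts_consistently([1, 'a'])
```

On Python 3, `same_type` is True and `mixed_types` is False, which suggests keeping one value type per index there.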
>>> data = {1: ['a', 1],
...         2: ['b', 'a', 3, 4, 7],
...         3: [1],
...         4: [1, 4, 'c'],
...         5: [7],
...         6: [5, 6, 7],
...         7: ['c'],
...         8: [1, 6],
...         9: ['a', 'c', 2, 3, 4, 6],
...         }
>>> for k, v in data.items():
...     index.index_doc(k, v)
...
After indexing, the statistics and values match the newly entered content.
>>> list(index.values())
[1, 2, 3, 4, 5, 6, 7, 'a', 'b', 'c']
>>> index.documentCount()
9
>>> index.wordCount()
10
>>> index.maxValue()
'c'
>>> index.minValue()
1
>>> list(index.ids())
[1, 2, 3, 4, 5, 6, 7, 8, 9]
The index supports five types of query. The first is 'any_of'. It
takes an iterable of values, and returns an iterable of document ids that
contain any of the values. The results are weighted.
>>> list(index.apply({'any_of':('b', 1, 5)}))
[1, 2, 3, 4, 6, 8]
>>> list(index.apply({'any_of': ('b', 1, 5)}))
[1, 2, 3, 4, 6, 8]
>>> list(index.apply({'any_of':(42,)}))
[]
>>> index.apply({'any_of': ('a', 3, 7)})
BTrees._IFBTree.IFBucket([(1, 1.0), (2, 3.0), (5, 1.0), (6, 1.0), (9, 2.0)])
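The weights in the last result are consistent with counting, for each document, how many of the queried values it contains. A minimal sketch of that reading in plain Python, assuming this is the rule; the `any_of` helper and its dict result are illustrative only and are not the BTrees-backed implementation:

```python
data = {1: ['a', 1], 2: ['b', 'a', 3, 4, 7], 3: [1], 4: [1, 4, 'c'],
        5: [7], 6: [5, 6, 7], 7: ['c'], 8: [1, 6],
        9: ['a', 'c', 2, 3, 4, 6]}


def any_of(data, wanted):
    """Map doc id -> number of wanted values the document contains."""
    wanted = set(wanted)
    weights = {}
    for doc_id in sorted(data):
        hits = len(wanted.intersection(data[doc_id]))
        if hits:
            # Assumption: weight equals the count of matching values.
            weights[doc_id] = float(hits)
    return weights


weights = any_of(data, ('a', 3, 7))
```

Under that assumption, document 2 scores highest because it holds all three queried values.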
Another query is 'any'. If the key is None, all indexed document ids with any
values are returned. If the key is an extent, the intersection of the extent
and all document ids with any values is returned.
>>> list(index.apply({'any': None}))
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> from zc.catalog.extentcatalog import FilterExtent
>>> extent = FilterExtent(lambda extent, uid, obj: True)
>>> for i in range(15):
...     extent.add(i, i)
...
>>> list(index.apply({'any': extent}))
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> limited_extent = FilterExtent(lambda extent, uid, obj: True)
>>> for i in range(5):
...     limited_extent.add(i, i)
...
>>> list(index.apply({'any': limited_extent}))
[1, 2, 3, 4]
The 'all_of' argument also takes an iterable of values, but returns an
iterable of document ids that contain all of the values. The results are not
weighted.
>>> list(index.apply({'all_of': ('a',)}))
[1, 2, 9]
>>> list(index.apply({'all_of': (3, 4)}))
[2, 9]
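Conceptually this is a subset test: a document matches when every queried value is among its values. A small sketch in plain Python, using the same sample data; the `all_of` helper is hypothetical and says nothing about how the index computes the result internally:

```python
data = {1: ['a', 1], 2: ['b', 'a', 3, 4, 7], 3: [1], 4: [1, 4, 'c'],
        5: [7], 6: [5, 6, 7], 7: ['c'], 8: [1, 6],
        9: ['a', 'c', 2, 3, 4, 6]}


def all_of(data, wanted):
    """Return sorted doc ids whose values include every wanted value."""
    wanted = set(wanted)
    return sorted(doc_id for doc_id, values in data.items()
                  if wanted.issubset(values))


matches = all_of(data, (3, 4))
```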
The 'between' argument takes from one to four values. The first is the
minimum, and defaults to None, indicating no minimum; the second is the
maximum, and defaults to None, indicating no maximum; the next is a boolean for
whether the minimum value should be excluded, and defaults to False; and the
last is a boolean for whether the maximum value should be excluded, and also
defaults to False. The results are weighted.
>>> list(index.apply({'between': (1, 7)}))
[1, 2, 3, 4, 5, 6, 8, 9]
>>> list(index.apply({'between': ('b', None)}))
[2, 4, 7, 9]
>>> list(index.apply({'between': ('b',)}))
[2, 4, 7, 9]
>>> list(index.apply({'between': (1, 7, True, True)}))
[2, 4, 6, 8, 9]
>>> index.apply({'between': (2, 6)})
BTrees._IFBTree.IFBucket([(2, 2.0), (4, 1.0), (6, 2.0), (8, 1.0), (9, 4.0)])
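The four positional values and their defaults can be modelled as follows. This is a sketch only: it uses the integer values from the sample data (to avoid comparing strings with integers), the `between` helper is hypothetical, and the weighting rule (one point per value in range) is an assumption read off the output above.

```python
int_data = {1: [1], 2: [3, 4, 7], 3: [1], 4: [1, 4], 5: [7],
            6: [5, 6, 7], 8: [1, 6], 9: [2, 3, 4, 6]}


def between(data, minimum=None, maximum=None,
            exclude_min=False, exclude_max=False):
    """Map doc id -> number of its values inside the requested range."""
    def in_range(value):
        if minimum is not None:
            if value < minimum or (exclude_min and value == minimum):
                return False
        if maximum is not None:
            if value > maximum or (exclude_max and value == maximum):
                return False
        return True

    weights = {}
    for doc_id in sorted(data):
        hits = sum(1 for value in set(data[doc_id]) if in_range(value))
        if hits:
            weights[doc_id] = float(hits)
    return weights


# A query tuple of one to four values maps onto the positional arguments.
inclusive = between(int_data, *(2, 6))
exclusive = between(int_data, *(1, 7, True, True))
```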
The 'none' argument takes an extent and returns the ids in the extent
that are not indexed; it is intended to be used to return docids that have
no (or empty) values.
>>> list(index.apply({'none': extent}))
[0, 10, 11, 12, 13, 14]
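Seen together, 'any' with an extent and 'none' with the same extent partition that extent into ids with values and ids without. A sketch with plain sets standing in for the extent and the index; `split_extent` is a hypothetical helper and the real FilterExtent does more than hold ids:

```python
def split_extent(extent_ids, indexed_ids):
    """Split an extent into (ids with values, ids without values)."""
    extent = set(extent_ids)
    indexed = set(indexed_ids)
    with_values = sorted(extent & indexed)      # what 'any' would return
    without_values = sorted(extent - indexed)   # what 'none' would return
    return with_values, without_values


# The extent holds ids 0..14; documents 1..9 are indexed.
with_values, without_values = split_extent(range(15), range(1, 10))
```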
Trying to use more than one of these at a time generates an error.
>>> index.apply({'all_of': (5,), 'any_of': (3,)})
... # doctest: +ELLIPSIS
Traceback (most recent call last):
...
ValueError:...
Using none of them simply returns None.
>>> index.apply({}) # returns None
Invalid query names cause ValueErrors.
>>> index.apply({'foo':()})
... # doctest: +ELLIPSIS
Traceback (most recent call last):
...
ValueError:...
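The three rules above (at most one query type, None for an empty query, ValueError for an unknown name) could be enforced by a small dispatcher along these lines. This is a sketch of the documented behaviour, not the package's code; `pick_query` and `QUERY_TYPES` are hypothetical names:

```python
QUERY_TYPES = ('any_of', 'any', 'all_of', 'between', 'none')


def pick_query(query):
    """Return (name, value) for a valid query dict, or None if it is empty."""
    if not query:
        return None
    if len(query) > 1:
        raise ValueError('use only one query type at a time')
    (name, value), = query.items()
    if name not in QUERY_TYPES:
        raise ValueError('unknown query type: %r' % (name,))
    return name, value


picked = pick_query({'any_of': (5,)})
```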
When you unindex a document, the searches and statistics should be updated.
>>> index.unindex_doc(6)
>>> len(index.apply({'any_of': (5,)}))
0
>>> index.documentCount()
8
>>> index.wordCount()
9
>>> list(index.values())
[1, 2, 3, 4, 6, 7, 'a', 'b', 'c']
>>> list(index.ids())
[1, 2, 3, 4, 5, 7, 8, 9]
Reindexing a document that has gained values is also reflected in
subsequent searches and statistics.
>>> data[8].extend([5, 'c'])
>>> index.index_doc(8, data[8])
>>> index.documentCount()
8
>>> index.wordCount()
10
>>> list(index.apply({'any_of': (5,)}))
[8]
>>> list(index.apply({'any_of': ('c',)}))
[4, 7, 8, 9]
The same is true for reindexing a document with both additions and removals.
>>> 2 in set(index.apply({'any_of': (7,)}))
True
>>> 2 in set(index.apply({'any_of': (2,)}))
False
>>> data[2].pop()
7
>>> data[2].append(2)
>>> index.index_doc(2, data[2])
>>> 2 in set(index.apply({'any_of': (7,)}))
False
>>> 2 in set(index.apply({'any_of': (2,)}))
True
Reindexing a document that no longer has any values causes it to be removed
from the statistics.
>>> del data[2][:]
>>> index.index_doc(2, data[2])
>>> index.documentCount()
7
>>> index.wordCount()
9
>>> list(index.ids())
[1, 3, 4, 5, 7, 8, 9]
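The bookkeeping described in the last few examples can be modelled with two mappings, one from values to document ids and one from document ids to values. The class below is a toy stand-in written for illustration; its attribute names are assumptions, and it is not the zc.catalog implementation.

```python
class ToySetIndex:
    """Illustrative model of the index/unindex/reindex bookkeeping."""

    def __init__(self):
        self.values_to_docs = {}   # value -> set of doc ids
        self.docs_to_values = {}   # doc id -> set of values

    def unindex_doc(self, doc_id):
        for value in self.docs_to_values.pop(doc_id, ()):
            docs = self.values_to_docs[value]
            docs.discard(doc_id)
            if not docs:
                # No document holds this value any more.
                del self.values_to_docs[value]

    def index_doc(self, doc_id, values):
        # Reindexing replaces whatever was stored for the document.
        self.unindex_doc(doc_id)
        values = set(values)
        if not values:
            return  # a document without values is not counted
        self.docs_to_values[doc_id] = values
        for value in values:
            self.values_to_docs.setdefault(value, set()).add(doc_id)

    def documentCount(self):
        return len(self.docs_to_values)

    def wordCount(self):
        return len(self.values_to_docs)


toy = ToySetIndex()
data = {1: ['a', 1], 2: ['b', 'a', 3, 4, 7], 3: [1], 4: [1, 4, 'c'],
        5: [7], 6: [5, 6, 7], 7: ['c'], 8: [1, 6],
        9: ['a', 'c', 2, 3, 4, 6]}
for doc_id, values in data.items():
    toy.index_doc(doc_id, values)
initial = (toy.documentCount(), toy.wordCount())

toy.unindex_doc(6)                      # value 5 disappears with document 6
toy.index_doc(8, [1, 6, 5, 'c'])        # document 8 gains 5 and 'c'
toy.index_doc(2, [])                    # document 2 loses all its values
final = (toy.documentCount(), toy.wordCount())
```

Replaying the same steps as the doctest gives the same document and value counts, which is the only claim this model makes.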
This affects both ways of determining the ids that are and are not in the index
(that do and do not have values).
>>> list(index.apply({'any': None}))
[1, 3, 4, 5, 7, 8, 9]
>>> list(index.apply({'none': extent}))
[0, 2, 6, 10, 11, 12, 13, 14]
The values method can be used to examine the indexed values for a given
document id.
>>> set(index.values(doc_id=8)) == set([1, 5, 6, 'c'])
True
And the containsValue method provides a way of determining membership in the
values.
>>> index.containsValue(5)
True
>>> index.containsValue(20)
False
_______________________________________________
Zope3-dev mailing list
Zope3-dev@zope.org