Re: [Zope3-dev] KeywordIndex

Gary Poster Mon, 18 Jul 2005 08:42:42 -0700


On Jul 18, 2005, at 11:14 AM, Jeff Shell wrote:

> I'm working on a simple application which is the first time I get to
> use the catalog in Zope 3. I'm writing against Zope 3.1b1. I was
> dismayed not to see KeywordIndex in the main catalog set, but then I
> found it in zope.index.keyword. But it seems to be a bit behind.

Hi. Yes, we needed it too. Here's another thing we want to open-source. Look at the attached .txt file; if you want it, tell me and I'll make it available in a sandbox. We'll move it over into the Zope repo (probably with a new name, or rearranged into the appropriate locations: zope.index, zope.app.catalog, etc.) RSN.


Downsides:

- Note that some functionality requires that you use an extent catalog, another goodie in the package.

- We have some refactoring of this that we want to do. We'll have legacy issues ourselves, then.


Additional upside:

- This package also includes a replacement for the field index (called a value index) and customizations of the value and set indexes specific to timezone-aware datetimes, as well as a few other things.


Gary

The setindex is an index similar to, but more general than a traditional
keyword index.  The values indexed are expected to be iterables; the index
allows searches for documents that contain any of a set of values; all of a set
of values; or between a set of values.

Additionally, the index supports an interface that allows examination of the
indexed values.

It is as policy-free as possible, and is intended to be the engine for indexes
with more policy, as well as being useful itself.

On creation, the index has no wordCount, no documentCount, and is, as
expected, fairly empty.

    >>> from zc.catalog.index import SetIndex
    >>> index = SetIndex()
    >>> index.documentCount()
    0
    >>> index.wordCount()
    0
    >>> index.maxValue() # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError:...
    >>> index.minValue() # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError:...
    >>> list(index.values())
    []
    >>> len(index.apply({'any_of': (5,)}))
    0

The index supports indexing any value.  All values within a given index must
sort consistently across Python versions.  In our example, we hope that strings
and integers will sort consistently; this may not be a reasonable hope.

    >>> data = {1: ['a', 1],
    ...         2: ['b', 'a', 3, 4, 7],
    ...         3: [1],
    ...         4: [1, 4, 'c'],
    ...         5: [7],
    ...         6: [5, 6, 7],
    ...         7: ['c'],
    ...         8: [1, 6],
    ...         9: ['a', 'c', 2, 3, 4, 6,],
    ... }
    >>> for k, v in data.items():
    ...     index.index_doc(k, v)
    ...

After indexing, the statistics and values match the newly entered content. 

    >>> list(index.values())
    [1, 2, 3, 4, 5, 6, 7, 'a', 'b', 'c']
    >>> index.documentCount()
    9
    >>> index.wordCount()
    10
    >>> index.maxValue()
    'c'
    >>> index.minValue()
    1
    >>> list(index.ids())
    [1, 2, 3, 4, 5, 6, 7, 8, 9]

The index supports five types of query.  The first is 'any_of'.  It
takes an iterable of values, and returns an iterable of document ids that
contain any of the values.  The results are weighted.

    >>> list(index.apply({'any_of':('b', 1, 5)}))
    [1, 2, 3, 4, 6, 8]
    >>> list(index.apply({'any_of': ('b', 1, 5)}))
    [1, 2, 3, 4, 6, 8]
    >>> list(index.apply({'any_of':(42,)}))
    []
    >>> index.apply({'any_of': ('a', 3, 7)})
    BTrees._IFBTree.IFBucket([(1, 1.0), (2, 3.0), (5, 1.0), (6, 1.0), (9, 2.0)])
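
The weights in the last result are consistent with counting, for each
document, how many of the queried values it contains.  Here is a minimal
pure-Python sketch of that reading.  It does not use zc.catalog: the helper
name any_of is hypothetical, and it returns a plain dict where the real index
returns a BTrees bucket.

```python
# Illustrative sketch only: models 'any_of' weighting with plain dicts.
# `any_of` is a hypothetical helper, not part of zc.catalog.
data = {1: ['a', 1],
        2: ['b', 'a', 3, 4, 7],
        3: [1],
        4: [1, 4, 'c'],
        5: [7],
        6: [5, 6, 7],
        7: ['c'],
        8: [1, 6],
        9: ['a', 'c', 2, 3, 4, 6]}


def any_of(data, values):
    """Map each matching doc id to the number of queried values it holds."""
    wanted = set(values)
    result = {}
    for doc_id, doc_values in sorted(data.items()):
        weight = len(wanted.intersection(doc_values))
        if weight:
            result[doc_id] = float(weight)
    return result


print(any_of(data, ('a', 3, 7)))
# {1: 1.0, 2: 3.0, 5: 1.0, 6: 1.0, 9: 2.0}
```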

Another query is 'any'.  If the key is None, all indexed document ids with any
values are returned.  If the key is an extent, the intersection of the extent
and all document ids with any values is returned.

    >>> list(index.apply({'any': None}))
    [1, 2, 3, 4, 5, 6, 7, 8, 9]

    >>> from zc.catalog.extentcatalog import FilterExtent
    >>> extent = FilterExtent(lambda extent, uid, obj: True)
    >>> for i in range(15):
    ...     extent.add(i, i)
    ...
    >>> list(index.apply({'any': extent}))
    [1, 2, 3, 4, 5, 6, 7, 8, 9]
    >>> limited_extent = FilterExtent(lambda extent, uid, obj: True)
    >>> for i in range(5):
    ...     limited_extent.add(i, i)
    ...
    >>> list(index.apply({'any': limited_extent}))
    [1, 2, 3, 4]

The 'all_of' argument also takes an iterable of values, but returns an
iterable of document ids that contain all of the values.  The results are not
weighted.

    >>> list(index.apply({'all_of': ('a',)}))
    [1, 2, 9]
    >>> list(index.apply({'all_of': (3, 4)}))
    [2, 9]
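
As a rough illustration, 'all_of' behaves like a subset test against each
document's values.  The sketch below assumes nothing about the real
implementation; all_of is a hypothetical stand-in written against a plain
dict.

```python
# Illustrative sketch only: 'all_of' modelled as a subset test.
# `all_of` is a hypothetical helper, not part of zc.catalog.
data = {1: ['a', 1],
        2: ['b', 'a', 3, 4, 7],
        3: [1],
        4: [1, 4, 'c'],
        5: [7],
        6: [5, 6, 7],
        7: ['c'],
        8: [1, 6],
        9: ['a', 'c', 2, 3, 4, 6]}


def all_of(data, values):
    """Return the sorted doc ids whose values include every queried value."""
    wanted = set(values)
    return sorted(doc_id for doc_id, doc_values in data.items()
                  if wanted.issubset(doc_values))


print(all_of(data, ('a',)))   # [1, 2, 9]
print(all_of(data, (3, 4)))   # [2, 9]
```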

The 'between' argument takes from one to four values.  The first is the
minimum, and defaults to None, indicating no minimum; the second is the
maximum, and defaults to None, indicating no maximum; the next is a boolean for
whether the minimum value should be excluded, and defaults to False; and the
last is a boolean for whether the maximum value should be excluded, and also
defaults to False.  The results are weighted.

    >>> list(index.apply({'between': (1, 7)}))
    [1, 2, 3, 4, 5, 6, 8, 9]
    >>> list(index.apply({'between': ('b', None)}))
    [2, 4, 7, 9]
    >>> list(index.apply({'between': ('b',)}))
    [2, 4, 7, 9]
    >>> list(index.apply({'between': (1, 7, True, True)}))
    [2, 4, 6, 8, 9]
    >>> index.apply({'between': (2, 6)})
    BTrees._IFBTree.IFBucket([(2, 2.0), (4, 1.0), (6, 2.0), (8, 1.0), (9, 4.0)])
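
The four positional values can be read as the arguments of a range test, with
the weight being the number of a document's values that fall in the range.
The following sketch works under two assumptions: the helper names are
hypothetical, and sort_key emulates the ordering seen in the examples
(integers before strings) because Python 3 will not compare the two directly.

```python
# Illustrative sketch only: 'between' modelled as a range test.
# `sort_key` and `between` are hypothetical helpers, not zc.catalog API.
data = {1: ['a', 1],
        2: ['b', 'a', 3, 4, 7],
        3: [1],
        4: [1, 4, 'c'],
        5: [7],
        6: [5, 6, 7],
        7: ['c'],
        8: [1, 6],
        9: ['a', 'c', 2, 3, 4, 6]}


def sort_key(value):
    # Assumption: integers sort before strings, as in the examples.
    return (isinstance(value, str), value)


def between(data, minimum=None, maximum=None,
            exclude_min=False, exclude_max=False):
    """Map doc ids to the number of their values inside the range."""
    def in_range(value):
        key = sort_key(value)
        if minimum is not None:
            low = sort_key(minimum)
            if key < low or (exclude_min and key == low):
                return False
        if maximum is not None:
            high = sort_key(maximum)
            if key > high or (exclude_max and key == high):
                return False
        return True

    result = {}
    for doc_id, doc_values in sorted(data.items()):
        weight = sum(1 for value in set(doc_values) if in_range(value))
        if weight:
            result[doc_id] = float(weight)
    return result


print(between(data, 2, 6))
# {2: 2.0, 4: 1.0, 6: 2.0, 8: 1.0, 9: 4.0}
print(sorted(between(data, 1, 7, True, True)))   # [2, 4, 6, 8, 9]
print(sorted(between(data, 'b')))                # [2, 4, 7, 9]
```

A query tuple such as (1, 7, True, True) maps onto this helper as
between(data, *query_tuple).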

The 'none' argument takes an extent and returns the ids in the extent
that are not indexed; it is intended to be used to return docids that have
no (or empty) values.

    >>> list(index.apply({'none': extent}))
    [0, 10, 11, 12, 13, 14]
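
Seen as set operations, 'any' with an extent is an intersection and 'none' is
a difference.  The sketch below models extents as plain Python sets, which is
a simplification: a real FilterExtent carries a filter and more state, and the
helper names any_ids and none_ids are hypothetical.

```python
# Illustrative sketch only: extents modelled as plain sets of doc ids.
# `any_ids` and `none_ids` are hypothetical helpers, not zc.catalog API.
indexed_ids = set(range(1, 10))    # documents 1..9 have indexed values
extent = set(range(15))            # ids 0..14
limited_extent = set(range(5))     # ids 0..4


def any_ids(indexed_ids, extent=None):
    """Ids that have values, optionally restricted to an extent."""
    if extent is None:
        return sorted(indexed_ids)
    return sorted(indexed_ids & extent)


def none_ids(indexed_ids, extent):
    """Ids in the extent that have no indexed values."""
    return sorted(extent - indexed_ids)


print(any_ids(indexed_ids, limited_extent))   # [1, 2, 3, 4]
print(none_ids(indexed_ids, extent))          # [0, 10, 11, 12, 13, 14]
```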

Trying to use more than one of these at a time generates an error.

    >>> index.apply({'all_of': (5,), 'any_of': (3,)})
    ... # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError:...

Using none of them simply returns None.

    >>> index.apply({}) # returns None

Invalid query names cause ValueErrors.

    >>> index.apply({'foo':()})
    ... # doctest: +ELLIPSIS
    Traceback (most recent call last):
    ...
    ValueError:...
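
The three rules above (at most one query type, None for an empty query, and
ValueError for an unknown name) could be enforced by a small dispatcher along
these lines.  This is only a sketch of the behavior described: pick_query and
its error messages are hypothetical and are not taken from the package.

```python
# Illustrative sketch only: validating a query dict before dispatch.
# `pick_query` and its messages are hypothetical, not zc.catalog code.
SUPPORTED = ('any_of', 'any', 'all_of', 'between', 'none')


def pick_query(query):
    """Return (name, argument) for a valid query, or None if it is empty."""
    if not query:
        return None
    if len(query) > 1:
        raise ValueError('only one query type may be used at a time')
    (name, argument), = query.items()
    if name not in SUPPORTED:
        raise ValueError('unknown query type: %r' % (name,))
    return name, argument


print(pick_query({}))                    # None
print(pick_query({'all_of': (3, 4)}))    # ('all_of', (3, 4))
```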

When you unindex a document, the searches and statistics should be updated.

    >>> index.unindex_doc(6)
    >>> len(index.apply({'any_of': (5,)}))
    0
    >>> index.documentCount()
    8
    >>> index.wordCount()
    9
    >>> list(index.values())
    [1, 2, 3, 4, 6, 7, 'a', 'b', 'c']
    >>> list(index.ids())
    [1, 2, 3, 4, 5, 7, 8, 9]

Reindexing a document that has new additional values is also reflected in
subsequent searches and statistics checks.

    >>> data[8].extend([5, 'c'])
    >>> index.index_doc(8, data[8])
    >>> index.documentCount()
    8
    >>> index.wordCount()
    10
    >>> list(index.apply({'any_of': (5,)}))
    [8]
    >>> list(index.apply({'any_of': ('c',)}))
    [4, 7, 8, 9]

The same is true for reindexing a document with both additions and removals.

    >>> 2 in set(index.apply({'any_of': (7,)}))
    True
    >>> 2 in set(index.apply({'any_of': (2,)}))
    False
    >>> data[2].pop()
    7
    >>> data[2].append(2)
    >>> index.index_doc(2, data[2])
    >>> 2 in set(index.apply({'any_of': (7,)}))
    False
    >>> 2 in set(index.apply({'any_of': (2,)}))
    True

Reindexing a document that no longer has any values causes it to be removed
from the statistics.

    >>> del data[2][:]
    >>> index.index_doc(2, data[2])
    >>> index.documentCount()
    7
    >>> index.wordCount()
    9
    >>> list(index.ids())
    [1, 3, 4, 5, 7, 8, 9]
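
One way to get the statistics shown above is to keep two mappings, value to
doc ids and doc id to values, and to treat reindexing as a diff between the
old and new value sets.  The class below is a minimal sketch of that idea
using plain dicts and sets; TinySetIndex is a hypothetical name, and nothing
here is claimed to match the internals of the real SetIndex.

```python
# Illustrative sketch only: a tiny two-way index built on dicts and sets.
# `TinySetIndex` is hypothetical and is not the zc.catalog implementation.
class TinySetIndex:
    def __init__(self):
        self.values_to_ids = {}   # value -> set of doc ids
        self.ids_to_values = {}   # doc id -> set of values

    def index_doc(self, doc_id, values):
        new = set(values)
        old = self.ids_to_values.get(doc_id, set())
        for value in old - new:            # values the document lost
            ids = self.values_to_ids[value]
            ids.discard(doc_id)
            if not ids:
                del self.values_to_ids[value]
        for value in new - old:            # values the document gained
            self.values_to_ids.setdefault(value, set()).add(doc_id)
        if new:
            self.ids_to_values[doc_id] = new
        else:
            # A document without values drops out of the statistics.
            self.ids_to_values.pop(doc_id, None)

    def unindex_doc(self, doc_id):
        self.index_doc(doc_id, ())

    def documentCount(self):
        return len(self.ids_to_values)

    def wordCount(self):
        return len(self.values_to_ids)


data = {1: ['a', 1], 2: ['b', 'a', 3, 4, 7], 3: [1], 4: [1, 4, 'c'],
        5: [7], 6: [5, 6, 7], 7: ['c'], 8: [1, 6],
        9: ['a', 'c', 2, 3, 4, 6]}
index = TinySetIndex()
for doc_id, values in data.items():
    index.index_doc(doc_id, values)
print(index.documentCount(), index.wordCount())   # 9 10
index.unindex_doc(6)
print(index.documentCount(), index.wordCount())   # 8 9
index.index_doc(8, [1, 6, 5, 'c'])
print(index.documentCount(), index.wordCount())   # 8 10
index.index_doc(2, [])
print(index.documentCount(), index.wordCount())   # 7 9
```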

This affects both ways of determining the ids that are and are not in the index
(that do and do not have values).

    >>> list(index.apply({'any': None}))
    [1, 3, 4, 5, 7, 8, 9]
    >>> list(index.apply({'none': extent}))
    [0, 2, 6, 10, 11, 12, 13, 14]

The values method can be used to examine the indexed values for a given 
document id.

    >>> set(index.values(doc_id=8)) == set([1, 5, 6, 'c'])
    True

And the containsValue method provides a way of determining membership in the
values.

    >>> index.containsValue(5)
    True
    >>> index.containsValue(20)
    False
