Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Tue, 19 Jun 2001, Chris Withers wrote:

> I'm guessing this is the point at which your problems become mine? ;-)

*evil laughter* Yes :-) We should write about it and publish it to the
community...

___
Zope-Dev maillist - [EMAIL PROTECTED]
http://lists.zope.org/mailman/listinfo/zope-dev
** No cross posts or HTML encoding! **
(Related lists -
http://lists.zope.org/mailman/listinfo/zope-announce
http://lists.zope.org/mailman/listinfo/zope )
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
> On Mon, 18 Jun 2001, Andreas Jung wrote:
>
> > These are good ideas to improve the TextIndex. I already encouraged
> > Erik to put it all together into a Fishbowl proposal,
>
> Which I would do, if I had time. Which I will have, but not for another
> two weeks. :-)

I'm guessing this is the point at which your problems become mine? ;-)

*grinz*

Chris
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Mon, 18 Jun 2001, Andreas Jung wrote:

> These are good ideas to improve the TextIndex. I already encouraged
> Erik to put it all together into a Fishbowl proposal,

Which I would do, if I had time. Which I will have, but not for another
two weeks. :-)
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
> Rik Hoekstra writes:
> > This raises the question of how dependent the splitter is on the
> > particularities of the document source - I do not really see how
> > different splitters could be useful for one single document. This is
> > perhaps less obvious than it appears, as you may want to use
> > different splitters for documents in different languages. Taken as a
> > whole I would say choosing a splitter would be a decision that had
> > to be taken at indexing time anyway. But perhaps it's just my
> > imagination that is lacking.
>
> Of course, the search must follow the same splitting rules
> as the indexing did. Changing the rules (the splitter
> or its configuration) after indexing will make the index
> inconsistent.

I agree; in fact I think we're saying the same thing. What is more
interesting is how (rather than when) you decide which splitter to use.
With heterogeneous documents I'd think it would be difficult to decide
automagically...

Rik
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
These are good ideas to improve the TextIndex. I already encouraged Erik
to put it all together into a Fishbowl proposal,

Andreas

- Original Message -
From: "Dieter Maurer" <[EMAIL PROTECTED]>
To: "Rik Hoekstra" <[EMAIL PROTECTED]>
Cc: "Chris McDonough" <[EMAIL PROTECTED]>; "Erik Enge" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, June 18, 2001 4:59 PM
Subject: Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)

> Rik Hoekstra writes:
> > This raises the question of how dependent the splitter is on the
> > particularities of the document source - I do not really see how
> > different splitters could be useful for one single document. This is
> > perhaps less obvious than it appears, as you may want to use
> > different splitters for documents in different languages. Taken as a
> > whole I would say choosing a splitter would be a decision that had
> > to be taken at indexing time anyway. But perhaps it's just my
> > imagination that is lacking.
>
> There are lots of things you may want to change based on
> experience with your index:
>
>  * change the set of token boundary characters
>
>    They define where words are broken out.
>
>  * change the set of removed characters
>
>    They are removed from the words, usually for normalization.
>    In German, e.g., you can write both "Auto-Lackierer"
>    and "Autolackierer". You want to normalize
>    these different spellings.
>
>  * change the set of "composing" characters
>
>    German is very rich in composite terms.
>    You may want to index under each component term.
>    For this, you need the rules on how the composition
>    is built.
>    For text, it is usually '-'. But if you have
>    computer sources, '_' or ':' may be relevant, too.
>
> Of course, the search must follow the same splitting rules
> as the indexing did. Changing the rules (the splitter
> or its configuration) after indexing will make the index
> inconsistent.
>
> Dieter
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
Rik Hoekstra writes:
 > This raises the question of how dependent the splitter is on the
 > particularities of the document source - I do not really see how
 > different splitters could be useful for one single document. This is
 > perhaps less obvious than it appears, as you may want to use
 > different splitters for documents in different languages. Taken as a
 > whole I would say choosing a splitter would be a decision that had
 > to be taken at indexing time anyway. But perhaps it's just my
 > imagination that is lacking.

There are lots of things you may want to change based on
experience with your index:

 * change the set of token boundary characters

   They define where words are broken out.

 * change the set of removed characters

   They are removed from the words, usually for normalization.
   In German, e.g., you can write both "Auto-Lackierer"
   and "Autolackierer". You want to normalize
   these different spellings.

 * change the set of "composing" characters

   German is very rich in composite terms.
   You may want to index under each component term.
   For this, you need the rules on how the composition
   is built.
   For text, it is usually '-'. But if you have
   computer sources, '_' or ':' may be relevant, too.

Of course, the search must follow the same splitting rules
as the indexing did. Changing the rules (the splitter
or its configuration) after indexing will make the index
inconsistent.

Dieter
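The three kinds of character sets Dieter lists can be pictured with a small sketch. This is modern Python and not the Zope Splitter interface; `make_splitter` and its `boundary`, `removed` and `composing` parameters are hypothetical names chosen only for illustration, and lower-casing every word is an added assumption.

```python
import re

def make_splitter(boundary=" \t\n.,;!?", removed="'", composing="-"):
    """Build a split(text) function from three character sets.

    boundary  -- characters at which words are broken out (must not be empty)
    removed   -- characters deleted from words for normalization
    composing -- characters that join the parts of a composite term
    """
    boundary_re = re.compile("[" + re.escape(boundary) + "]+")
    composing_re = re.compile("[" + re.escape(composing) + "]+") if composing else None
    drop_all = str.maketrans("", "", removed + composing)
    drop_removed = str.maketrans("", "", removed)

    def split(text):
        words = []
        for token in boundary_re.split(text.lower()):
            whole = token.translate(drop_all)
            if not whole:
                continue
            # Normalized spelling of the whole term, e.g. "autolackierer".
            words.append(whole)
            if composing_re is not None:
                parts = [p.translate(drop_removed) for p in composing_re.split(token)]
                parts = [p for p in parts if p]
                if len(parts) > 1:
                    # Also index under each component term.
                    words.extend(parts)
        return words

    return split

text_split = make_splitter()
print(text_split("Auto-Lackierer und Autolackierer"))

source_split = make_splitter(removed="", composing="_:")
print(source_split("Foo::bar_baz"))
```

Because the word list depends on all three sets, a query split with different settings than the ones used at indexing time may look up words that were never stored, which is the inconsistency described above.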
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
The Splitter interface is not really documented. However, Zope 2.4 has
much better support for 3rd party splitters.

Andreas

- Original Message -
From: "R. David Murray" <[EMAIL PROTECTED]>
To: "Chris McDonough" <[EMAIL PROTECTED]>
Cc: "Erik Enge" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, June 18, 2001 11:39 AM
Subject: Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)

> On Sun, 17 Jun 2001, Chris McDonough wrote:
> > index_object, because the splitter return has all the words
> > in order, even the dupes... as you iterate, you can mutate
>
> Is this part of the current formal Splitter Interface? If not,
> it needs to be if other code is going to depend on it.
>
> Oh, yeah, and where is the formal Splitter interface documented?
> I don't see anything in SearchIndex, and a search for "splitter interface"
> on zope.org didn't turn up anything useful.
>
> --RDM
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Sun, 17 Jun 2001, Chris McDonough wrote:
> index_object, because the splitter return has all the words
> in order, even the dupes... as you iterate, you can mutate

Is this part of the current formal Splitter Interface? If not,
it needs to be if other code is going to depend on it.

Oh, yeah, and where is the formal Splitter interface documented?
I don't see anything in SearchIndex, and a search for "splitter interface"
on zope.org didn't turn up anything useful.

--RDM
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
> > Once you're satisfied with the implementation, would you be willing
> > to submit the module to the collector?
>
> Do you think you (or someone else for that matter) could have a look at
> [1] the method that returns the position in the document - positionInDoc()
> - to see how that could be made to run much faster? Maybe it is how it
> is used... It is too slow to be very useful when indexing large amounts
> of data.
>
> Anyway, I suck at making Python fast (or using it the right way, which
> ever I've fallen prey to this time ;-), and any hints would be greatly
> appreciated.
>
> I've been indexing and searching a lot this weekend, and bar that problem
> with the indexing-speed it seems ok and I have no issues submitting it to
> the Collector.

Doing something similar (in fact what I needed was citations of word
usage) I took a two step approach, with the idea that most of the actual
returning of results would have to be done on a much smaller subset of
documents than if you'd have to index all documents with word indexes and
positions.

I use a normal textindex for querying. Then if a document is returned by
the query I start processing the documents. This requires parsing the
query in a slightly different way (throw out the NOTs). The two step
approach has the advantage that you can postpone processing actual
documents until you return the results for the specific documents.

Using your positionInDoc will require a _lot_ of processing (why does it
use string.split btw and not Splitter?; why split on " " and not on
string.whitespace?). I have used string.find for finding word positions,
which is probably faster than looping over a list of words.

BTW, I'd rather use Splitter, but word positions appeared not to be
reliable (bug, or something I didn't understand; anyhow, string.find works
for me and is fast)

    import string

    def splitit(txt, word):
        # collect the character offset of every occurrence of word in txt
        positions = []
        start = 0
        while 1:
            res = string.find(txt, word, start)
            if res == -1:
                break
            else:
                start = res + 1
                positions.append(res)
        return positions

Using re would perhaps also be an option, but allowing regular
expressions will complicate searching a lot, so I use the globbing lexicon
for expanding and then do the matching on the expanded items (if necessary
- not if using [wordpart]*)

Advantages of using this approach:
- it's faster.
- it splits up the query processing part in different subparts, which also
  contributes to speeding things up.
- it's also more flexible, as you can divide searching and parsing over
  different webrequests, and even make them dependent on the number of
  results. For example: why return text fragments from all documents if
  your users will not be able to see all the results anyway. Or why return
  all fragments containing word combinations from one single document
  while returning a few occurrences from different documents is more
  useful for your users. Note that this will mainly affect returning text
  fragments, which may or may not be useful.

There are also a couple of disadvantages (as I see them, but there may be
more):
- it only works with exact word positions (character offsets) and not
  with word numbers in a text. The within two words approach may be
  remedied by using string.split on substrings, however, if really needed.
  Depending on your purposes an even rougher approach is taking some
  default length for words (this is a bit faster). These are not very
  elegant solutions, though.
- because the approach is not so coupled with (Z)Catalog, integration
  strategies are less obvious (at least for me)
- the positionIndex might be used for further processing as is; in my
  approach this is less obvious.

another 2 cents

Rik
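Rik's two step approach can be sketched roughly as follows. This is modern Python, and `word_index`, `documents`, `two_step_search` and `limit` are hypothetical stand-ins for illustration, not ZCatalog or TextIndex APIs: step one narrows the candidates with an ordinary word index, step two scans only the documents that will actually be shown.

```python
def find_positions(text, word):
    """Return the character offset of every occurrence of word in text."""
    positions = []
    start = 0
    while True:
        hit = text.find(word, start)
        if hit == -1:
            return positions
        positions.append(hit)
        start = hit + 1

def two_step_search(word_index, documents, words, limit=10):
    """Step 1: intersect the doc-id sets of a plain word index.
    Step 2: locate the words, but only in the first `limit` hits."""
    candidates = None
    for word in words:
        ids = word_index.get(word, set())
        candidates = set(ids) if candidates is None else candidates & ids
    results = {}
    for doc_id in sorted(candidates or ())[:limit]:
        text = documents[doc_id]
        results[doc_id] = {w: find_positions(text, w) for w in words}
    return results

documents = {1: "zope catalog and zope index", 2: "catalog only", 3: "zope index"}
word_index = {"zope": {1, 3}, "catalog": {1, 2}, "index": {1, 3}}
print(two_step_search(word_index, documents, ["zope", "index"]))
```

The `limit` argument reflects the point about not producing fragments for results the user will never see; note that the offsets are character positions, so the word-distance limitation listed above still applies.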
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
Chris McDonough wrote:
>
> It just occurred to me that depending on the splitter to do
> positions makes it impossible to alter the splitter without
> reindexing the whole text index... but I think this is a
> reasonable tradeoff. Other opinions welcome.
>

This raises the question of how dependent the splitter is on the
particularities of the document source - I do not really see how different
splitters could be useful for one single document. This is perhaps less
obvious than it appears, as you may want to use different splitters for
documents in different languages. Taken as a whole I would say choosing a
splitter would be a decision that had to be taken at indexing time anyway.
But perhaps it's just my imagination that is lacking.

There is a much greater dependence on the lexicon here. And indeed several
different lexicons could be applied to a set of documents depending on
what is wanted.

my 2 cents

Rik
RE: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
[EMAIL PROTECTED] writes:
 > A lot of folks who do "power searches," say, librarians or other trained
 > researchers, familiar with the bells and whistles of more powerful search
 > engines, will want a simple operator for proximity, with the ability to
 > specify proximity depth:
 >
 > For example:
 >
 > Lexis-Nexis: Sean w/2 Upton (where w/2 is within 2 words)
 >              Also, lexis doesn't count stop-words in proximity
 >              indexes.
 > Folio/Nextpage: "Sean Upton"@2
 >
 > IMHO, the syntax is clean and very brief in the Lexis-Nexis case and should
 > supplement a more generic
 >     Sean ... Upton
 > style search.

I do not think it is a good idea to have an infix operator
for proximity searches.

This combines just 2 words, but proximity searches may involve
more than two words: a set of words, near together
(e.g. in one paragraph, sentence, within x words).

Dieter
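Dieter's point that proximity may involve a whole set of words can be illustrated with a sliding-window check. This is only a sketch in modern Python: `within_distance`, the shape of `positions` (word mapped to its word positions in one document) and `max_distance` are assumptions made for the example, not part of any index discussed in the thread.

```python
from collections import Counter

def within_distance(positions, max_distance):
    """True if every word occurs at least once inside a span whose first
    and last matched positions differ by at most max_distance."""
    if not positions or any(not plist for plist in positions.values()):
        return False
    # Merge all occurrences into one list ordered by position.
    events = sorted((pos, word) for word, plist in positions.items() for pos in plist)
    in_window = Counter()
    left = 0
    for pos, word in events:
        in_window[word] += 1
        # Drop occurrences that are now too far behind the current position.
        while pos - events[left][0] > max_distance:
            old = events[left][1]
            in_window[old] -= 1
            if in_window[old] == 0:
                del in_window[old]
            left += 1
        if len(in_window) == len(positions):
            return True
    return False

print(within_distance({"sean": [0, 10], "upton": [12]}, 2))
print(within_distance({"a": [1], "b": [5], "c": [3, 40]}, 4))
```

A two-word infix operator is then just the special case where `positions` has two entries; a paragraph or sentence scope would need positions recorded in those units instead of word counts.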
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Sun, 17 Jun 2001 21:05:47 +0200 (CEST)
Erik Enge <[EMAIL PROTECTED]> wrote:
> On Fri, 15 Jun 2001, Chris McDonough wrote:
>
> > Once you're satisfied with the implementation, would you be willing
> > to submit the module to the collector?
>
> Do you think you (or someone else for that matter) could have a look at
> [1] the method that returns the position in the document - positionInDoc()
> - to see how that could be made to run much faster? Maybe it is how it
> is used... It is too slow to be very useful when indexing large amounts
> of data.

Erik,

It looks like you call proximityInsert for each item
returned from the splitter on the doc source. Instead of
looking for the position in the source document by splitting
the source up again within proximityInsert, you can keep a
simple counter while you iterate over the splitter return in
index_object, because the splitter return has all the words
in order, even the dupes... as you iterate, you can mutate
the position entry for that word/documentId pair within
proximityInsert. You never actually need to manually split
the document source, instead just always rely on the
splitter to bust up the doc, and manipulate the position
list in place. This is not the most efficient way, but it's
more efficient than your current way.

Therefore, the bit in index_object becomes:

    i = 0
    for word in splitter(source):
        self.proximityInsert(word, documentId, i)
        i = i + 1

The proximityInsert method becomes:

    def proximityInsert(self, word, documentId, i):
        """Insert proximity information about this wid (word id) in
        the index' proximity bucket."""
        wid = self.getWid(word)
        prox = self._proximity
        if not prox.has_key(wid):
            prox[wid] = IOBTree()
        if not prox[wid].has_key(documentId):
            # first occurrence of this word in this document
            prox[wid][documentId] = [i]
        elif i not in prox[wid][documentId]:
            prox[wid][documentId].append(i)
        else:
            return
        self._p_changed = 1

.. and the positionInDoc method goes away.

I didn't scan too hard for what else in the source this
would break.

> Anyway, I suck at making Python fast (or using it the right way, which
> ever I've fallen prey to this time ;-), and any hints would be greatly
> appreciated.
>
> I've been indexing and searching a lot this weekend, and bar that problem
> with the indexing-speed it seems ok and I have no issues submitting it to
> the Collector.

Cool...

> [1] http://nittin.net/erik/software/PositionIndex/PositionIndex.py
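The counter idea above can be tried outside Zope with a toy index. This is a sketch under stated assumptions: `TinyPositionIndex` is a hypothetical class using plain dicts instead of BTrees and `str.split` instead of a real Splitter, written in modern Python; the `adjoined` method only illustrates what an exact-phrase lookup could do with the stored positions.

```python
class TinyPositionIndex:
    """Toy word-position index: word -> {doc_id: [positions]}."""

    def __init__(self, splitter=str.split):
        self._splitter = splitter
        self._positions = {}

    def index_object(self, doc_id, source):
        # One pass over the splitter output; the loop counter is the position,
        # so the source never has to be split again to locate a word.
        for i, word in enumerate(self._splitter(source)):
            self._positions.setdefault(word, {}).setdefault(doc_id, []).append(i)

    def positions(self, word, doc_id):
        return self._positions.get(word, {}).get(doc_id, [])

    def adjoined(self, first, second):
        """Doc ids in which `second` directly follows `first`."""
        hits = set()
        for doc_id, plist in self._positions.get(first, {}).items():
            following = set(self.positions(second, doc_id))
            if any(p + 1 in following for p in plist):
                hits.add(doc_id)
        return hits

index = TinyPositionIndex()
index.index_object(1, "erik enge wrote the index")
index.index_object(2, "enge erik")
print(index.adjoined("erik", "enge"))
```

Since positions are simply the order in which the splitter yields words, swapping the splitter changes every stored position, which is why the index would need a full reindex afterwards.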
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
It just occurred to me that depending on the splitter to do
positions makes it impossible to alter the splitter without
reindexing the whole text index... but I think this is a
reasonable tradeoff. Other opinions welcome.

On Sun, 17 Jun 2001 15:57:20 -0400
"Chris McDonough" <[EMAIL PROTECTED]> wrote:
> It looks like you call proximityInsert for each item
> returned from the splitter on the doc source. Instead of
> looking for the position in the source document by splitting
> the source up again within proximityInsert, you can keep a
> simple counter while you iterate over the splitter return in
> index_object, because the splitter return has all the words
> in order, even the dupes... as you iterate, you can mutate
> the position entry for that word/documentId pair within
> proximityInsert. You never actually need to manually split
> the document source, instead just always rely on the
> splitter to bust up the doc, and manipulate the position
> list in place. This is not the most efficient way, but it's
> more efficient than your current way.
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Fri, 15 Jun 2001, Chris McDonough wrote:

> Once you're satisfied with the implementation, would you be willing
> to submit the module to the collector?

Do you think you (or someone else for that matter) could have a look at
[1] the method that returns the position in the document - positionInDoc()
- to see how that could be made to run much faster? Maybe it is how it
is used... It is too slow to be very useful when indexing large amounts
of data.

Anyway, I suck at making Python fast (or using it the right way, which
ever I've fallen prey to this time ;-), and any hints would be greatly
appreciated.

I've been indexing and searching a lot this weekend, and bar that problem
with the indexing-speed it seems ok and I have no issues submitting it to
the Collector.

[1] http://nittin.net/erik/software/PositionIndex/PositionIndex.py
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Sat, 16 Jun 2001 [EMAIL PROTECTED] wrote:

> Lexis-Nexis: Sean w/2 Upton (where w/2 is within 2 words)

This wouldn't be hard to make happen. I don't know if it is better to do
it before or after the parsers, though. Maybe a more user-friendly alias
would be best as a default?
RE: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
A lot of folks who do "power searches," say, librarians or other trained
researchers, familiar with the bells and whistles of more powerful search
engines, will want a simple operator for proximity, with the ability to
specify proximity depth:

For example:

Lexis-Nexis:    Sean w/2 Upton (where w/2 is within 2 words)
                Also, lexis doesn't count stop-words in proximity
                indexes.
Folio/Nextpage: "Sean Upton"@2

IMHO, the syntax is clean and very brief in the Lexis-Nexis case and should
supplement a more generic
    Sean ... Upton
style search.

Sean

-Original Message-
From: Chris McDonough [mailto:[EMAIL PROTECTED]]
Sent: Saturday, June 16, 2001 2:59 AM
To: Erik Enge
Cc: [EMAIL PROTECTED]
Subject: Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)

Erik Enge wrote:
>
> On Fri, 15 Jun 2001, Chris McDonough wrote:
>
> > Once you're satisfied with the implementation, would you be willing
> > to submit the module to the collector?
>
> Will do. Have you thought about how users actually are to use
> exact-phrase? What I'm thinking I will do here (currently I've only been
> testing explicitly with "adjoinedby" in the query) is to insert
> "adjoinedby" in phrased searches:
>
>   "erik enge"   -> erik adjoinedby enge
>   erik ... enge -> erik near enge
>
> What do you think?

These both look like good spellings, and I think "erik near enge" would
be a good alias for "erik ... enge" as well..

- C
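A `w/N` operator of the kind Sean describes could be prototyped as below. This sketch makes several assumptions that the thread does not settle: "within N words" is read as a position difference of at most N in either order, stop words are removed before counting, and `STOP_WORDS`, `parse_within` and `matches_within` are hypothetical names in modern Python.

```python
import re

# Illustrative stop-word list only; a real lexicon would supply its own.
STOP_WORDS = {"the", "a", "of", "and"}

def parse_within(query):
    """Parse 'left w/N right' into (left, N, right), or None if it does not match."""
    match = re.fullmatch(r"\s*(\S+)\s+w/(\d+)\s+(\S+)\s*", query)
    if match is None:
        return None
    return match.group(1).lower(), int(match.group(2)), match.group(3).lower()

def matches_within(text, query):
    """True if both query words occur at most N non-stop-words apart."""
    parsed = parse_within(query)
    if parsed is None:
        return False
    left, n, right = parsed
    # Positions are counted after stop words have been dropped.
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    left_pos = [i for i, w in enumerate(words) if w == left]
    right_pos = [i for i, w in enumerate(words) if w == right]
    return any(abs(a - b) <= n for a in left_pos for b in right_pos)

print(parse_within("Sean w/2 Upton"))
print(matches_within("Sean of the Upton family", "Sean w/2 Upton"))
```

Whether stop words should count toward the distance is exactly the kind of rule that has to match between indexing and searching.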
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
Erik Enge wrote:
>
> On Fri, 15 Jun 2001, Chris McDonough wrote:
>
> > Once you're satisfied with the implementation, would you be willing
> > to submit the module to the collector?
>
> Will do. Have you thought about how users actually are to use
> exact-phrase? What I'm thinking I will do here (currently I've only been
> testing explicitly with "adjoinedby" in the query) is to insert
> "adjoinedby" in phrased searches:
>
>   "erik enge"   -> erik adjoinedby enge
>   erik ... enge -> erik near enge
>
> What do you think?

These both look like good spellings, and I think "erik near enge" would
be a good alias for "erik ... enge" as well..

- C
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Fri, 15 Jun 2001, Chris McDonough wrote:

> Once you're satisfied with the implementation, would you be willing
> to submit the module to the collector?

Will do. Have you thought about how users actually are to use
exact-phrase? What I'm thinking I will do here (currently I've only been
testing explicitly with "adjoinedby" in the query) is to insert
"adjoinedby" in phrased searches:

  "erik enge"   -> erik adjoinedby enge
  erik ... enge -> erik near enge

What do you think?

I'll be submitting PositionIndex.py and ResultList.py in a day or two.
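The two spellings Erik proposes amount to a small query rewrite. The following is a minimal sketch in modern Python; `rewrite_query` is a hypothetical helper, and only the operator names `adjoinedby` and `near` come from the message above.

```python
import re

def rewrite_query(query):
    """Rewrite quoted phrases and '...' into explicit operators."""
    def phrase(match):
        # "erik enge" becomes: erik adjoinedby enge
        return " adjoinedby ".join(match.group(1).split())

    query = re.sub(r'"([^"]*)"', phrase, query)
    # erik ... enge becomes: erik near enge
    query = re.sub(r"\s*\.\.\.\s*", " near ", query)
    return query.strip()

print(rewrite_query('"erik enge"'))
print(rewrite_query("erik ... enge"))
```

Doing the rewrite before the regular query parser runs keeps the parser itself unchanged, at the cost of treating every double quote as a phrase delimiter.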
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
Erik,

Once you're satisfied with the implementation, would you be willing
to submit the module to the collector?

- C

- Original Message -
From: "Erik Enge" <[EMAIL PROTECTED]>
To: "Chris McDonough" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Friday, June 15, 2001 11:53 AM
Subject: Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)

> On Thu, 14 Jun 2001, Erik Enge wrote:
>
> > To be really useful I think the PossitionIndex' _proximity dictionary
> > needs to be turned into a BTree of some sort, but apart from that I
> > don't know what is missing.
>
> It's now using BTrees. And I renamed it to PositionIndex (thanks to
> Chris Withers for this :-).
>
> > And speed might be a problem, haven't really tested that yet. Will
> > during the weekend though.
>
> I indexed 30,000 objects using PositionIndex and searching (both
> exact-phrase and near) is very fast. It doesn't seem to be bloated,
> either (the _proximity-attribute, that is).
>
> Do you guys have a testing-suite for indexes? Maybe some I can apply to
> this index of mine?
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Thu, 14 Jun 2001, Erik Enge wrote:

> To be really useful I think the PossitionIndex' _proximity dictionary
> needs to be turned into a BTree of some sort, but apart from that I
> don't know what is missing.

It's now using BTrees. And I renamed it to PositionIndex (thanks to
Chris Withers for this :-).

> And speed might be a problem, haven't really tested that yet. Will
> during the weekend though.

I indexed 30,000 objects using PositionIndex and searching (both
exact-phrase and near) is very fast. It doesn't seem to be bloated,
either (the _proximity-attribute, that is).

Do you guys have a testing-suite for indexes? Maybe some I can apply to
this index of mine?
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Thu, 14 Jun 2001, Chris McDonough wrote:

> Excellent! I haven't looked at it in detail, but thanks very much for
> contributing it! Maybe we can roll some of this work into a
> position-aware Text Index

It is actually a TextIndex on steroids. Remove the _proximity attribute
and a couple of methods and what you are left with is a standard
TextIndex. So I think what you already have is a position-aware
TextIndex. That's how I'm planning to use it anyway :)

> or maybe even a new kind of Pluggable Index.

:-)
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
Excellent! I haven't looked at it in detail, but thanks very much for
contributing it! Maybe we can roll some of this work into a
position-aware Text Index, or maybe even a new kind of Pluggable Index.

- C

- Original Message -
From: "Erik Enge" <[EMAIL PROTECTED]>
To: "Chris McDonough" <[EMAIL PROTECTED]>
Cc: "Oren Yosifon" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Thursday, June 14, 2001 12:45 PM
Subject: Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)

> On Thu, 14 Jun 2001, Erik Enge wrote:
>
> > Me got a patch: http://nittin.net/erik/software/PossitionIndex
>
> And I should mention that it has only been tested on Zope 2.3.2.
>
> (BTW, thanks, Chris, for suggesting how to code it.)
Re: PossitionIndex (was: Re: [Zope-dev] ZCatalog phrase indexingrevisited)
On Thu, 14 Jun 2001, Erik Enge wrote:

> Me got a patch: http://nittin.net/erik/software/PossitionIndex

And I should mention that it has only been tested on Zope 2.3.2.

(BTW, thanks, Chris, for suggesting how to code it.)