On Sat, Dec 11, 2010 at 12:02 AM, <wjhon...@aol.com> wrote:
> In a message dated 12/10/2010 2:58:08 PM Pacific Standard Time,
> jamesmikedup...@googlemail.com writes:
>
> my idea was that you will want to search pages that are referenced by
> Wikipedia already. In my work on Kosovo, it would be very helpful
> because there are lots of bad results on Google, and it would be nice
> to use that also to see how many times certain names occur.
> That is why we also need our own indexing engine. I would like to
> count the occurrences of each term and what page they occur on, and to
> xref that against names on Wikipedia. Wanted pages could also
> be assisted like this: what are the most wanted pages that match
> against the most common terms in the new refindex or also existing
> pages.
>
> Well then all you would need to do is cross-reference the refs themselves.
> You don't need to cache the underlying pages to which they refer.

Well, I was hoping to look at all the pages that Wikipedia considers
valuable enough to be referenced, and to find new information on those
pages for other articles. I don't think it is enough to just look at the
references on Wikipedia itself; we should resolve them and look at
those pages, and also build a list of sites for possible full indexing,
or at least some spidering.

> So in your new search engine, when you search for "Mary, Queen of Scots" you
> really are saying, show me those external references, which are mentioned,
> in connection with Mary Queen of Scots, by Wikipedia.

Not really. Find all pages referenced in total by Wikipedia that
contain the term "Mary, Queen of Scots"; maybe someone added a site to
an article on King Henry that contains the text "Mary, Queen of Scots"
and that has not been referenced yet. Show me the occurrences of the
term, their frequency, maybe the sentence or paragraph each occurs in,
a link to the page, and the ability to see the cached version if the
site is down. The page could also be cached on another site, if it is
the same version.

> That doesn't require caching the pages to which refs refer. It only
> requires indexing those refs which currently are used in-world.

Well, indexing normally means caching as well, public or private. You
need to copy the pages into the memory of a computer to index them.
Best is to store them on disk. The first step will be to collect all
references, of course, but the second step will be to resolve them.
This is also good for checking for dead references and marking them as
such.

_______________________________________________
foundation-l mailing list
foundation-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l