Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia
On Mon, Aug 24, 2009 at 1:15 PM, David Gerard dger...@gmail.com wrote:

Not me, the guy who did the website :-) It did occur to me to wonder if he'd just reinvented PageRank from first mathematical principles ...

I'm pretty sure the mathematics of PageRank is well known. My linear algebra textbook (Lax's Linear Algebra and Its Applications) has a whole chapter on matrices with positive entries -- [[Perron's Theorem]], principal eigenvectors and so on, exactly what these guys are talking about. The chapter even has a sentence somewhere in there along the lines of "This is the principle underlying Google's search algorithm."

___ WikiEN-l mailing list WikiEN-l@lists.wikimedia.org To unsubscribe from this mailing list, visit: https://lists.wikimedia.org/mailman/listinfo/wikien-l
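For readers who want to see the connection between positive matrices, principal eigenvectors and PageRank, here is a minimal sketch. It is not the SmartWikiSearch code, which the thread does not show. It assumes a tiny hypothetical link graph, and the function names and the damping factor of 0.85 are illustrative choices. The damping term makes every matrix entry positive, which is the condition under which Perron's theorem guarantees a unique positive principal eigenvector, and repeated multiplication (power iteration) converges to it.

```python
def principal_eigenvector(matrix, iterations=100, tol=1e-12):
    """Power iteration on a square matrix with positive entries.

    Returns the principal eigenvector scaled so its entries sum to 1.
    """
    n = len(matrix)
    vec = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        nxt = [x / total for x in nxt]
        if max(abs(a - b) for a, b in zip(nxt, vec)) < tol:
            return nxt
        vec = nxt
    return vec


def pagerank(links, damping=0.85):
    """links maps each page title to the list of titles it links to.

    The damping value is an assumption made for this sketch.
    """
    pages = sorted(links)
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    # Start every entry at (1 - damping) / n so the whole matrix is positive.
    matrix = [[(1.0 - damping) / n] * n for _ in range(n)]
    for page, targets in links.items():
        j = index[page]
        targets = [t for t in targets if t in index]
        if targets:
            # Column j spreads the page's weight evenly over its links.
            for t in targets:
                matrix[index[t]][j] += damping / len(targets)
        else:
            # A page with no links spreads its weight over all pages.
            for i in range(n):
                matrix[i][j] += damping / n
    return dict(zip(pages, principal_eigenvector(matrix)))


# Hypothetical four-article link graph, for illustration only.
example_links = {
    "Bee": ["Pollination", "Honey"],
    "Flower": ["Pollination"],
    "Pollination": ["Bee", "Flower"],
    "Honey": ["Bee"],
}
example_ranks = pagerank(example_links)
```

In this toy graph "Pollination" ends up ranked above "Honey", because it receives weight from both "Bee" and "Flower" while "Honey" receives weight only from "Bee".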
Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia
David Gerard dger...@gmail.com wrote in message news:fbad4e140908241015w1a6aa836jcf34ca962e088...@mail.gmail.com...

2009/8/24 Jay Litwyn brewh...@freenet.edmonton.ab.ca:

[http://en.wikipedia.org/wiki/PageRank This is an approximation of what David Gerard has arrived at with his own method, Mister Johnson.]

Not me, the guy who did the website :-) It did occur to me to wonder if he'd just reinvented PageRank from first mathematical principles

We, the developers and copyright holders, are not responsible for your believing a word of this, or implementing the contents of this. We didn't sell it to you. It is public knowledge or misinformation as defined by you under the Free Software Foundation's and GNU public license (http://www.gnu.org/copyleft/gpl.html) authenticity control. Any similarity between this document and existing patents is unintentional and purely coincidental.
Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia
wjhon...@aol.com wrote in message news://news.gmane.org/d55.57c09b43.37c33...@aol.com...

In a message dated 8/23/2009 4:53:57 AM Pacific Daylight Time, brewh...@freenet.edmonton.ab.ca writes:

The search for bees and flowers suggests pollination. I do not see anything mindless about that. That is a human association

-

You're not understanding me. An article discussing bees and mentioning that they pollinate flowers IS a human association. I didn't say it wasn't. However the meta-network of *all* such associations to the nth degree of relatedness is not something a human can encompass in one bite. That's one thing. What I was stating is that this meta-network itself is created by a computer algorithm, which ITSELF has no mind. It has no idea what the terms mean, or refer to, or imply. It only knows that they are associated in some way. It creates this meta-network and ranks the associations in a mindless way, i.e. without comprehension. That's what I meant.

People maintain the database (or meta-network, as you call it)--a collection of data and pointers, of words and associations. It is not important that the machine has no comprehension of links (pointers) in the database or eigenvectors that it is calculating. As long as human input is represented in that database, there is a foundation. Yes, it is a mechanical process, like anything on computers, complete with errors and an incomplete understanding of idiom. The point is that it delivers the impression of a smart search.

___ Cat Zen Master: What is the sound of one paw slashing?
Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia
[http://en.wikipedia.org/wiki/PageRank This is an approximation of what David Gerard has arrived at with his own method, Mister Johnson.]
Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia
2009/8/24 Jay Litwyn brewh...@freenet.edmonton.ab.ca:

[http://en.wikipedia.org/wiki/PageRank This is an approximation of what David Gerard has arrived at with his own method, Mister Johnson.]

Not me, the guy who did the website :-) It did occur to me to wonder if he'd just reinvented PageRank from first mathematical principles ...

- d.
Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia
wjhon...@aol.com wrote in message news:cfe.5d50bcc3.37c1a...@aol.com...

In a message dated 8/22/2009 10:56:20 AM Pacific Daylight Time, dger...@gmail.com writes:

Because there is no need to determine what the meaning of the particular term or keyword is, the pages it returns generally deal with the same concept or concepts that you entered. For instance, if you enter Flower and Bee, it will find pages where these two concepts overlap - those are pages about pollination.

---

This seems big to me. It's creating, in a mindless way, semantic relationships between keywords.

The search for bees and flowers suggests pollination. I do not see anything mindless about that. That is a human association. In another one, honey comes from sap in flowers, and gets flavour from them. So, the idea is to rank words-connecting-each higher than the AND-search alone, while the AND-search gets a higher rank than the OR-search. Works for me. You can get similar results on web pages if users do a good job of filling out descriptions, keywords, classification, and title tags. Pollination and honey should be at the top.

This has been thought about for a long time, it seems, but no one has really solved the annoying issue of how to avoid most false positives. I don't think you can avoid them all because English is so ambiguous, but the use of cross-links is a major leap forward. Very few people are going to link up concepts that are basically minor, but scanning all pages for the links highlights the semantic connections between concepts. You could even take it one step further and use the semantic web to point out semantic connections that are not directly obvious, such as a leap from beekeeper to honeycomb. Try to do that using Google. You get thousands of bad hits before you get the one good one. Search for Hillbillies and Movie; using a semantic web you get the exact hit you want.

W.J.
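The three-tier ranking described above (connecting pages first, then AND matches, then OR matches) could be sketched roughly as follows. This is only one possible reading of "words-connecting-each": a page counts as connecting when the article for every query term links to it. The data layout, the function name and the sample pages are hypothetical, and the substring matching is deliberately naive.

```python
def tiered_search(pages, terms):
    """Return page titles ordered by tier, then alphabetically.

    pages: dict mapping title -> {"text": str, "links": set of titles}
    terms: query terms, each assumed to double as an article title
    Tier 0: every query term's article links to the page (connecting page).
    Tier 1: the page text contains all terms (AND search).
    Tier 2: the page text contains at least one term (OR search).
    """
    wanted = [t.lower() for t in terms]
    have_all_articles = all(t in pages for t in terms)
    ranked = []
    for title, page in pages.items():
        if title in terms:
            continue  # skip the query articles themselves
        text = page["text"].lower()
        hits = sum(1 for w in wanted if w in text)  # naive substring match
        connecting = have_all_articles and all(
            title in pages[t]["links"] for t in terms
        )
        if connecting:
            ranked.append((0, title))
        elif hits == len(wanted):
            ranked.append((1, title))
        elif hits > 0:
            ranked.append((2, title))
    return [title for _, title in sorted(ranked)]


# Hypothetical mini-wiki, for illustration only.
sample_pages = {
    "Bee": {"text": "Bees visit a flower to collect nectar.",
            "links": {"Pollination", "Honey", "Flower"}},
    "Flower": {"text": "A flower attracts the bee.",
               "links": {"Pollination", "Bee"}},
    "Pollination": {"text": "Pollination moves pollen between plants.",
                    "links": {"Bee", "Flower"}},
    "Honey": {"text": "Honey is made by a bee from flower nectar.",
              "links": {"Bee"}},
    "Beekeeping": {"text": "Keeping a bee colony in a hive.",
                   "links": {"Bee", "Honey"}},
    "Macbeth": {"text": "A tragedy by Shakespeare.", "links": set()},
}
sample_results = tiered_search(sample_pages, ["Flower", "Bee"])
```

With this sample data the order is Pollination (linked from both query articles), then Honey (mentions both terms), then Beekeeping (mentions only one). Macbeth is dropped.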
Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia
In a message dated 8/23/2009 4:53:57 AM Pacific Daylight Time, brewh...@freenet.edmonton.ab.ca writes:

The search for bees and flowers suggests pollination. I do not see anything mindless about that. That is a human association

-

You're not understanding me. An article discussing bees and mentioning that they pollinate flowers IS a human association. I didn't say it wasn't. However the meta-network of *all* such associations to the nth degree of relatedness is not something a human can encompass in one bite. That's one thing. What I was stating is that this meta-network itself is created by a computer algorithm, which ITSELF has no mind. It has no idea what the terms mean, or refer to, or imply. It only knows that they are associated in some way. It creates this meta-network and ranks the associations in a mindless way, i.e. without comprehension. That's what I meant.

W.J.
Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia
On Sat, Aug 22, 2009 at 12:05 PM, Gwern Branwen gwe...@gmail.com wrote:

I tried this out the other day; it's a very cool idea, but by and large, it seems that this hacker doesn't have enough CPU power to extract the really good wikilinks, the ones that aren't already linked inside the article. (e.g. if I try it on [[Encyclopedia of the Brethren of Purity]], I have to go all the way down to find a suggestion which isn't already linked by the article.) Perhaps in a decade we'll have enough computing power on the servers that this could be a plugin - we'd then have auto-generated See Alsos, which would be really cool.

-- gwern

A fancy technique called Latent Dirichlet Allocation can be used to find links that aren't already linked in the documents themselves. I did this for a class project. Here is an excerpt from the paper which also shows you the latent connections it found for the Simple article on hippies. http://upload.wikimedia.org/wikipedia/meta/2/25/LDA-Wiki-Search.png

I note that Google has released parallel LDA so it's not feasible to run it on all of Wikipedia using an ordinary Beowulf cluster. http://code.google.com/p/plda/
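Gwern's auto-generated "See also" idea depends on dropping suggestions the article already links to. A minimal sketch of that filtering step follows. It assumes some similarity engine has already produced a ranked list of titles; it is not LDA and not the SmartWikiSearch code. The function name, the limit parameter and the sample titles are hypothetical.

```python
def see_also_candidates(article, ranked_suggestions, existing_links, limit=5):
    """Keep the best-ranked suggestions that the article does not link to yet.

    article: title of the article being edited
    ranked_suggestions: titles ordered from most to least similar
    existing_links: titles already wikilinked inside the article
    """
    seen = {t.lower() for t in existing_links}
    seen.add(article.lower())  # never suggest the article itself
    fresh = []
    for title in ranked_suggestions:
        key = title.lower()
        if key in seen:
            continue  # already linked, or a duplicate suggestion
        seen.add(key)
        fresh.append(title)
        if len(fresh) == limit:
            break
    return fresh


# Hypothetical data, for illustration only.
sample_see_also = see_also_candidates(
    "Bee",
    ["Honey", "Pollination", "flower", "Beekeeping", "Bee", "Pollination"],
    {"Flower", "Honey"},
)
```

Here only "Pollination" and "Beekeeping" survive, since the other suggestions are either already linked from the article, the article itself, or repeats.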
Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia
On Sat, Aug 22, 2009 at 12:24 PM, Brian brian.min...@colorado.edu wrote:

On Sat, Aug 22, 2009 at 12:05 PM, Gwern Branwen gwe...@gmail.com wrote:

I tried this out the other day; it's a very cool idea, but by and large, it seems that this hacker doesn't have enough CPU power to extract the really good wikilinks, the ones that aren't already linked inside the article. (e.g. if I try it on [[Encyclopedia of the Brethren of Purity]], I have to go all the way down to find a suggestion which isn't already linked by the article.) Perhaps in a decade we'll have enough computing power on the servers that this could be a plugin - we'd then have auto-generated See Alsos, which would be really cool.

-- gwern

A fancy technique called Latent Dirichlet Allocation can be used to find links that aren't already linked in the documents themselves. I did this for a class project. Here is an excerpt from the paper which also shows you the latent connections it found for the Simple article on hippies. http://upload.wikimedia.org/wikipedia/meta/2/25/LDA-Wiki-Search.png

I note that Google has released parallel LDA so it's not feasible to run it on all of Wikipedia using an ordinary Beowulf cluster. http://code.google.com/p/plda/

* now feasible
Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia
I have a feeling a lot of those are duplications of templates placed on a page - Macbeth linking to Romeo and Juliet (and vice versa) was my first example. Multiple search terms would seem to be the real place this would be useful, to minimize crossover from templates.

~A

On Sat, Aug 22, 2009 at 14:25, Brian brian.min...@colorado.edu wrote:

On Sat, Aug 22, 2009 at 12:24 PM, Brian brian.min...@colorado.edu wrote:

On Sat, Aug 22, 2009 at 12:05 PM, Gwern Branwen gwe...@gmail.com wrote:

I tried this out the other day; it's a very cool idea, but by and large, it seems that this hacker doesn't have enough CPU power to extract the really good wikilinks, the ones that aren't already linked inside the article. (e.g. if I try it on [[Encyclopedia of the Brethren of Purity]], I have to go all the way down to find a suggestion which isn't already linked by the article.) Perhaps in a decade we'll have enough computing power on the servers that this could be a plugin - we'd then have auto-generated See Alsos, which would be really cool.

-- gwern

A fancy technique called Latent Dirichlet Allocation can be used to find links that aren't already linked in the documents themselves. I did this for a class project. Here is an excerpt from the paper which also shows you the latent connections it found for the Simple article on hippies. http://upload.wikimedia.org/wikipedia/meta/2/25/LDA-Wiki-Search.png

I note that Google has released parallel LDA so it's not feasible to run it on all of Wikipedia using an ordinary Beowulf cluster. http://code.google.com/p/plda/

* now feasible
Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia
Duplicating the function of templates - maybe not the best word to use. A better one might be misinterpreting.

~A

On Sat, Aug 22, 2009 at 14:36, Brian brian.min...@colorado.edu wrote:

On Sat, Aug 22, 2009 at 12:34 PM, Amory Meltzer amorymelt...@gmail.com wrote:

I have a feeling a lot of those are duplications of templates placed on a page - Macbeth linking to Romeo and Juliet (and vice versa) was my first example. Multiple search terms would seem to be the real place this would be useful, to minimize crossover from templates.

~A

Which duplications?
Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia
In a message dated 8/22/2009 10:56:20 AM Pacific Daylight Time, dger...@gmail.com writes:

Because there is no need to determine what the meaning of the particular term or keyword is, the pages it returns generally deal with the same concept or concepts that you entered. For instance, if you enter Flower and Bee, it will find pages where these two concepts overlap - those are pages about pollination.

---

This seems big to me. It's creating, in a mindless way, semantic relationships between keywords. This has been thought about for a long time, it seems, but no one has really solved the annoying issue of how to avoid most false positives. I don't think you can avoid them all because English is so ambiguous, but the use of cross-links is a major leap forward. Very few people are going to link up concepts that are basically minor, but scanning all pages for the links highlights the semantic connections between concepts. You could even take it one step further and use the semantic web to point out semantic connections that are not directly obvious, such as a leap from beekeeper to honeycomb. Try to do that using Google. You get thousands of bad hits before you get the one good one. Search for Hillbillies and Movie; using a semantic web you get the exact hit you want.

W.J.
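The "leap from beekeeper to honeycomb" can be pictured as finding a short chain of wikilinks between two articles. The sketch below uses a plain breadth-first search over a hypothetical link table. It is an illustration of the idea only, not a description of how SmartWikiSearch or any semantic-web tool actually works, and the function name and sample links are made up.

```python
from collections import deque


def link_path(links, start, goal):
    """Return the shortest chain of links from start to goal, or None.

    links: dict mapping an article title to the list of titles it links to
    """
    if start == goal:
        return [start]
    parent = {start: None}  # remembers how each article was reached
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in links.get(page, ()):
            if nxt in parent:
                continue  # already reached by a path at least as short
            parent[nxt] = page
            if nxt == goal:
                # Walk back through the parents to rebuild the chain.
                path = [nxt]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


# Hypothetical link table, for illustration only.
sample_links = {
    "Beekeeper": ["Beehive", "Honey"],
    "Beehive": ["Honeycomb"],
    "Honey": ["Bee"],
    "Bee": ["Flower"],
}
sample_path = link_path(sample_links, "Beekeeper", "Honeycomb")
```

With this table the chain found is Beekeeper, Beehive, Honeycomb. A ranking step such as the eigenvector approach discussed elsewhere in the thread would still be needed to decide which of many possible chains is worth showing.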