Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

2009-09-18 Thread Aryeh Gregor
On Mon, Aug 24, 2009 at 1:15 PM, David Gerard dger...@gmail.com wrote:
 Not me, the guy who did the website :-) It did occur to me to wonder
 if he'd just reinvented PageRank from first mathematical principles
 ...

I'm pretty sure the mathematics of PageRank are pretty well known.  My
linear algebra textbook (Lax's Linear Algebra and Its Applications)
has a whole chapter on matrices with positive entries -- [[Perron's
Theorem]], principal eigenvectors and so on, exactly what these guys
are talking about.  The chapter even has a sentence somewhere in there
along the lines of "This is the principle underlying Google's search
algorithm."
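
For anyone who wants to see the textbook idea in miniature, here is a rough
sketch of PageRank as a principal-eigenvector computation by power iteration.
It is the standard published formulation, not necessarily what the
SmartWikiSearch site does. The damping value 0.85 is the commonly cited
default, and `pagerank` and `toy_links` are made-up names for illustration.
With damping below 1 every entry of the transition matrix is positive, which
is the setting where Perron's theorem gives a unique positive principal
eigenvector.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1.0 - damping) / n + damping * dangling / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            for target in targets:
                # Each page splits its damped rank equally among its links.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-article link graph, for illustration only.
toy_links = {
    "Bee": ["Pollination", "Honey"],
    "Flower": ["Pollination"],
    "Pollination": ["Bee", "Flower"],
    "Honey": ["Bee"],
}
scores = pagerank(toy_links)
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

The scores always sum to 1, so they can be read as the long-run share of time
a random surfer spends on each page.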

___
WikiEN-l mailing list
WikiEN-l@lists.wikimedia.org
To unsubscribe from this mailing list, visit:
https://lists.wikimedia.org/mailman/listinfo/wikien-l


Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

2009-09-17 Thread Jay Litwyn
David Gerard dger...@gmail.com wrote in message 
news:fbad4e140908241015w1a6aa836jcf34ca962e088...@mail.gmail.com...
 2009/8/24 Jay Litwyn brewh...@freenet.edmonton.ab.ca:

 [http://en.wikipedia.org/wiki/PageRank
 This is an approximation of what David Gerard has arrived at with his own
 method, Mister Johnson.]

 Not me, the guy who did the website :-) It did occur to me to wonder
 if he'd just reinvented PageRank from first mathematical principles

We, the developers and copyright holders, are not responsible for your
believing a word of this, or implementing the contents of this. We didn't
sell it to you. It is public knowledge or misinformation as defined by you
under the Free Software Foundation's and GNU public license
(http://www.gnu.org/copyleft/gpl.html) authenticity control. Any
similarity between this document and existing patents is unintentional and
purely coincidental. 






Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

2009-08-24 Thread Jay Litwyn
wjhon...@aol.com wrote in message 
news://news.gmane.org/d55.57c09b43.37c33...@aol.com...
 In a message dated 8/23/2009 4:53:57 AM Pacific Daylight Time,
 brewh...@freenet.edmonton.ab.ca writes:


 The search for bees and flowers suggests pollination. I do not see
 anything mindless about that. That is a human association

 -

 You're not understanding me.  An article discussing bees and mentioning
 that they pollinate flowers IS a human association.  I didn't say it wasn't.
 However the meta-network of *all* such associations to the nth degree of
 relatedness is not something a human can encompass in one bite.  That's one
 thing.

 What I was stating is that this meta-network itself, is created by a
 computer algorithm, which ITSELF has no mind.  It has no idea what the terms
 mean, or refer to, or imply.  It only knows that they are associated in some
 way.  It creates this meta-network and ranks the associations in a mindless
 way, i.e. without comprehension.  That's what I meant.

People maintain the database (or meta-network, as you call it)--a collection
of data and pointers, of words and associations. It is not important that the
machine has no comprehension of the links (pointers) in the database or of the
eigenvectors that it is calculating. As long as human input is represented in
that database, there is a foundation. Yes, it is a mechanical process, like
anything on computers, complete with errors and an incomplete understanding of
idiom. The point is that it delivers the impression of a smart search.
___
Cat Zen Master: What is the sound of one paw slashing? 






Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

2009-08-24 Thread Jay Litwyn
[http://en.wikipedia.org/wiki/PageRank
This is an approximation of what David Gerard has arrived at with his own 
method, Mister Johnson.] 






Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

2009-08-24 Thread David Gerard
2009/8/24 Jay Litwyn brewh...@freenet.edmonton.ab.ca:

 [http://en.wikipedia.org/wiki/PageRank
 This is an approximation of what David Gerard has arrived at with his own
 method, Mister Johnson.]


Not me, the guy who did the website :-) It did occur to me to wonder
if he'd just reinvented PageRank from first mathematical principles
...


- d.



Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

2009-08-23 Thread Jay Litwyn
wjhon...@aol.com wrote in message news:cfe.5d50bcc3.37c1a...@aol.com...
 In a message dated 8/22/2009 10:56:20 AM Pacific Daylight Time,
 dger...@gmail.com writes:


 Because there is no need to determine what the meaning of
 the particular term or keyword is, the pages it returns generally deal
 with the same concept or concepts that you entered. For instance, if
 you enter Flower and Bee, it will find pages where these two
 concepts overlap - those are pages about pollination.
 ---

 This seems big to me.
 It's creating, in a mindless way, semantic relationships between keywords.

The search for bees and flowers suggests pollination. I do not see
anything mindless about that. That is a human association. In another one,
honey comes from nectar in flowers, and gets its flavour from them. So, the
idea is to rank pages that connect the words to each other higher than the
AND-search alone, while the AND-search gets a higher rank than the OR-search.
Works for me. You can get similar results on web pages if users do a good job
of filling out descriptions, keywords, classification, and title tags.
Pollination and honey should be at the top.
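
A rough sketch of that three-tier ordering, under one assumed reading of
"connecting": a page connects the query terms when the article for every term
links to it. That reading, the page data, and the names `match_tier` and
`tiered_search` are all hypothetical, only there to make the tiers concrete.

```python
CONNECTS, AND_MATCH, OR_MATCH, NO_MATCH = 3, 2, 1, 0


def match_tier(title, pages, terms):
    """Return the best tier this page reaches for the query terms."""
    # Assumed meaning of "connecting": each term's own article links here.
    if all(term in pages and title in pages[term]["links"] for term in terms):
        return CONNECTS
    found = terms & pages[title]["words"]
    if found == terms:
        return AND_MATCH   # page mentions every term
    if found:
        return OR_MATCH    # page mentions at least one term
    return NO_MATCH


def tiered_search(pages, query):
    """List matching titles, best tier first, alphabetical within a tier."""
    terms = {term.lower() for term in query}
    scored = [(match_tier(title, pages, terms), title) for title in pages]
    ordered = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [title for tier, title in ordered if tier > NO_MATCH]


# Hypothetical mini-wiki: words each page mentions and pages it links to.
pages = {
    "bee": {"words": {"bee", "honey", "flower"}, "links": {"pollination", "honey"}},
    "flower": {"words": {"flower", "petal"}, "links": {"pollination"}},
    "pollination": {"words": {"bee", "flower", "pollen"}, "links": {"bee", "flower"}},
    "honey": {"words": {"honey", "bee", "flower"}, "links": {"bee"}},
    "petal": {"words": {"flower", "petal"}, "links": {"flower"}},
}
results = tiered_search(pages, ["Bee", "Flower"])
print(results)
```

On this toy data the page that both term articles link to comes first, the
pages mentioning both words come next, and single-word matches come last.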

 This has been thought about for a long time it seems, but no one has really
 solved the annoying issue of how to avoid most false positives.  I don't
 think you can avoid them all because English is so ambiguous, but the use of
 cross-links is a major leap forward.

 Very few people are going to link up concepts that are basically minor, but
 scanning all pages for the links highlights the semantic connections between
 concepts.  You could even take it one step further and use the semantic web
 to point out semantic connections that are not directly obvious, such as a
 leap from beekeeper to honeycomb.  Try to do that using Google.  You get
 thousands of bad hits before you get the one good one.

 Search for Hillbillies and Movie, using a semantic web you get the
 exact hit you want.

 W.J.

 






Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

2009-08-23 Thread WJhonson
In a message dated 8/23/2009 4:53:57 AM Pacific Daylight Time, 
brewh...@freenet.edmonton.ab.ca writes:


 The search for bees and flowers suggests pollination. I do not see 
 anything mindless about that. That is a human association

-

You're not understanding me.  An article discussing bees and mentioning 
that they pollinate flowers IS a human association.  I didn't say it wasn't.  
However the meta-network of *all* such associations to the nth degree of 
relatedness is not something a human can encompass in one bite.  That's one 
thing.

What I was stating is that this meta-network itself, is created by a
computer algorithm, which ITSELF has no mind.  It has no idea what the terms
mean, or refer to, or imply.  It only knows that they are associated in some
way.  It creates this meta-network and ranks the associations in a mindless
way, i.e. without comprehension.  That's what I meant.

W.J.



Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

2009-08-22 Thread Brian
On Sat, Aug 22, 2009 at 12:05 PM, Gwern Branwen gwe...@gmail.com wrote:


 I tried this out the other day; it's a very cool idea, but by and
 large, it seems that this hacker doesn't have enough CPU power to
 extract the really good wikilinks, the ones that aren't already linked
 inside the article. (eg. if I try it on [[Encyclopedia of the Brethren
 of Purity]], I have to go all the way down to find a suggestion which
 isn't already linked by the article.)

 Perhaps in a decade we'll have enough computing power on the servers
 that this could be a plugin - we'd then have auto-generated See Alsos,
 which would be really cool.

 --
 gwern


A fancy technique called Latent Dirichlet Allocation can be used to find
links that aren't already present in the document itself. I did this for
a class project. Here is an excerpt from the paper which also shows you the
latent connections it found for the Simple article on hippies.

http://upload.wikimedia.org/wikipedia/meta/2/25/LDA-Wiki-Search.png

I note that Google has released parallel LDA, so it's not feasible to run it
on all of Wikipedia using an ordinary Beowulf cluster.
http://code.google.com/p/plda/
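
To give a feel for the technique, here is a minimal single-machine sketch of
LDA using collapsed Gibbs sampling, followed by a comparison of topic mixtures
to suggest related documents. It is not the class project's code and not plda;
the hyperparameters (`alpha`, `beta`, `sweeps`), the toy documents, and the
function names are all assumptions chosen for illustration.

```python
import math
import random


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one topic mixture per document."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]   # topic counts per document
    topic_word = [{} for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                 # total words per topic
    assignments = []

    def bump(d, word, topic, delta):
        doc_topic[d][topic] += delta
        topic_word[topic][word] = topic_word[topic].get(word, 0) + delta
        topic_total[topic] += delta

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for word, topic in zip(doc, assignments[d]):
            bump(d, word, topic, +1)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove this occurrence, then resample its topic from the
                # conditional distribution given all other assignments.
                bump(d, word, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(word, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                assignments[d][i] = rng.choices(range(num_topics), weights=weights)[0]
                bump(d, word, assignments[d][i], +1)

    # Smoothed topic proportions for each document.
    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical toy corpus; real input would be tokenised article text.
toy_docs = [
    "hippie counterculture music festival peace".split(),
    "music festival rock band concert".split(),
    "bee flower pollination nectar honey".split(),
    "honey bee hive nectar flower".split(),
]
mixtures = lda_gibbs(toy_docs)
# Rank the other documents by how close their topic mixture is to document 0.
neighbours = sorted(range(1, len(toy_docs)), key=lambda d: -cosine(mixtures[0], mixtures[d]))
print(neighbours)
```

Documents whose mixtures are close but which do not already link to each
other would be the candidate "latent" links. The sampler is random, so
results on a corpus this small can vary with the seed.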


Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

2009-08-22 Thread Brian
On Sat, Aug 22, 2009 at 12:24 PM, Brian brian.min...@colorado.edu wrote:

 On Sat, Aug 22, 2009 at 12:05 PM, Gwern Branwen gwe...@gmail.com wrote:


 I tried this out the other day; it's a very cool idea, but by and
 large, it seems that this hacker doesn't have enough CPU power to
 extract the really good wikilinks, the ones that aren't already linked
 inside the article. (eg. if I try it on [[Encyclopedia of the Brethren
 of Purity]], I have to go all the way down to find a suggestion which
 isn't already linked by the article.)

 Perhaps in a decade we'll have enough computing power on the servers
 that this could be a plugin - we'd then have auto-generated See Alsos,
 which would be really cool.

 --
 gwern


 A fancy technique called Latent Dirichlet Allocation can be used to find
 links that aren't already present in the document itself. I did this for
 a class project. Here is an excerpt from the paper which also shows you the
 latent connections it found for the Simple article on hippies.

 http://upload.wikimedia.org/wikipedia/meta/2/25/LDA-Wiki-Search.png

 I note that Google has released parallel lda so its not feasible to run it
 on all of wikipedia using an ordinary Beowulf cluster.
 http://code.google.com/p/plda/


* now feasible


Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

2009-08-22 Thread Amory Meltzer
I have a feeling a lot of those are duplications of templates placed on a
page - Macbeth linking to Romeo and Juliet (and vice versa) was my first
example.  Multiple search terms would seem to be the real place this would
be useful, to minimize crossover from templates.

~A


On Sat, Aug 22, 2009 at 14:25, Brian brian.min...@colorado.edu wrote:

 On Sat, Aug 22, 2009 at 12:24 PM, Brian brian.min...@colorado.edu wrote:

  On Sat, Aug 22, 2009 at 12:05 PM, Gwern Branwen gwe...@gmail.com
 wrote:
 
 
  I tried this out the other day; it's a very cool idea, but by and
  large, it seems that this hacker doesn't have enough CPU power to
  extract the really good wikilinks, the ones that aren't already linked
  inside the article. (eg. if I try it on [[Encyclopedia of the Brethren
  of Purity]], I have to go all the way down to find a suggestion which
  isn't already linked by the article.)
 
  Perhaps in a decade we'll have enough computing power on the servers
  that this could be a plugin - we'd then have auto-generated See Alsos,
  which would be really cool.
 
  --
  gwern
 
 
  A fancy technique called Latent Dirichlet Allocation can be used to find
  links that aren't already present in the document itself. I did this for
  a class project. Here is an excerpt from the paper which also shows you
  the latent connections it found for the Simple article on hippies.
 
  http://upload.wikimedia.org/wikipedia/meta/2/25/LDA-Wiki-Search.png
 
  I note that Google has released parallel lda so its not feasible to run
 it
  on all of wikipedia using an ordinary Beowulf cluster.
  http://code.google.com/p/plda/
 

 * now feasible



Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

2009-08-22 Thread Amory Meltzer
Duplicating the function of templates - maybe not the best word to use.  A
better one might be misinterpreting.

~A


On Sat, Aug 22, 2009 at 14:36, Brian brian.min...@colorado.edu wrote:

 On Sat, Aug 22, 2009 at 12:34 PM, Amory Meltzer amorymelt...@gmail.com
 wrote:

  I have a feeling a lot of those are duplications of templates placed on a
  page - Macbeth linking to Romeo and Juliet (and vice versa) was my first
  example.  Multiple search terms would seem to be the real place this
 would
  be useful, to minimize crossover from templates.
 
  ~A
 


 Which duplications?



Re: [WikiEN-l] SmartWikiSearch, a similarity search engine for Wikipedia

2009-08-22 Thread WJhonson
In a message dated 8/22/2009 10:56:20 AM Pacific Daylight Time, 
dger...@gmail.com writes:


 Because there is no need to determine what the meaning of
 the particular term or keyword is, the pages it returns generally deal
 with the same concept or concepts that you entered. For instance, if
 you enter Flower and Bee, it will find pages where these two
 concepts overlap - those are pages about pollination.
---

This seems big to me.
It's creating, in a mindless way, semantic relationships between keywords.

This has been thought about for a long time it seems, but no one has really 
solved the annoying issue of how to avoid most false positives.  I don't 
think you can avoid them all because English is so ambiguous but the use of 
cross-links is a major leap forward.

Very few people are going to link up concepts that are basically minor, but
scanning all pages for the links highlights the semantic connections between
concepts.  You could even take it one step further and use the semantic web
to point out semantic connections that are not directly obvious, such as a
leap from beekeeper to honeycomb.  Try to do that using Google.  You get
thousands of bad hits before you get the one good one.
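
One way to picture such a leap is as a shortest chain of links between two
articles. The sketch below uses a plain breadth-first search over a made-up
link graph; the article names, their links, and the function `link_path` are
hypothetical, and this is not a description of how SmartWikiSearch works.

```python
from collections import deque


def link_path(links, start, goal):
    """Return the shortest chain of links from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Visit neighbours in sorted order so the result is repeatable.
        for neighbour in sorted(links.get(path[-1], ())):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Hypothetical link graph, for illustration only.
toy_graph = {
    "Beekeeper": {"Beehive", "Honey bee"},
    "Beehive": {"Honeycomb", "Honey bee"},
    "Honey bee": {"Honey"},
    "Honey": {"Honeycomb"},
}
path = link_path(toy_graph, "Beekeeper", "Honeycomb")
print(" -> ".join(path))
```

Here the two articles are not linked directly, but the search still finds
the intermediate article that joins them.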

Search for Hillbillies and Movie, using a semantic web you get the 
exact hit you want.

W.J.
