Re: [Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion From Wikipedia Text

Gautham Shankar Thu, 12 Apr 2012 02:28:42 -0700

Robert Stojnic <rainmansr <at> gmail.com> writes:

> 
> 
> Hello,
> 
> Yep, generating the wordnet itself is a challenging and interesting 
> project. I was simply commenting on the Lucene part, i.e. on a possible 
> application.
> 
> Currently the Lucene backend works by employing some very general rules 
> (e.g. titles get the highest score, then the first sentence in the article, 
> then the first paragraph, then words occurring in clusters, e.g. within 
> ~20 words, etc.). However, in many cases they fail.
> 
> I found it helpful to run a number of queries and then see when/why the 
> search fails to identify the most relevant article. When wordnet is 
> mentioned, two examples come to mind which are both currently unsolved. 
> One is a query of type "mao last name" where the article is "mao (surname)". 
> If we are lucky, the article will have the words "last name" somewhere in 
> it and the search won't totally fail; however, it would be nice 
> if the algorithm knew that "last name" == "surname". Another is when the 
> query is of type "population of africa" and the article is "African 
> population". That is, it would be helpful if the backend knew of 
> language constructs like "x of y" == "y-an x". I wonder if a Wordnet type 
> of approach can find those cases as well.
> 
> Cheers, Robert
> 
> On 06/04/12 17:54, Oren Bochman wrote:
> > Hi Robert Stojnic and Gautham Shankar
> >
> > I wanted to let Gautham know that he has written a great proposal, and
> > thank you for the feedback as well.
> >
> > I wanted to point out that from my point of view the main goal of this
> > multilingual wordnet isn't query expansion, but rather a means for ever
> > greater cross-language capabilities in search and content analytics. A
> > wordnet seme can be further disambiguated using a topic map algorithm,
> > which would consider all the contexts as you suggest. But this is planned
> > for later, and so the wordnet would be a milestone.
> > To further clarify, Gautham's integration will place XrossLanguage-seme
> > Word Net tokens during indexing for words it recognises, allowing the
> > ranking algorithm to use knowledge drawn from all the Wikipedia articles.
> > (For example, one part of the ranking would peek into a featured article
> > in German on "A", rank it >> then "B" featured in Hungarian, and use them
> > as oracles to rank A >> B >> ... in English, where the picture might now
> > be X >> Y >> Z >> ... B >> A ...)
> > I mention in passing that I have begun to develop a dataset for use with
> > open relevance to systematically review and evaluate dramatic changes to
> > relevance due to changes in the search engine. I will post on this in due
> > course as it matures, since I am working on a number of smaller projects
> > I'd like to demo at WikiMania.
> >


Hello,

Thank you, Oren, for your feedback. I would love to work on the wordnet 
creation if given the opportunity.

Regarding Robert's mail: yes, I believe that using a wordnet can solve the 
problem in both of the examples you pointed out.

In the first case, during query expansion, the phrase "last name" would yield 
its synonyms, one of them being "surname". Thus, when the query is run, 
there will be a hit for the article "mao (surname)".

In the second example, the word "Africa" will be drilled down to get derived 
words like "African". In other cases the root word will be found and 
searched for; here, "Africa" is already a root word. Hopefully these 
expansions will solve the language construct problems.
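
To make the two cases concrete, here is a minimal Python sketch of the 
expansion step. It is only an illustration under assumptions: the SYNONYMS 
and DERIVED dictionaries are toy stand-ins for real wordnet lookups, and 
expand_query is a hypothetical helper, not part of the existing Lucene 
backend.

```python
# Toy stand-ins for wordnet lookups (assumption: a real system would
# query the generated wordnet instead of these hard-coded tables).
SYNONYMS = {"last name": ["surname", "family name"]}
DERIVED = {"africa": ["african"]}


def expand_query(query, synonyms=SYNONYMS, derived=DERIVED, max_phrase_len=3):
    """Return (original tokens, expansion terms) for a query string."""
    tokens = query.lower().split()
    expansions = []
    # Try every contiguous phrase, longest first, so that multi-word
    # entries such as "last name" are matched as a unit.
    for n in range(max_phrase_len, 0, -1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            for term in synonyms.get(phrase, []) + derived.get(phrase, []):
                if term not in expansions and term not in tokens:
                    expansions.append(term)
    return tokens, expansions


print(expand_query("mao last name"))
print(expand_query("population of africa"))
```

The expansion terms are returned separately from the original tokens so 
that the search can weight them lower than what the user actually typed.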

Again, the key is to filter out the noise that could come from adding unwanted 
expansion words. For this we will have to measure the relevance of each 
expansion word with respect to the given search query and the existing 
documents. Maybe the TSN concept that I pointed out in the earlier mail would 
help in doing so.
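
As a rough illustration of such a filter (this is not the TSN approach, just 
a simple co-occurrence heuristic I am assuming for the sake of example), 
each single-word candidate could be scored by how often the documents 
containing it also contain one of the original query terms. The function 
name, the threshold and the sample documents below are all hypothetical.

```python
def filter_expansions(query_terms, candidates, documents, min_score=0.5):
    """Keep single-word candidates that co-occur with the query terms.

    score = (docs containing the candidate and at least one query term)
            / (docs containing the candidate)
    """
    doc_tokens = [set(doc.lower().split()) for doc in documents]
    query_set = {t.lower() for t in query_terms}
    kept = []
    for cand in candidates:
        with_cand = [toks for toks in doc_tokens if cand.lower() in toks]
        if not with_cand:
            continue  # candidate never occurs, so it cannot help
        overlap = sum(1 for toks in with_cand if toks & query_set)
        score = overlap / len(with_cand)
        if score >= min_score:
            kept.append((cand, score))
    return kept


docs = [
    "mao is a chinese surname",
    "the surname mao is common",
    "a river bank in spring",
    "the bank of the river",
]
print(filter_expansions(["mao", "last", "name"], ["surname", "bank"], docs))
```

Here "surname" is kept because every document mentioning it also mentions 
"mao", while "bank" is dropped because it never appears alongside a query 
term.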

Regards,
Gautham Shankar



_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
