Re: [Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion From Wikipedia Text

Robert Stojnic Sun, 08 Apr 2012 03:53:24 -0700


Hello,

Yep, generating the wordnet itself is a challenging and interesting project. I was simply commenting on the Lucene part, i.e. on a possible application.

Currently the Lucene backend works by employing some very general rules (e.g. titles get the highest score, then the first sentence in the article, then the first paragraph, then words occurring in clusters, e.g. within ~20 words, etc.). However, in many cases they fail.
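
As a rough illustration of rules of this kind, here is a toy scorer. It is only a sketch and not the actual lucene-search code: the weights, the proximity bonus, and all function names are made up for the example; only the ordering (title, first sentence, first paragraph, ~20-word clusters) comes from the description above.

```python
import re

# Hypothetical weights; the real backend's values are not given in this thread.
WEIGHTS = {"title": 8.0, "first_sentence": 4.0, "first_paragraph": 2.0}
PROXIMITY_BONUS = 1.0
WINDOW = 20  # "within ~20 words"


def tokens(text):
    """Lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def first_sentence(text):
    """Everything up to the first sentence-ending punctuation mark."""
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]


def first_paragraph(text):
    """Everything up to the first blank line."""
    return text.strip().split("\n\n", 1)[0]


def within_window(body_tokens, terms, window=WINDOW):
    """True if all query terms occur inside some span of `window` words."""
    terms = set(terms)
    if not terms:
        return False
    for start in range(len(body_tokens)):
        if terms <= set(body_tokens[start:start + window]):
            return True
    return False


def score(query, title, body):
    """Sum weighted term hits per field, plus a bonus for clustered terms."""
    q = tokens(query)
    fields = {
        "title": tokens(title),
        "first_sentence": tokens(first_sentence(body)),
        "first_paragraph": tokens(first_paragraph(body)),
    }
    total = 0.0
    for name, toks in fields.items():
        present = set(toks)
        total += WEIGHTS[name] * sum(1 for t in q if t in present)
    if within_window(tokens(body), q):
        total += PROXIMITY_BONUS
    return total
```

A scorer like this ranks "Mao (surname)" well for the query "mao surname" but has no way to help with "mao last name", which is the kind of failure described next.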

I found it helpful to run a number of queries and then see when and why the search fails to identify the most relevant article. When wordnet is mentioned, two examples come to mind, both of which are currently unsolved. One is a query of the type "mao last name", where the relevant article is "mao (surname)". If we are lucky, the article will have the words "last name" somewhere in it and the search won't totally fail; however, it would be nice if the algorithm knew that "last name" == "surname". Another is when the query is of the type "population of africa" and the article is "African population". That is, it would be helpful if the backend knew of language constructs like "x of y" == "x-an y". I wonder if a Wordnet type of approach can find those cases as well.
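
To make the two cases concrete, here is a minimal sketch of the kind of rewriting I have in mind. The lookup tables and the function name are hypothetical and hand-written for the example; in practice they would have to come from the wordnet, which is exactly the open question.

```python
import re

# Hypothetical phrase-level equivalences ("last name" == "surname").
PHRASE_SYNONYMS = {"last name": "surname"}

# Hypothetical noun -> adjective table for the "x of y" == "y-an x" rule.
ADJECTIVE_FORMS = {"africa": "african"}


def expand_query(query):
    """Return the normalized query followed by any rewritten variants."""
    q = " ".join(query.lower().split())
    variants = [q]

    # Replace known multi-word phrases with their single-word equivalent.
    for phrase, synonym in PHRASE_SYNONYMS.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, q):
            variants.append(re.sub(pattern, synonym, q))

    # Rewrite "x of y" as "<adjective of y> x" when the adjective is known.
    match = re.fullmatch(r"(\w+) of (\w+)", q)
    if match and match.group(2) in ADJECTIVE_FORMS:
        variants.append(f"{ADJECTIVE_FORMS[match.group(2)]} {match.group(1)}")

    return variants
```

With these tables, "mao last name" also yields "mao surname", and "population of Africa" also yields "african population". Queries that match neither rule come back unchanged.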


Cheers, Robert


On 06/04/12 17:54, Oren Bochman wrote:

Hi Robert Stojnic and Gautham Shankar

I wanted to let Gautham know that he has written a great proposal, and to
thank you for the feedback as well.

I wanted to point out that, from my point of view, the main goal of this
multilingual wordnet isn't query expansion, but rather a means for ever
greater cross-language capabilities in search and content analytics. A
wordnet seme can be further disambiguated using a topic map algorithm run,
which would consider all the contexts like you suggest. But this is planned
for later, and so the wordnet would be a milestone.
To further clarify, Gautham's integration will place cross-language seme
wordnet tokens during indexing for words it recognises, allowing the ranking
algorithm to use knowledge drawn from all the Wikipedia articles.
(For example, one part of the ranking would peek into a featured article in
German on "A", rank it >> "B" (featured in Hungarian), and use them as
oracles to rank A >> B >> ... in English, where the picture might now be
X >> Y >> Z >> ... B >> A ...)

(I mention in passing that I have begun to develop a dataset for use with open
relevance to systematically review and evaluate dramatic changes to relevance
due to changes in the search engine. I will post on this in due course as
it matures, since I am working on a number of smaller projects I'd like to
demo at Wikimania.)

On Fri, Apr 6, 2012 at 6:01 PM, Gautham Shankar<
gautham.shan...@hiveusers.com>  wrote:

Robert Stojnic<rainmansr<at>  gmail.com>  writes:


Hi Gautham,

I think mining Wiktionary is an interesting project. However, about the
more practical Lucene part: at some point I tried using wordnet to
expand queries, but I found that it introduces too many false
positives. The most challenging part, I think, is *context-based*
expansion. I.e. a simple synonym-based expansion is of no use because it
introduces too many meanings that the user didn't quite have in mind.
However, if we could somehow use the words in the query to pick a
meaning from a set of possible meanings, that could be really helpful.
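
A minimal sketch of what I mean by context-based expansion, under the assumption that each sense comes with a few context words. The sense inventory, the "jaguar" example, and the function name are invented for illustration and are not taken from lucene-search.

```python
# Hypothetical sense inventory: each sense of an ambiguous word lists
# context words that signal it and the expansions that are safe to add.
SENSES = {
    "jaguar": [
        {"context": {"car", "engine", "dealer"}, "synonyms": ["jaguar cars"]},
        {"context": {"cat", "animal", "habitat"}, "synonyms": ["panthera onca"]},
    ],
}


def expand_with_context(query):
    """Expand a word only when the other query words point to one sense."""
    words = query.lower().split()
    expansions = []
    for word in words:
        others = set(words) - {word}
        best, best_overlap = None, 0
        for sense in SENSES.get(word, []):
            overlap = len(sense["context"] & others)
            if overlap > best_overlap:
                best, best_overlap = sense, overlap
        # With no supporting context, add nothing rather than every meaning.
        if best is not None:
            expansions.extend(best["synonyms"])
    return expansions
```

The point of the design is the fallback: a bare "jaguar" gets no expansion at all, which avoids the false positives that plain synonym expansion produces.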

You can look into existing lucene-search source to see how I used
wordnet. I think in the end I ended up using it only for very obvious
stuff (e.g. 11 = eleven, UK = United Kingdom, etc.).

Cheers, r.

On 06/04/12 01:58, Gautham Shankar wrote:

Hello,

Based on the feedback I received, I have updated my proposal page.

https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc

There are about 20 hours until the deadline, and any final feedback would be
useful.
I have also submitted the proposal at the GSOC page.

Regards,
Gautham Shankar
_______________________________________________
Wikitech-l mailing list
Wikitech-l<at>  lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Hi Robert,

Thank you for your feedback.
Like you pointed out, query expansion using the wordnet data directly
reduces the quality of the search.

I found this research paper very interesting.
http://www.sftw.umac.mo/~fstzgg/dexa2005.pdf
They have built a TSN (Term Semantic Network) for the given query based on
the usage of words in the documents. The expansion words obtained from the
wordnet are then filtered out based on the TSN data.

I did not add this detail to my proposal since I thought it deals more
with the creation of the wordnet. I would love to implement the TSN concept
once the wordnet is complete.

Regards,
Gautham Shankar




