Re: [Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion From Wikipedia Text

2012-04-12 Thread Gautham Shankar
Robert Stojnic  gmail.com> writes:

> 
> 
> Hello,
> 
> Yep, generating the wordnet itself is a challenging and interesting
> project. I was simply commenting on the Lucene part, i.e. on a possible
> application.
> 
> Currently the lucene backend works by employing some very general rules
> (e.g. titles get the highest score, then the first sentence in the article,
> then the first paragraph, then words occurring in clusters, e.g. within
> ~20 words, etc.). However, in many cases they fail.
> 
> I found it helpful to run a number of queries and then see when/why the
> search fails to identify the most relevant article. When wordnet is
> mentioned, two examples come to mind which are both currently unsolved.
> One is a query of the type "mao last name" where there is an article
> "mao (surname)". If we are lucky, the article will have the words "last
> name" somewhere in it and the search won't totally fail; however, it would
> be nice if the algorithm knew that "last name" == "surname". Another is
> when the query is of the type "population of africa" and the article is
> "African population". That is, it would be helpful if the backend knew of
> language constructs like "x of y" == "x-an y". I wonder if a WordNet-type
> approach can find those cases as well.
> 
> Cheers, Robert
> 
> On 06/04/12 17:54, Oren Bochman wrote:
> > Hi Robert Stojnic and Gautham Shankar
> >
> > I wanted to let Gautham know that he has written a great proposal, and
> > thank you for the feedback as well.
> >
> > I wanted to point out that, in my view, the main goal of this multilingual
> > wordnet isn't query expansion, but rather a means towards ever greater
> > cross-language capabilities in search and content analytics. A wordnet seme
> > can be further disambiguated using a topic map algorithm that would consider
> > all the contexts, as you suggest. But this is planned for later, and so the
> > wordnet itself would be a milestone.
> > To further clarify: Gautham's integration will place XrossLanguage-seme
> > WordNet tokens during indexing for words it recognises, allowing the ranking
> > algorithm to use knowledge drawn from all the Wikipedia articles.
> > (For example, one part of the ranking would peek at a featured article on
> > "A" in German, then at "B" featured in Hungarian, and use them as oracles
> > to rank A >> B >> ... in English, where the picture might currently be
> > X >> Y >> Z >> ... B >> A ...)
> > I mention in passing that I have begun to develop a dataset for use with
> > Open Relevance to systematically review and evaluate dramatic changes to
> > relevance due to changes in the search engine. I will post on this in due
> > course as it matures, since I am working on a number of smaller projects
> > I'd like to demo at Wikimania.
> >

Hello,

Thank you, Oren, for your feedback; I would love to work on the wordnet
creation if given the opportunity.

Regarding Robert's mail: yes, I believe that using a wordnet will be able to
solve the problem in both of the examples you pointed out.

In the first case, during query expansion the term "last name" would yield its
synonyms, one of them being "surname". Thus, when the query is run, there will
be a hit for the article "mao (surname)".

In the second example, the word "Africa" will be drilled down to obtain derived
words like "African". In other cases the root words will be found and searched
for; here, "Africa" is already a root word. Hopefully these expansions will
solve the language-construct problems.
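
To make that concrete, here is a rough sketch of what the expansion step could
look like on the Lucene side. The Wordnet interface and its two lookup methods
are only placeholders for whatever the Wiktionary-derived wordnet ends up
exposing; this is not existing lucene-search code.

import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class WordnetQueryExpander {

    // Placeholder for the Wiktionary-derived wordnet; how it is stored and
    // queried is exactly what the proposal has to work out.
    public interface Wordnet {
        List<String> synonymsOf(String term);    // e.g. "surname" for "last name"
        List<String> derivedForms(String term);  // e.g. "african" for "africa"
    }

    private final Wordnet wordnet;

    public WordnetQueryExpander(Wordnet wordnet) {
        this.wordnet = wordnet;
    }

    /**
     * Expands a single analysed query term into an OR of the original term
     * plus its down-weighted expansion terms. Multi-word synonyms such as
     * "last name" would really need a PhraseQuery; single terms are used
     * here to keep the sketch short.
     */
    public Query expand(String field, String term) {
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term(field, term)), Occur.SHOULD);
        for (String expansion : wordnet.synonymsOf(term)) {
            addExpansion(q, field, expansion);
        }
        for (String expansion : wordnet.derivedForms(term)) {
            addExpansion(q, field, expansion);
        }
        return q;
    }

    private void addExpansion(BooleanQuery q, String field, String expansion) {
        TermQuery tq = new TermQuery(new Term(field, expansion));
        tq.setBoost(0.5f);  // expansion terms count less than the user's own words
        q.add(tq, Occur.SHOULD);
    }
}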

Again, the key is to filter out the noise that could come from adding unwanted
expansion words. For this we will have to find the relevance of the expansion
words with respect to the given search query and the existing documents. The
TSN concept that I pointed out in the earlier mail may help in doing so.
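
As a very rough illustration of where such a filter would sit in the pipeline
(this is not the algorithm from the DEXA paper, just a crude co-occurrence
check standing in for it; the field name and threshold are invented):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class ExpansionFilter {

    private final IndexSearcher searcher;
    private final String field;
    private final double minCooccurrenceRate;  // made-up tunable, e.g. 0.05

    public ExpansionFilter(IndexSearcher searcher, String field, double minCooccurrenceRate) {
        this.searcher = searcher;
        this.field = field;
        this.minCooccurrenceRate = minCooccurrenceRate;
    }

    /** Keeps only expansion candidates that co-occur with the original query terms. */
    public List<String> filter(List<String> queryTerms, List<String> candidates)
            throws IOException {
        List<String> kept = new ArrayList<String>();
        int queryDocs = countDocs(queryTerms, null);
        if (queryDocs == 0) {
            return kept;  // nothing to compare against, so expand nothing
        }
        for (String candidate : candidates) {
            int bothDocs = countDocs(queryTerms, candidate);
            // keep the candidate only if it appears in a reasonable share of the
            // documents that already match the original query terms
            if ((double) bothDocs / queryDocs >= minCooccurrenceRate) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    private int countDocs(List<String> terms, String extraTerm) throws IOException {
        BooleanQuery q = new BooleanQuery();
        for (String t : terms) {
            q.add(new TermQuery(new Term(field, t)), Occur.MUST);
        }
        if (extraTerm != null) {
            q.add(new TermQuery(new Term(field, extraTerm)), Occur.MUST);
        }
        return searcher.search(q, 1).totalHits;
    }
}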

Regards,
Gautham Shankar



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion From Wikipedia Text

2012-04-08 Thread Robert Stojnic


Hello,

Yep, generating the wordnet itself is a challenging and interesting
project. I was simply commenting on the Lucene part, i.e. on a possible
application.


Currently the lucene backend works by employing some very general rules
(e.g. titles get the highest score, then the first sentence in the article,
then the first paragraph, then words occurring in clusters, e.g. within
~20 words, etc.). However, in many cases they fail.
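
(For readers who haven't looked at the backend: those rules correspond very
roughly to something like the following in plain Lucene. The field names and
boost values here are invented for the illustration; they are not the ones
lucene-search actually uses.)

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class RuleBasedQueryBuilder {

    /** Builds an OR query where title hits outrank opening-text hits,
     *  which outrank plain body hits, plus a proximity bonus. */
    public Query build(String[] terms) {
        BooleanQuery q = new BooleanQuery();
        for (String t : terms) {
            TermQuery title = new TermQuery(new Term("title", t));
            title.setBoost(8.0f);                 // titles get the highest score
            q.add(title, Occur.SHOULD);

            TermQuery opening = new TermQuery(new Term("first_paragraph", t));
            opening.setBoost(3.0f);               // then the opening of the article
            q.add(opening, Occur.SHOULD);

            q.add(new TermQuery(new Term("body", t)), Occur.SHOULD);
        }
        if (terms.length > 1) {
            PhraseQuery cluster = new PhraseQuery();
            cluster.setSlop(20);                  // reward terms appearing within ~20 words
            for (String t : terms) {
                cluster.add(new Term("body", t));
            }
            cluster.setBoost(2.0f);
            q.add(cluster, Occur.SHOULD);
        }
        return q;
    }
}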


I found it helpful to run a number of queries and then see when/why the
search fails to identify the most relevant article. When wordnet is
mentioned, two examples come to mind which are both currently unsolved.
One is a query of the type "mao last name" where there is an article
"mao (surname)". If we are lucky, the article will have the words "last
name" somewhere in it and the search won't totally fail; however, it would
be nice if the algorithm knew that "last name" == "surname". Another is
when the query is of the type "population of africa" and the article is
"African population". That is, it would be helpful if the backend knew of
language constructs like "x of y" == "x-an y". I wonder if a WordNet-type
approach can find those cases as well.


Cheers, Robert


On 06/04/12 17:54, Oren Bochman wrote:

Hi Robert Stojnic and Gautham Shankar

I wanted to let Gautham know that he has written a great proposal, and thank
you for the feedback as well.

I wanted to point out that, in my view, the main goal of this multilingual
wordnet isn't query expansion, but rather a means towards ever greater
cross-language capabilities in search and content analytics. A wordnet seme
can be further disambiguated using a topic map algorithm that would consider
all the contexts, as you suggest. But this is planned for later, and so the
wordnet itself would be a milestone.
To further clarify: Gautham's integration will place XrossLanguage-seme
WordNet tokens during indexing for words it recognises, allowing the ranking
algorithm to use knowledge drawn from all the Wikipedia articles.
(For example, one part of the ranking would peek at a featured article on "A"
in German, then at "B" featured in Hungarian, and use them as oracles to rank
A >> B >> ... in English, where the picture might currently be
X >> Y >> Z >> ... B >> A ...)

I mention in passing that I have begun to develop a dataset for use with
Open Relevance to systematically review and evaluate dramatic changes to
relevance due to changes in the search engine. I will post on this in due
course as it matures, since I am working on a number of smaller projects
I'd like to demo at Wikimania.

On Fri, Apr 6, 2012 at 6:01 PM, Gautham Shankar<
gautham.shan...@hiveusers.com>  wrote:


Robert Stojnic  gmail.com>  writes:



Hi Gautham,

I think mining wiktionary is an interesting project. However, about the
more practical Lucene part: at some point I tried using wordnet to expand
queries; however, I found that it introduces too many false positives. The
most challenging part, I think, is *context-based* expansion. I.e. a simple
synonym-based expansion is of no use because it introduces too many meanings
that the user didn't quite have in mind. However, if we could somehow use
the words in the query to pick a meaning from a set of possible meanings,
that could be really helpful.

You can look into the existing lucene-search source to see how I used
wordnet. I think in the end I ended up using it only for very obvious
stuff (e.g. 11 = eleven, UK = United Kingdom, etc.).

Cheers, r.

On 06/04/12 01:58, Gautham Shankar wrote:

Hello,

Based on the feedback I received, I have updated my proposal page.

https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc

There are about 20 hours left before the deadline, and any final feedback
would be useful.
I have also submitted the proposal at the GSOC page.

Regards,
Gautham Shankar
___
Wikitech-l mailing list
Wikitech-l  lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Hi Robert,

Thank you for your feedback.
As you pointed out, query expansion using the wordnet data directly reduces
the quality of the search.

I found this research paper very interesting:
www.sftw.umac.mo/~fstzgg/dexa2005.pdf
They build a TSN (Term Semantic Network) for the given query based on the
usage of words in the documents. The expansion words obtained from the
wordnet are then filtered based on the TSN data.

I did not add this detail to my proposal since I thought it deals more with
the creation of the wordnet. I would love to implement the TSN concept once
the wordnet is complete.

Regards,
Gautham Shankar



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Hi again




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion From Wikipedia Text

2012-04-06 Thread Oren Bochman
Hi Robert Stojnic and Gautham Shankar

I wanted to let Gautham know that he has written a great proposal, and thank
you for the feedback as well.

I wanted to point out that, in my view, the main goal of this multilingual
wordnet isn't query expansion, but rather a means towards ever greater
cross-language capabilities in search and content analytics. A wordnet seme
can be further disambiguated using a topic map algorithm that would consider
all the contexts, as you suggest. But this is planned for later, and so the
wordnet itself would be a milestone.
To further clarify: Gautham's integration will place XrossLanguage-seme
WordNet tokens during indexing for words it recognises, allowing the ranking
algorithm to use knowledge drawn from all the Wikipedia articles.
(For example, one part of the ranking would peek at a featured article on "A"
in German, then at "B" featured in Hungarian, and use them as oracles to rank
A >> B >> ... in English, where the picture might currently be
X >> Y >> Z >> ... B >> A ...)
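
(A sketch of what placing such seme tokens at indexing time could look like,
written as a plain Lucene TokenFilter. The wordToSeme map and the "SEME:"
prefix are assumptions made up for this example, not an existing lucene-search
feature.)

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

/**
 * Emits a language-independent seme id as an extra token, stacked on the same
 * position as any surface word the wordnet recognises, so that a query
 * analysed the same way in another language can match on the seme.
 */
public final class SemeTokenFilter extends TokenFilter {

    private final Map<String, String> wordToSeme;  // surface form -> seme id
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posAtt =
            addAttribute(PositionIncrementAttribute.class);

    private AttributeSource.State savedState;  // state of the word just returned
    private String pendingSeme;                // seme id still to be emitted

    public SemeTokenFilter(TokenStream input, Map<String, String> wordToSeme) {
        super(input);
        this.wordToSeme = wordToSeme;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pendingSeme != null) {
            // emit the seme token at the same position as the surface word
            restoreState(savedState);
            termAtt.setEmpty().append("SEME:" + pendingSeme);
            posAtt.setPositionIncrement(0);
            pendingSeme = null;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String seme = wordToSeme.get(termAtt.toString());
        if (seme != null) {
            pendingSeme = seme;        // remember it for the next call
            savedState = captureState();
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pendingSeme = null;
        savedState = null;
    }
}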

I mention in passing that I have begun to develop a dataset for use with
Open Relevance to systematically review and evaluate dramatic changes to
relevance due to changes in the search engine. I will post on this in due
course as it matures, since I am working on a number of smaller projects
I'd like to demo at Wikimania.

On Fri, Apr 6, 2012 at 6:01 PM, Gautham Shankar <
gautham.shan...@hiveusers.com> wrote:

> Robert Stojnic  gmail.com> writes:
>
> >
> >
> > Hi Gautham,
> >
> > I think mining wiktionary is an interesting project. However, about the
> > more practical Lucene part: at some point I tried using wordnet to
> > expand queries; however, I found that it introduces too many false
> > positives. The most challenging part, I think, is *context-based*
> > expansion. I.e. a simple synonym-based expansion is of no use because it
> > introduces too many meanings that the user didn't quite have in mind.
> > However, if we could somehow use the words in the query to pick a
> > meaning from a set of possible meanings, that could be really helpful.
> >
> > You can look into the existing lucene-search source to see how I used
> > wordnet. I think in the end I ended up using it only for very obvious
> > stuff (e.g. 11 = eleven, UK = United Kingdom, etc.).
> >
> > Cheers, r.
> >
> > On 06/04/12 01:58, Gautham Shankar wrote:
> > > Hello,
> > >
> > > Based on the feedback I received, I have updated my proposal page.
> > >
> > > https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc
> > >
> > > There are about 20 hours left before the deadline, and any final feedback
> > > would be useful.
> > > I have also submitted the proposal at the GSOC page.
> > >
> > > Regards,
> > > Gautham Shankar
> > > ___
> > > Wikitech-l mailing list
> > > Wikitech-l  lists.wikimedia.org
> > > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> > >
> >
>
> Hi Robert,
>
> Thank you for your feedback.
> As you pointed out, query expansion using the wordnet data directly reduces
> the quality of the search.
>
> I found this research paper very interesting:
> www.sftw.umac.mo/~fstzgg/dexa2005.pdf
> They build a TSN (Term Semantic Network) for the given query based on the
> usage of words in the documents. The expansion words obtained from the
> wordnet are then filtered based on the TSN data.
>
> I did not add this detail to my proposal since I thought it deals more with
> the creation of the wordnet. I would love to implement the TSN concept once
> the wordnet is complete.
>
> Regards,
> Gautham Shankar
>
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

Hi again

-- 

Oren Bochman

Office tel. 061 4921492
Mobile +36 30 866 6706
skype id: orenbochman
e-mail: o...@romai-horizon.com
site http://www.riverport.hu
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion From Wikipedia Text

2012-04-06 Thread Gautham Shankar
Robert Stojnic  gmail.com> writes:

> 
> 
> Hi Gautham,
> 
> I think mining wiktionary is an interesting project. However, about the
> more practical Lucene part: at some point I tried using wordnet to expand
> queries; however, I found that it introduces too many false positives. The
> most challenging part, I think, is *context-based* expansion. I.e. a simple
> synonym-based expansion is of no use because it introduces too many meanings
> that the user didn't quite have in mind. However, if we could somehow use
> the words in the query to pick a meaning from a set of possible meanings,
> that could be really helpful.
> 
> You can look into the existing lucene-search source to see how I used
> wordnet. I think in the end I ended up using it only for very obvious
> stuff (e.g. 11 = eleven, UK = United Kingdom, etc.).
> 
> Cheers, r.
> 
> On 06/04/12 01:58, Gautham Shankar wrote:
> > Hello,
> >
> > Based on the feedback I received, I have updated my proposal page.
> >
> > https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc
> >
> > There are about 20 hours left before the deadline, and any final feedback
> > would be useful.
> > I have also submitted the proposal at the GSOC page.
> >
> > Regards,
> > Gautham Shankar
> > ___
> > Wikitech-l mailing list
> > Wikitech-l  lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
> 

Hi Robert,

Thank you for your feedback.
As you pointed out, query expansion using the wordnet data directly reduces
the quality of the search.

I found this research paper very interesting:
www.sftw.umac.mo/~fstzgg/dexa2005.pdf
They build a TSN (Term Semantic Network) for the given query based on the
usage of words in the documents. The expansion words obtained from the
wordnet are then filtered based on the TSN data.

I did not add this detail to my proposal since I thought it deals more with
the creation of the wordnet. I would love to implement the TSN concept once
the wordnet is complete.

Regards,
Gautham Shankar



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion From Wikipedia Text

2012-04-06 Thread Robert Stojnic


Hi Gautham,

I think mining wiktionary is an interesting project. However, about the
more practical Lucene part: at some point I tried using wordnet to expand
queries; however, I found that it introduces too many false positives. The
most challenging part, I think, is *context-based* expansion. I.e. a simple
synonym-based expansion is of no use because it introduces too many meanings
that the user didn't quite have in mind. However, if we could somehow use
the words in the query to pick a meaning from a set of possible meanings,
that could be really helpful.
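
One simple shape that context-based selection could take is a Lesk-style
overlap between each candidate sense's gloss and the other query words. The
Sense class below is hypothetical; it just stands for whatever per-sense data
a wordnet would give us.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class QueryContextDisambiguator {

    public static class Sense {
        public final String id;
        public final Set<String> glossWords;  // content words from the sense's definition
        public Sense(String id, Set<String> glossWords) {
            this.id = id;
            this.glossWords = glossWords;
        }
    }

    /** Returns the sense whose gloss shares the most words with the rest of
     *  the query, or null if nothing overlaps (in which case not expanding
     *  at all is the safer choice). */
    public Sense pickSense(List<Sense> candidates, Set<String> otherQueryWords) {
        Sense best = null;
        int bestOverlap = 0;
        for (Sense sense : candidates) {
            Set<String> overlap = new HashSet<String>(sense.glossWords);
            overlap.retainAll(otherQueryWords);
            if (overlap.size() > bestOverlap) {
                bestOverlap = overlap.size();
                best = sense;
            }
        }
        return best;
    }
}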


You can look into the existing lucene-search source to see how I used
wordnet. I think in the end I ended up using it only for very obvious
stuff (e.g. 11 = eleven, UK = United Kingdom, etc.).
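
That "very obvious stuff" is essentially a small hand-maintained table. A
sketch of that approach (the table entries, field name and class are
illustrative, not the actual lucene-search list):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ObviousAliases {

    private static final Map<String, String> ALIASES = new HashMap<String, String>();
    static {
        ALIASES.put("11", "eleven");
        ALIASES.put("uk", "united kingdom");
    }

    /** Returns the term as-is, or the term OR'ed with its hand-curated alias. */
    public Query termWithAlias(String field, String term) {
        String alias = ALIASES.get(term.toLowerCase());
        if (alias == null) {
            return new TermQuery(new Term(field, term));
        }
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term(field, term)), Occur.SHOULD);
        String[] aliasWords = alias.split(" ");
        if (aliasWords.length == 1) {
            q.add(new TermQuery(new Term(field, alias)), Occur.SHOULD);
        } else {
            PhraseQuery phrase = new PhraseQuery();  // e.g. "united kingdom" as a phrase
            for (String w : aliasWords) {
                phrase.add(new Term(field, w));
            }
            q.add(phrase, Occur.SHOULD);
        }
        return q;
    }
}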


Cheers, r.

On 06/04/12 01:58, Gautham Shankar wrote:

Hello,

Based on the feedback I received, I have updated my proposal page.

https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc

There are about 20 hours left before the deadline, and any final feedback
would be useful.
I have also submitted the proposal at the GSOC page.

Regards,
Gautham Shankar
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] GSOC 2012: "Lucene Automatic Query Expansion From Wikipedia Text"

2012-04-06 Thread Gautham Shankar
Hi,

I have addressed the issues on my talk page and added a 'Future Project
Maintenance' section to cover maintenance needs.

https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc

Kindly let me know if there are any other changes I have to make.

Thank you for your support,

Regards,
Gautham Shankar
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion From Wikipedia Text

2012-04-05 Thread Gregory Varnum
Also a reminder for folks that this and some other proposals need mentors.

Gautham - thank you for the updated proposal page. I would also solicit
feedback in our IRC channel if you can, and connect with interested mentors:
https://www.mediawiki.org/wiki/GSOC#Mentor_signup

https://www.mediawiki.org/wiki/MediaWiki_on_IRC

-Greg aka varnent

___
Sent from my iPad. Apologies for any typos. A more detailed response may be 
sent later.

On Apr 5, 2012, at 8:58 PM, Gautham Shankar  
wrote:

> Hello,
> 
> Based on the feedback I received, I have updated my proposal page.
> 
> https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc
> 
> There are about 20 hours left before the deadline, and any final feedback
> would be useful.
> I have also submitted the proposal at the GSOC page.
> 
> Regards,
> Gautham Shankar
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion From Wikipedia Text

2012-04-05 Thread Gautham Shankar
Hello,

Based on the feedback I received, I have updated my proposal page.

https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc

There are about 20 hours left before the deadline, and any final feedback
would be useful.
I have also submitted the proposal at the GSOC page.

Regards,
Gautham Shankar
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion from Wikipedia Text

2012-04-04 Thread Gregory Varnum
Greetings,

Thank you for putting this proposal together.

I would expand a bit on how you plan to implement this.  The why and what seem 
reasonably clear to me in your proposal, but I'd be curious what others think.

You'll also want to look at the GSOC page on MW.org and ask on IRC to aid your
efforts to find an interested mentor.

-greg


On Apr 4, 2012, at 6:25 PM, Gautham Shankar  
wrote:

> Hello,
> 
> I'm Gautham Shankar from India, pursuing the 4th year of my bachelor's in
> computer science and engineering. I find the project proposal "Lucene
> Automatic Query Expansion from Wikipedia Text" in GSOC 2012 very interesting
> and would love to work on it.
> 
> I have created a proposal for the idea:
> 
> https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc
> 
> I have experience in data mining and have built a recommendation framework
> using the heat diffusion principle, which has been tested on the AOL search
> dataset to recommend better queries that can be typed for a given input
> query. It has been implemented in Java. Since it is a framework, it can be
> used to recommend different types of data; for example, the same framework
> can be used to recommend movies as well as music. I'm currently working on
> an extension of this project to add social network graphs so as to recommend
> people who have the same interests in movies, music, etc. when a query is
> typed.
> 
> I have also built a web-based product, "hive", which is a networking platform
> for members of the power generation industry. The users can share their
> experiences, and it is an open forum where members interact with one another
> to effectively run their machines and solve common problems. The product has
> been implemented using PHP, MySQL and JavaScript (including AJAX). Lucene is
> the search engine and phpBB is used for the forums.
> 
> It would be very helpful if anyone could give feedback and guide me in
> improving the proposal.
> 
> Eagerly awaiting a response.
> 
> Regards,
> Gautham Shankar
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] GSOC 2012 : Lucene Automatic Query Expansion from Wikipedia Text

2012-04-04 Thread Gautham Shankar
Hello,

I'm Gautham Shankar from India, pursuing the 4th year of my bachelor's in
computer science and engineering. I find the project proposal "Lucene
Automatic Query Expansion from Wikipedia Text" in GSOC 2012 very interesting
and would love to work on it.

I have created a proposal for the idea:

https://www.mediawiki.org/wiki/User:Gautham_shankar/Gsoc

I have experience in data mining and have built a recommendation framework
using the heat diffusion principle, which has been tested on the AOL search
dataset to recommend better queries that can be typed for a given input
query. It has been implemented in Java. Since it is a framework, it can be
used to recommend different types of data; for example, the same framework
can be used to recommend movies as well as music. I'm currently working on an
extension of this project to add social network graphs so as to recommend
people who have the same interests in movies, music, etc. when a query is
typed.

I have also built a web-based product, "hive", which is a networking platform
for members of the power generation industry. The users can share their
experiences, and it is an open forum where members interact with one another
to effectively run their machines and solve common problems. The product has
been implemented using PHP, MySQL and JavaScript (including AJAX). Lucene is
the search engine and phpBB is used for the forums.

It would be very helpful if anyone could give feedback and guide me in
improving the proposal.

Eagerly awaiting a response.

Regards,
Gautham Shankar
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l