Hi Harshit,
Please note that DBpedia Spotlight is not the same as DBpedia. The projects
are related, and there is a very close collaboration, but DBpedia Spotlight
is a text annotation tool. DBpedia is a knowledge base produced by the
DBpedia Extraction Framework (DEF). Maybe we shouldn't use the same "family
name" for everything. :)

Enabling a hadoop/pig pipeline that generates the statistics we need for
DBpedia Spotlight is a good start. But a good deal of that is already
available from pignlproc (https://github.com/dicode-project/pignlproc). The
current code is able to count how many times a phrase occurred with a
DBpedia resource, annotated and un-annotated, to generate some statistics
we need. Connecting that with our indexing code is a second step. Our
indexing code uses Lucene at the moment.
http://dbp-spotlight.svn.sourceforge.net/viewvc/dbp-spotlight/trunk/index/src/main/scala/org/dbpedia/spotlight/lucene/index/IndexMergedOccurrences.scala?revision=377&view=markup
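To make the counting step concrete, here is a miniature sketch of the kind of statistics pignlproc gathers, written in plain Python rather than Pig (the function and data shapes are illustrative, not pignlproc's actual API):

```python
from collections import Counter

def count_occurrences(articles):
    """Count how often each (surface form, URI) pair appears as a link
    ("annotated") and how often the same phrase appears as plain text
    ("un-annotated"). `articles` yields (text, links) pairs, where
    `links` lists the (surface_form, uri) anchors found in the text."""
    annotated = Counter()    # (sf, uri) -> number of anchors
    unannotated = Counter()  # sf -> number of non-link mentions
    for text, links in articles:
        for sf, uri in links:
            annotated[(sf, uri)] += 1
        for sf in {sf for sf, _ in links}:
            # total textual occurrences minus the linked ones
            linked = sum(1 for s, _ in links if s == sf)
            unannotated[sf] += max(text.count(sf) - linked, 0)
    return annotated, unannotated

articles = [("Berlin is the capital of Germany. Berlin is large.",
             [("Berlin", "dbpedia:Berlin")])]
ann, unann = count_occurrences(articles)
```

In Pig the same idea becomes a GROUP BY over (surface form, URI) pairs extracted from the wikitext, which is what the pignlproc scripts do at Wikipedia scale.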

We have extended Lucene with our own scoring function. Feel free to study
our paper so that you know what kinds of statistics you will need to gather
in hadoop/pig so that we can produce the scores at query time (I also
summarize them later in this message).
http://wiki.dbpedia.org/spotlight/isemantics2011

Another idea from the ideas page plays nicely with this second step:
indexing the statistics in something other than Lucene for a performance
evaluation. For example, we'd be interested in evaluating something like
Project Voldemort, or another database that would offer better query-time
performance on a single machine, horizontal scaling, etc.

And to top it off, the topical classification comes in. One could extend the
pignlproc scripts to also collect category-specific statistics, so that
they can be used later for topical classification.

We compute p(uri): the probability of seeing a URI as link anchor target in
Wikipedia; p(sf|uri): the prob. of seeing a given phrase (aka surface form)
"sf" as anchor text when "uri" is the target link; and p(sf) the
probability of seeing the surface form "sf" within anchor text. In order to
make this topical, you would include now the category component and
compute: p(uri|cat), p(sf|uri,cat), p(sf|cat). The same works for p(w|uri) and
p(w|uri,cat) where "w" is a word occurring in text around the links in
DBpedia. If you are familiar with language model smoothing or
dimensionality reduction (SVD,PCA,LSI), please mention your plans with
regard to that in your proposal.
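To illustrate how those probabilities fall out of the counts (the toy numbers and the dbpedia: names below are made up for the example; in practice these are the Wikipedia-scale counts from the Pig step):

```python
from collections import Counter

# Toy link counts: (surface_form, uri, category) -> number of anchors.
links = Counter({
    ("Java", "dbpedia:Java_(programming_language)", "Computing"): 8,
    ("Java", "dbpedia:Java", "Geography"): 2,
    ("Python", "dbpedia:Python_(programming_language)", "Computing"): 5,
})
total = sum(links.values())

def p_uri(uri):
    """p(uri): probability of a link anchor targeting `uri`."""
    return sum(c for (_, u, _), c in links.items() if u == uri) / total

def p_sf_given_uri(sf, uri):
    """p(sf|uri): probability of anchor text `sf` given target `uri`."""
    uri_total = sum(c for (_, u, _), c in links.items() if u == uri)
    return sum(c for (s, u, _), c in links.items()
               if s == sf and u == uri) / uri_total

def p_sf(sf):
    """p(sf): probability of seeing `sf` as anchor text at all."""
    return sum(c for (s, _, _), c in links.items() if s == sf) / total

def p_uri_given_cat(uri, cat):
    """p(uri|cat): the topical variant simply restricts all counts
    to anchors occurring in articles of one category."""
    cat_total = sum(c for (_, _, k), c in links.items() if k == cat)
    return sum(c for (_, u, k), c in links.items()
               if u == uri and k == cat) / cat_total
```

p(sf|uri,cat) and p(sf|cat) follow the same pattern, as do p(w|uri) and p(w|uri,cat) once you count context words instead of anchor texts. Note how quickly the category-conditioned counts become sparse, which is where the smoothing question above comes in.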

As far as the execution plan goes, I would start by working on the processing side,
using pig, and generating all the necessary counts. That should be doable
reasonably quickly, and would allow several directions to follow. After
that is done, I would work on either:
- some database-backed indexing of those counts (would be interesting to
contrast SQL and NoSQL for this). Plan for evaluating the time performance
and space requirements of each.
- several algorithms for topical classification based on the data
generated. One challenge that will come up is how to choose categories that
are really indicative of topics (rather than Wikipedia artifacts), which
categories are not too sparse to produce good results, etc. The category
hierarchy both helps and complicates things here. One place to look for
prominent categories is http://en.wikipedia.org/wiki/Portal:Contents/Portals. In
general, including plans for dealing with such issues will make your
proposal stronger.
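As a sketch of the first direction, here is what database-backed storage of the counts could look like, using SQLite to stand in for the SQL side of an SQL-vs-NoSQL comparison (the schema and numbers are illustrative):

```python
import sqlite3

# Store (surface form, URI) counts with a composite primary key,
# so query-time candidate lookup is a single indexed SELECT.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sf_uri_counts (
    surface_form TEXT,
    uri          TEXT,
    count        INTEGER,
    PRIMARY KEY (surface_form, uri))""")
conn.executemany("INSERT INTO sf_uri_counts VALUES (?, ?, ?)", [
    ("Berlin", "dbpedia:Berlin", 120),
    ("Berlin", "dbpedia:Berlin,_New_Hampshire", 3),
])
conn.commit()

# Query-time lookup: candidate URIs for a surface form, most frequent first.
rows = conn.execute(
    "SELECT uri, count FROM sf_uri_counts"
    " WHERE surface_form = ? ORDER BY count DESC",
    ("Berlin",)).fetchall()
```

A key-value store like Voldemort would instead key on the surface form and store the whole candidate list as the value; measuring lookup latency and disk footprint for both layouts is exactly the evaluation mentioned above.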

Please note that I am suggesting several topics related to your questions,
so that you can pick what interests you most. Please select an amount of
work that is compatible with your experience and available time. Awesome
applications contain a detailed, well-informed project plan.

Please let me know if I can help further.

Cheers,
Pablo

On Sat, Mar 17, 2012 at 10:35 AM, Harshit Dubey <[email protected]>
wrote:
>
> Hi all,
>
> I'm a pre-final year undergraduate student from IIIT-Hyderabad. I'm
looking forward to participate in GSOC this year.
> I recently came across Dbpedia and is really interested in the project
ideas. I think doing a GSoC project with Dbpedia will be a great learning
experience.
> I have a lot of experience of working with databases, along with hadoop
and Nosql databases. I am interested in the following projects :
>
>  1. Hadoop-based Indexing : I have done a lot of work that deals with
indexing , searching and retrieving data. I have indexed the entire
wikipedia for a project work.
>  2. Topical classification : I have been pursuing my masters in the field
of data mining , so have worked a lot on classification and clustering
algorithms.
>  3. Text formats
>  4. Better support for short messages
>
> I really appreciate if someone could provide further details, resources I
have to refer to
>
> Thanks and regards
>
> --
> Harshit Dubey
> Undergraduate (Dual degree: Masters + B.tech )
> Computer Science Engineering
> International Institute of Information Technology Hyd
>
_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users