Dear Pablo,

I did mean Chinese support rather than a Chinese *version*. Thanks for
the instructions! I now have a general idea of the scope of this
subtask. Before digging into the details, I'd like to share a concern.

These days I've been thinking about the project plan. As discussed,
there are three tasks in this GSoC project: (1) Stanbol wrappers, (2)
the /feedback service, and (3) Chinese support. I'm very interested in
working on all of them this summer, and I'd like to study and
contribute to the DBpedia Spotlight community. However, many technical
details of the three tasks still need further discussion, which will
certainly take time, and I'm not confident I can complete all of them
within the GSoC timeline [1].

Would it make sense to you if I extended the work somewhat beyond the
program? GSoC projects end in August, but I propose the following
plan: complete task (1) before the mid-term evaluation (early July);
complete task (2) before the GSoC program closes (late August); and
complete task (3) before October (yes, outside the GSoC program). The
order of the tasks is open to discussion; the main idea is to complete
two of the tasks within the GSoC program, and I can commit to
finishing the third before October.

The reason is the PhD course schedule at my school. The current term
ends in early June, and before then I need time to prepare for course
examinations. After that I'm free until the next term starts in early
October. Following the GSoC timeline for all three tasks may therefore
be difficult for me, and I'd greatly appreciate it if you could take
my situation into account. I regard this GSoC project as an
opportunity to get into the DBpedia Spotlight community, and I'd like
to keep contributing to this great open-source Semantic Web project
even after GSoC ends. Can we be more flexible about the timeline in
the project plan?

To satisfy Jimmy's curiosity: I also turned to my friend Tao Lin for
advice. He completed two individual subtasks for LanguageTool (which,
unfortunately, was not accepted as a mentor org this year) in a GSoC
2011 project [2]. He suggested not taking on too much, in order to
guarantee the quality of the project. He also offered some suggestions
on how to make a successful GSoC application: (1) good communication
with the community and mentors about the project requirements, (2) a
decent and detailed project plan, and (3) demonstrated ability to
complete the project. Are these good enough? I'll try my best.

Yours faithfully,
Siwei Yu


[1] http://www.google-melange.com/gsoc/events/google/gsoc2012
[2] http://languagetool.wikidot.com/start
How to Use Indexer and Searcher for Fast Rule Evaluation (GSoC 2011)
Developing Chinese rules (GSoC 2011)



On Thu, Mar 29, 2012 at 4:52 PM, Pablo Mendes <[email protected]> wrote:
> Hi Siwei,
>
> On Mar 29, 2012 9:44 AM, "Siwei Yu" <[email protected]> wrote:
>>
>> Dear Pablo,
>>
>> I'm glad to see your reply about the details of the /feedback service.
>> It's really a cool thing. I'm clear about it now, and I think I can
>> make it this summer.
>>
>> Besides the Stanbol wrappers and the /feedback service, I'm more
>> interested in developing Chinese version of DBpedia Spotlight.
>
> We are not interested in a Chinese *version* of DBpedia Spotlight as much as
> we are interested in introducing support for Chinese language in one unified
> DBpedia Spotlight. But I think that's what you mean. If that's the case,
> make sure your proposal states that clearly.
>
>> Because
>> I'm a Chinese student, who studies Chinese NLP as well. When I was a
>> master student, I worked on developing Chinese analyzer for Lucene
>> 1.x. It involved Chinese chunking algorithm and POS tagger. Now,
>> Lucene 3.x contains built in SmartChineseAnalyzer [1] based on Hidden
>> Markov Model. A friend of mine worked on Chinese support for
>> LanguageTool [2], which used ictclas4j [3] as the Chinese POS tagger.
>> Note that the licence of ictclas4j is Apache License 2.0. I know quite
>> some other Chinese NLP tools based on comprehensive survey, but they
>> are with GPL/LGPL licence. Licence issue is the reason why
>> LanguageTool chooses ictclas4j. So, we may consider ictclas4j as the
>> best candidate for NLP library for Chinese version of DBpedia
>> Spotlight.
>
> Please also include that on the proposal.
>
>>
>> I just read Jimmy's discussion about Russian support for DBpedia
>> Spotlight, as you pointed out. It seems that NLP library is the
>> prerequisite for internationalization of DBpedia Spotlight. Where
>> should we use this NLP library, in Spotting, Candidate Selection,
>> Disambiguation, Filtering or even /feedback?
>
> Spotting requires tokenization, and could also use POS tagging and entity
> boundary detection (e.g. NP chunking)
> CandSel should be mostly language independent, but could benefit from things
> like fuzzy matches, acronym expansion, etc.
> Disambiguation itself should be language independent, but the inner model
> (e.g. VSM) requires tokenization, stemming, etc.
> Linking idem
>
>> And how to use it?
>
> Well, that's your task. We can try to help by answering questions, but for
> that you have to study the tools/techniques, so that the questions arise.
>
>> I also
>> want to know what are the product of this sub task, e.g. some Chinese
>> data to be generated in Data Generation Workflow [1], some coding work
>> such as developing a Chinese Disambiguator, or any others?
>
> Mostly finding out which components are available for the tasks I mentioned
> above, and integrating them within DBpedia Spotlight to live alongside other
> languages. I don't need to mention testing. Could look into TAC KBP
> English-Chinese Entity Linking.
>
>> It would
>> help if you can show me how some other languages are enabled for
>> DBpedia Spotlight, such as French or German?
>
> We have people working on these, but most have not contributed code back to
> us yet. There are useful discussions in this list if you browse around a
> bit.
>
> Cheers
> Pablo
>
>
>>
>> Best regards,
>> Siwei Yu
>>
>> [1]
>> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/all/org/apache/lucene/analysis/cn/smart/hhmm/package-summary.html
>> [2] http://languagetool.wikidot.com/developing-chinese-rules
>> [3] http://code.google.com/p/ictclas4j/
>> [4]
>> http://sourceforge.net/mailarchive/forum.php?thread_name=CAHh9-xtF4-A8W2WST1mnHj2bugGyZtOOJp34mBGNPSWR7ZerFw%40mail.gmail.com&forum_name=dbp-spotlight-developers
>> [5] http://wiki.dbpedia.org/spotlight/technicaldocumentation?v=3qy
>>
>>
>>
>> On Wed, Mar 28, 2012 at 6:57 PM, Pablo Mendes <[email protected]>
>> wrote:
>> > Hi Siwei,
>> > Answer inline
>> >
>> > On Tue, Mar 27, 2012 at 6:14 PM, Siwei Yu <[email protected]> wrote:
>> >>
>> >> Dear Pablo,
>> >>
>> >> Thanks for your suggestion! I've studied the Apache Stanbol documents
>> >> pointed out by Rupert, especially those about EnhancementChain. Now I
>> >> have a big picture in my head about how to wrap DBpedia Spotlight
>> >> services and integrate them into Apache Stanbol. Although the details
>> >> of the integration need further discussion, I agree with you that
>> >> wrapping DBpedia Spotlight classes is too simple to fill this GSoC
>> >> project. As a student majoring in the Semantic Web, I'm very interested
>> >> in contributing to DBpedia Spotlight by adding the new /feedback
>> >> service this summer. It's greatly appreciated if you can help me
>> >> understand the details of /feedback.
>> >>
>> >> Let's take a simple example for discussion. Suppose we have a text to
>> >> be annotated: "I'm a taxpayer."
>> >> The four stages would make an EnhancementChain, which processes in
>> >> order like this:
>> >> (1) Spotting: "taxpayer" should be identified as a surface form.
>> >> (2) Candidate: it provides 3 candidates for  "taxpayer", which are:
>> >>  a) http://dbpedia.org/resource/Tax, with confidence of 0.3
>> >>  b) http://dbpedia.org/resource/TaxPayers%27_Alliance, with confidence
>> >> of
>> >> 0.2
>> >>  c) http://dbpedia.org/resource/The_Taxpayer_%28Luxembourg%29, with
>> >> confidence of 0.1
>> >> (3) Disambiguation: return a) as the best candidate, according to
>> >> confidence value
>> >> (4) Filtering: do nothing
>> >
>> >
>> >
>> > Your understanding of the steps is correct.
>> >
>> >
>> >>
>> >>
>> >> Do you think /feedback should take place between (2) and (3), or after
>> >> (4)? I think it's a service after (4).
>> >
>> >
>> >
>> >
>> > Right. I think  of /feedback as (5), while if you have something in
>> > (2.5) I
>> > would call it "manual disambiguation".
>> >
>> >
>> >>
>> >> The user finds that a) is not the
>> >> right annotation he expects. He regards it as a mistake. He may
>> >> provide the following correction information (i.e. the input
>> >> parameters of the /feedback service) to /feedback:
>> >> text=I'm a taxpayer. (yes, the text should be escaped as a url
>> >> parameter)
>> >> surface_form=taxpayer
>> >> incorrect_annotation=http://dbpedia.org/resource/Tax
>> >> correct_annotation=http://dbpedia.org/resource/TaxPayers%27_Alliance
>> >> After it receives the correction information, the /feedback service
>> >> stores it for future use. (Where and How to store? In Lucene index?)
>> >
>> >
>> > Also add a WebID for the user as a parameter.
>> > Also add a URI to identify this occurrence.
>> > You would store it in any database you would like, for example Redis.
>> > These
>> > examples will be available for DBpedia Spotlight and other Stanbol
>> > enhancers
>> > to pull.
>> > If we have a streaming indexing pipeline going, the correction could be
>> > fed
>> > straight into the index as well.
>> >
>> >>
>> >>
>> >> Does the above description match your idea? If so, what
>> >> information (XML/JSON/RDFa) would the /feedback service return to the
>> >> client? Just return boolean true for success of feedback storing, and
>> >> false for failure?
>> >
>> >
>> > Right. We return some HTTP code that indicates that the request failed
>> > or
>> > sufficed, with some message explaining what happened.
>> >
>> >>
>> >>
>> >> Additionally, we need a new /getFeedback service for querying the
>> >> feedback collected by the /feedback service. As you said, filtering
>> >> implementations could use feedback data from the /getFeedback
>> >> service to stop making the same mistakes.
>> >
>> >
>> >
>> > Well, that's what HTTP GET and POST are for. We could potentially think
>> > about PUT, DELETE, etc. But I think this is less interesting at this
>> > point.
>> >
>> >>
>> >> That means we
>> >> need to let the /filter service construct SPARQL query string to
>> >> filter the incorrect annotations from feedback data.
>> >
>> >
>> >
>> > Right. With /feedback in place, now /filter becomes even more
>> > interesting,
>> > because you have a set of positive and a set of negative examples. One
>> > could
>> > train a machine learning classifier that eventually stops making the
>> > same
>> > mistakes given enough examples.
>> >
>> >
>> >>
>> >> Am I in the right direction? If so, I'm not sure about what the input
>> >> and
>> >> output of
>> >> /getFeedback should be in this example. Can you provide me some hints?
>> >> I can not imagine how to deal with it in this example.
>> >
>> >
>> > You could think of /feedback as an endpoint providing access to your
>> > storage. You POST examples in and you GET examples out. One could do
>> > smart
>> > things when GETting examples:
>> > - you can get an example by URI
>> > - you can get all examples for a user
>> > - you can get all examples that "look like" your current example, where
>> > many
>> > implementations of "looks like" could be provided, including cosine
>> > similarity of TF-IDF weighted vectors.
>> >
>> > So, let's say my example now is: "Jane is a taxpayer", you could
>> > retrieve
>> > past examples that look like this one (e.g. "I am a taxpayer") and
>> > attempt
>> > not to fail again. Of course, this example is very limited, as there is
>> > very
>> > little useful information in that sentence. But if you think of a large
>> > document, and several feedback iterations, this thing starts to look
>> > very
>> > cool.
>> >
>> > In fact, your example illustrates something else that is very important.
>> > Sometimes, NONE of the alternatives is correct. So you have to
>> > standardize a
>> > way to tell the system that no annotation should be provided for that
>> > surface form (NA - Not to Annotate), or that none of the alternatives
>> > are
>> > correct (NIL - Not in KB).
>> >
>> > Now, including Stanbol wrappers, implementation of /feedback GET and
>> > POST,
>> > that is still not enough for a summer long project. You could perhaps
>> > include front-end issues: in which case you should interact with Mihály
>> > Héder as a mentor. We'd be interested in MediaWiki, Drupal, and
>> > Wordpress
>> > plugins, for example. Or if you are not a frontend guy, you can include
>> > proposals for /filter implementations. Probably the way to go with
>> > /filter
>> > is machine learning, but I'm open to hearing what else you would propose
>> > there.
>> >
>> >>
>> >>
>> >> Best regards,
>> >> Siwei Yu
>> >>
>> >>
>> >> On Fri, Mar 23, 2012 at 11:02 PM, Pablo Mendes <[email protected]>
>> >> wrote:
>> >> >
>> >> > Hi Siwei, (switching to dbp-spotlight-developers, as to avoid
>> >> > spamming
>> >> > users
>> >> > in dbp-spotlight-users)
>> >> > Please see answers below.
>> >> >
>> >> > On Fri, Mar 23, 2012 at 3:51 PM, Siwei Yu <[email protected]> wrote:
>> >> >>
>> >> >> Dear Pablo and Rupert,
>> >> >>
>> >> >> I'm sorry to post an incomplete email just now. Please ignore the
>> >> >> previous email.
>> >> >
>> >> >
>> >> > No problem. I figured it was an accidental ctrl+enter.
>> >> >
>> >> >>
>> >> >>
>> >> >> Thanks a lot for your instructions! According to your comments, let
>> >> >> me
>> >> >> summarise the current status of the service mapped to the four
>> >> >> stages:
>> >> >> (1) Spotting, (2) Candidate Selection, (3) Disambiguation, (4)
>> >> >> Filtering
>> >> >> /annotate: (1), (2), (3)first candidate, (4)
>> >> >> /candidate: (1), (2), (3)all candidate
>> >> >> /disambiguate: (3)
>> >> >> /feedback: not implemented
>> >> >> Please let me know if the previous summary is incorrect.
>> >> >
>> >> >
>> >> >
>> >> > Correct.
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> However, in Apache Stanbol each Enhancement Engine in an Enhancement
>> >> >> Chain handles a single task (Rupert, is that right?). The
>> >> >> functions of Enhancement Engines are not supposed to overlap.
>> >> >> We need to adjust the services of DBpedia Spotlight as follows:
>> >> >> /spot: (1), to be implemented in this project, for
>> >> >> DBpediaSpotlightSpotEngine
>> >> >
>> >> >
>> >> >
>> >> > It is likely that we will implement /spot for release v0.6, which may
>> >> > happen
>> >> > before GSoC starts.
>> >> >
>> >> >
>> >> >> /candidate: (2), to be refactored from current status, for
>> >> >> DBpediaSpotlightCandidateEngine
>> >> >> /disambiguate: (3), to be refactored from current status, for
>> >> >> DBpediaSpotlightDisambiguateEngine
>> >> >
>> >> >
>> >> >
>> >> > We would probably provide a wrapper, rather than a refactored
>> >> > version.
>> >> >
>> >> >
>> >> >>
>> >> >> /filter: (4), to be implemented in this project, for
>> >> >> DBpediaSpotlightFilterEngine
>> >> >> As to /annotate, I think it's a complicated service which is not
>> >> >> applicable for Apache Stanbol's "single task for each Enhancement
>> >> >> Engine" requirement. But we can retain it for DBpedia Spotlight for
>> >> >> other users (i.e. not for Apache Stanbol).
>> >> >
>> >> >
>> >> > Sounds like /annotate would be an enhancement chain.
>> >> >
>> >> >>
>> >> >> The /feedback API could be interesting, which I'd like to try to
>> >> >> implement. More details should be discussed beforehand. However, I'm
>> >> >> not sure there's enough time to complete it in this two-month
>> >> >> summer.
>> >> >
>> >> >
>> >> > I don't feel like wrapping DBpedia Spotlight classes is enough for a
>> >> > summer-long coding project.
>> >> > You should include the /feedback in your project to make it stronger.
>> >> > This API should take in feedback from any CMS, as Stanbol is
>> >> > CMS-agnostic.
>> >> > It should be able to store and later let engines query those, in
>> >> > order
>> >> > to
>> >> > learn from their mistakes.
>> >> > You could think, for example, about filtering implementations that
>> >> > would
>> >> > use
>> >> > feedback data to stop making the same mistakes.
>> >> > This is potentially the most interesting part for this project idea.
>> >> >
>> >> >
>> >> >>
>> >> >>
>> >> >> If the project scopes discussed above are generally OK, I'd like to
>> >> >> think about the project plan and come up with a project proposal
>> >> >> draft.
>> >> >>
>> >> >> By the way, I have two small questions for DBpedia Spotlight
>> >> >> Spotting
>> >> >> and Enhancement Chain:
>> >> >> 1. For Pablo, it's mentioned in [3] that there're three
>> >> >> implementations for Spotting: Ling Pipe Spotter, Trie Spotter, Ling
>> >> >> Pipe Chunk Spotter. How does /annotate determine which the best
>> >> >> implementation is, for a service request? Can the user choose among
>> >> >> them manually by sending different parameter(s)?
>> >> >
>> >> >
>> >> > We also have by now 4 other implementations. We have to update the
>> >> > documentation.
>> >> > Please see:
>> >> >
>> >> >
>> >> > http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/Mendes-Daiber-Rajapakse-Sasaki-Bizer-DBpediaSpotlight-LREC2012.pdf
>> >> >
>> >> >>
>> >> >> 2. For Rupert, could you please show me some examples of Enhancement
>> >> >> Chain? I've studied some Enhancement Engines here [1]. I can
>> >> >> understand how an individual Enhancement Engine works and how to
>> >> >> implement a new one. After studying [2], I find Enhancement Chain a
>> >> >> little confusing. Could you please lead me to the source code of the
>> >> >> implementation of a concrete Enhancement Chain? I want to know the
>> >> >> data I/O interface from one Enhancement Engine to another. In other
>> >> >> words, how do the output of an Enhancement Engine become the input
>> >> >> of
>> >> >> another one?
>> >> >>
>> >> >> Best regards,
>> >> >> Siwei Yu
>> >> >>
>> >> >> [1]
>> >> >>
>> >> >>
>> >> >> http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/list.html
>> >> >> [2] http://incubator.apache.org/stanbol/docs/trunk/enhancer/chains/
>> >> >> [3] http://wiki.dbpedia.org/spotlight/technicaldocumentation?v=3qy
>> >> >>
>> >> >> > On Wed, Mar 21, 2012 at 4:27 PM, Rupert Westenthaler
>> >> >> > <[email protected]> wrote:
>> >> >> >>
>> >> >> >> Hi Siwei Yu, Pablo
>> >> >> >>
>> >> >> >> see my comments inline. To make it better readable I also removed
>> >> >> >> the
>> >> >> >> parts of the mail that are not relevant to my comments.
>> >> >> >>
>> >> >> >> On Wed, Mar 21, 2012 at 12:01 AM, Pablo Mendes
>> >> >> >> <[email protected]>
>> >> >> >> wrote:
>> >> >> >> > On Tue, Mar 20, 2012 at 4:24 PM, Siwei Yu <[email protected]>
>> >> >> >> > wrote:
>> >> >> >> >> 2. Should I develop one Enhancement Engine containing three
>> >> >> >> >> services,
>> >> >> >> >> or three engines (i.e. each service as an engine)? It's maybe
>> >> >> >> >> related
>> >> >> >> >> to the service function granularity. What's your opinion?
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > We could have one engine for each task separately, and an
>> >> >> >> > enhancement
>> >> >> >> > chain
>> >> >> >> > should connect them together. We should also introduce a REST
>> >> >> >> > API
>> >> >> >> > /spot for
>> >> >> >> > (1). We could perhaps make /candidates implement only (2) and
>> >> >> >> > make
>> >> >> >> > /annotate
>> >> >> >> > accept a &verbose=on to act like the current /candidates does.
>> >> >> >> >
>> >> >> >> > Besides all of this reorganization that has to happen, Rupert
>> >> >> >> > is
>> >> >> >> > the
>> >> >> >> > guy
>> >> >> >> > from Stanbol that can help you position your application in
>> >> >> >> > that
>> >> >> >> > regard.
>> >> >> >> >
>> >> >> >>
>> >> >> >> I fully agree with that.
>> >> >> >>
>> >> >> >> Having separate EnhancementEngines for spotting, candidates
>> >> >> >> selection
>> >> >> >> and disambiguation would provide a lot of additional flexibility
>> >> >> >> to
>> >> >> >> experienced Stanbol users as they could even use parts of the
>> >> >> >> DBpedia
>> >> >> >> Spotlight functionalities within their existing enhancement
>> >> >> >> engines.
>> >> >> >>
>> >> >> >> The definition of a  DBpedia Spotlight EnhancementChain ensures
>> >> >> >> that
>> >> >> >> typical users can use Spotlight without the need to know the
>> >> >> >> inner
>> >> >> >> working. Users would just need to send enhancement requests to
>> >> >> >> "http://{host}:{port}/enhancer/chin/dbpedia"; assuming that the
>> >> >> >> DBpedia
>> >> >> >> Spotlight chain is called "dbpedia". There would even be the
>> >> >> >> possibility to make the Dbpedia Spotlight EnhancementChain the
>> >> >> >> default
>> >> >> >> enhancement chain so that requests to "/enhancer" would be
>> >> >> >> processed
>> >> >> >> by it.
>> >> >> >>
>> >> >> >> >>
>> >> >> >> >> By the way, my name is Siwei Yu. I have good knowledge of
>> >> >> >> >> semantic
>> >> >> >> >> technologies, such as RDF, OWL, SPARQL. I'm also familiar with
>> >> >> >> >> the
>> >> >> >> >> mainstream Java based RDF/OWL processing tools like owlapi,
>> >> >> >> >> Jena,
>> >> >> >> >> Sesame, AllegroGraph. I have strong Java coding skills with of
>> >> >> >> >> good
>> >> >> >> >> knowledge of the software design patterns. My research
>> >> >> >> >> background
>> >> >> >> >> meets the requirements very well. I believe it'll be a
>> >> >> >> >> wonderful
>> >> >> >> >> summer working with the DBpedia Spotlight community.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > It would be good if you leveraged some of your Semantic Web
>> >> >> >> > background in
>> >> >> >> > your application. The idea of a /feedback API, which receives
>> >> >> >> > corrections
>> >> >> >> > made by the users could fit well in this regard.
>> >> >> >> >
>> >> >> >>
>> >> >> >> A feedback API is also something that would be interesting for
>> >> >> >> the
>> >> >> >> Stanbol Enhancer.
>> >> >> >>
