On 8/18/11 1:17 AM, Boris Galitsky wrote:

Hello


Attached are three packages that make up the current version of our proposed contribution: a syntactic match / text relevance component for OpenNLP.

Did everyone get the attachments? We usually use Jira for this, because mail attachments used to be removed when posted here. I am not sure why I received them anyway.

I suggest that you also open a Jira issue for this contribution and attach the zip files to it.

Here is the link to it:
https://issues.apache.org/jira/browse/OPENNLP

To start looking at it, please go to SyntMatcherTest.java and see how the commonality between sentences is computed. Then you can go to ParseTreeChunkTest.java and see how the operation of syntactic generalization is applied to particular chunks.

As an application, we selected the problem of content generation where relevance is critical. Please go to "RelatedSentenceFinder" and see which sentences might serve as seeds for content generation. The system goes out on the web, finds sentences that are somewhat relevant to the seed ones, and tries to "write an article".

For examples of articles auto-generated with this technology, please see:
http://www.allvoices.com/contributed-news/9423860-best-things-to-do-in-san-francisco-jazz-and-blues-festival

http://www.allvoices.com/contributed-news/9415063-britney-spears-femme-fatale-in-north-sf-bay-area

http://www.allvoices.com/contributed-news/9381803-cirque-du-soleil-quidam
These articles were generated using the class RelatedSentenceFinder.java.

Hence the proposed structure of our contribution:

package opennlp.tools.similarity, main and test: implementation of syntactic match
package opennlp.tools.similarity.apps: the content generation app leveraging syntactic match for sentence-level similarity
package opennlp.tools.similarity.apps.utils: utils for the above.

What needs to be done before the contribution can be fully considered:
1) make it use the latest OpenNLP (it currently uses a modified version of OpenNLP from 2008, which is nonetheless pretty stable and has been working for 2 years in industrial settings)
2) fix all tests, add more tests
3) clean the implementation and application code
4) add more applications to show more working scenarios of syntactic match
5) in addition to academic papers, have better docs for developers


We have a sandbox where it could live for a while until it is ready to be released together with the current head code. I would suggest moving it there; we would then have a good chance of releasing it with one of the upcoming 1.5 series releases or with 1.6.

Would that work for you?

I will have a look at the code tomorrow.

Jörn
