On 8/18/11 1:17 AM, Boris Galitsky wrote:
Hello
Attached are three packages that make up the current version of our
proposed contribution of a syntactic match / text relevance component
for OpenNLP.
Did everyone get the attachments? Usually we use Jira for this, because
mail attachments used to be removed
when posted here. Not sure why I got them anyway.
I suggest that you additionally open a Jira issue for this contribution
and then attach the zip files to it.
Here is the link to it:
https://issues.apache.org/jira/browse/OPENNLP
To start looking at it, please go to SyntMatcherTest.java and see
how the commonality between sentences is computed.
Then you can go to ParseTreeChunkTest.java and see how the operation
of syntactic generalization is applied to particular chunks.
As an application, we selected the problem of content generation when
relevance is critical.
Please go to "RelatedSentenceFinder" and see which sentences might
serve as seeds for content generation.
The system goes on the web, finds sentences that are somewhat relevant
to the seed ones, and tries to "write an article".
For examples of articles auto-generated using this technology, please see:
http://www.allvoices.com/contributed-news/9423860-best-things-to-do-in-san-francisco-jazz-and-blues-festival
http://www.allvoices.com/contributed-news/9415063-britney-spears-femme-fatale-in-north-sf-bay-area
http://www.allvoices.com/contributed-news/9381803-cirque-du-soleil-quidam
These articles were generated using the class
RelatedSentenceFinder.java
Hence the proposed structure of our contribution:
package opennlp.tools.similarity, main and test: implementation of
syntactic match
package opennlp.tools.similarity.apps: the content generation app
leveraging syntactic match for sentence-level similarity
package opennlp.tools.similarity.apps.utils: utils for the above.
What needs to be done before the contribution can be fully
considered:
1) make it use the latest OpenNLP (it currently uses a modified version
of OpenNLP from 2008, which is nevertheless pretty stable and has been
working for 2 years in industrial settings)
2) fix all tests, add more tests
3) clean up the implementation and application code
4) add more applications to show more working scenarios of syntactic match
5) in addition to academic papers, have better docs for developers
We have a sandbox where it could live for a while until it is ready to
be released together with the current head code. I would suggest moving
it there, and then maybe we have a good chance to release it with one of
the coming 1.5 series releases or 1.6.
Would that work for you?
I will have a look at the code tomorrow.
Jörn