[ https://issues.apache.org/jira/browse/SOLR-3975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13482972#comment-13482972 ]
Otis Gospodnetic commented on SOLR-3975:
----------------------------------------

Nice, 170KB patch there Lance! :) I see lots of classes don't have ASL btw.

> Document Summarization toolkit, using LSA techniques
> ----------------------------------------------------
>
>                 Key: SOLR-3975
>                 URL: https://issues.apache.org/jira/browse/SOLR-3975
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: 4.1.summary.patch, reuters.sh
>
>
> This package analyzes sentences and words as used across sentences to rank
> the most important sentences and words. The general topic is called "document
> summarization" and is a popular research topic in textual analysis.
>
> How to use:
> 1) Check out the 4.x branch, apply the patch, build, and run the solr/example
> instance.
> 2) Download the first Reuters article corpus from:
> http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
> 3) Unpack this into a directory.
> 4) Run the attached 'reuters.sh' script:
> sh reuters.sh directory http://localhost:8983/solr/collection1
> 5) Wait several minutes.
>
> Now go to http://localhost:8983/solr/collection1/browse?summary=true and look
> at the large gray box marked 'Document Summary'. This has a table of
> statistics about the analysis, the three most important sentences, and
> several of the most important words in the documents. The sentences have the
> important words in italics.
>
> The code is packaged as a search component and as an analysis handler. The
> /browse demo uses the search component, and you can also post raw text to
> http://localhost:8983/solr/collection1/analysis/summary. Here is a sample
> command:
> {code}
> curl -s \
>   "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
>   --data-binary @$FILE -H 'Content-type:application/xml'
> {code}
>
> This is an implementation of LSA-based document summarization.
> A short explanation and a long evaluation are described in my blog,
> [Uncle Lance's Ultra Whiz Bang|http://ultrawhizbang.blogspot.com], starting here:
> [http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]
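
For readers who have not seen LSA-based summarization before, the idea in the issue description (rank sentences and words by how they are used across sentences) can be illustrated with a small sketch. This is not the code from 4.1.summary.patch, and nothing here reflects its API. It is a simplified stand-in that assumes raw term counts, a naive sentence splitter, and only the single leading singular vector pair of the term-by-sentence matrix, found by power iteration in place of a full SVD. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphabetic tokens only; no stemming or stop-word removal."""
    return re.findall(r"[a-z]+", sentence.lower())


def build_matrix(sentences):
    """Term-by-sentence matrix: rows are terms, columns are sentences, cells are counts."""
    counts = [Counter(tokenize(s)) for s in sentences]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def leading_singular_vectors(matrix, iterations=100):
    """Power iteration on A^T A; returns (term weights, sentence weights).

    The sentence weights approximate the leading right singular vector of A.
    The term weights are A times that vector, i.e. the leading left singular
    vector scaled by the singular value, which is enough for ranking.
    """
    n_terms, n_sents = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_sents)] * n_sents
    for _ in range(iterations):
        u = [sum(row[j] * v[j] for j in range(n_sents)) for row in matrix]  # A v
        w = [sum(matrix[i][j] * u[i] for i in range(n_terms))               # A^T u
             for j in range(n_sents)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(row[j] * v[j] for j in range(n_sents)) for row in matrix]
    return u, v


def summarize(text, n_sentences=1, n_words=3):
    """Return (top sentences in document order, top words by weight)."""
    sentences = split_sentences(text)
    if not sentences:
        return [], []
    vocab, matrix = build_matrix(sentences)
    if not vocab:
        return [], []
    term_w, sent_w = leading_singular_vectors(matrix)
    top_s = sorted(range(len(sentences)), key=lambda j: -sent_w[j])[:n_sentences]
    top_w = sorted(range(len(vocab)), key=lambda i: -term_w[i])[:n_words]
    return [sentences[j] for j in sorted(top_s)], [vocab[i] for i in top_w]


if __name__ == "__main__":
    sample = "Solr indexes documents. Solr searches documents quickly. Cats sleep."
    print(summarize(sample, n_sentences=1, n_words=2))
```

Sentences that share vocabulary with many other sentences pull the leading singular vector toward themselves, so they rank highest, and the words they share rank highest in turn. A fuller LSA treatment would keep several singular vectors and apply term weighting; the sketch keeps one only to stay short.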