Lance Norskog created SOLR-3975:
-----------------------------------

             Summary: Document Summarization toolkit, using LSA techniques
                 Key: SOLR-3975
                 URL: https://issues.apache.org/jira/browse/SOLR-3975
             Project: Solr
          Issue Type: New Feature
            Reporter: Lance Norskog
            Priority: Minor
         Attachments: 4.1.summary.patch, reuters.sh

This package analyzes how words are used across the sentences of a document in 
order to rank the most important sentences and words. The general topic is 
called "document summarization" and is a popular research topic in textual 
analysis. 

How to use:
1) Check out the 4.x branch, apply the patch, build, and run the solr/example 
instance.
2) Download the first Reuters article corpus from:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
3) Unpack this into a directory.
4) Run the attached 'reuters.sh' script:
sh reuters.sh directory http://localhost:8983/solr/collection1
5) Wait several minutes.

Now go to http://localhost:8983/solr/collection1/browse?summary=true and look 
at the large gray box marked 'Document Summary'. This has a table of statistics 
about the analysis, the three most important sentences, and several of the most 
important words in the documents. Within each sentence, the important words 
appear in italics.

The code is packaged as a search component and as an analysis handler. The 
/browse demo uses the search component, and you can also post raw text to  
http://localhost:8983/solr/collection1/analysis/summary. Here is a sample 
command:
curl -s "http://localhost:8983/solr/analysis/summary?indent=true&echoParams=all&file=$FILE&wt=xml" \
  --data-binary @$FILE -H 'Content-type:application/xml'
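
The same request can also be built from a script. The following is a minimal 
sketch using only Python's standard library. The function names are made up 
for illustration; the URL, query parameters, and content type are copied from 
the curl command, so adjust them if your handler is registered elsewhere.

```python
import urllib.parse
import urllib.request

# Handler URL copied from the sample curl command; adjust for your install.
SUMMARY_URL = "http://localhost:8983/solr/analysis/summary"


def build_summary_request(path, body, url=SUMMARY_URL):
    """Build (but do not send) the POST that the curl command performs.

    Hypothetical helper: `path` fills the `file` parameter and `body`
    is the raw bytes that curl would send with --data-binary.
    """
    params = urllib.parse.urlencode({
        "indent": "true",
        "echoParams": "all",
        "file": path,
        "wt": "xml",
    })
    return urllib.request.Request(
        url + "?" + params,
        data=body,
        headers={"Content-type": "application/xml"},
        method="POST",
    )


def post_for_summary(path):
    """Send a file to a running Solr instance and return the raw reply text."""
    with open(path, "rb") as f:
        request = build_summary_request(path, f.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling post_for_summary("article.xml") needs the patched example instance to 
be running; build_summary_request alone can be inspected offline.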

This is an implementation of LSA-based document summarization. A short 
explanation and a long evaluation are described in my blog, [Uncle Lance's 
Ultra Whiz Bang|http://ultrawhizbang.blogspot.com], starting here: 
[http://ultrawhizbang.blogspot.com/2012/09/document-summarization-with-lsa-1.html]
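
For readers new to the idea, here is a rough sketch of the simplest form of 
LSA-style ranking. It is not the code in the patch. It builds a word-by-sentence 
count matrix and uses power iteration to find the dominant pair of singular 
vectors, which is the rank-1 case of LSA; a fuller implementation would keep 
several singular vectors and would likely weight the counts. All names here 
are hypothetical.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def rank_sentences_and_words(sentences, iterations=100):
    """Rank sentences and words by the dominant singular vectors of the
    word-by-sentence count matrix (rank-1 LSA, an illustrative simplification).

    Returns (sentence indexes, words), each ordered from most to least important.
    """
    tokens = [tokenize(s) for s in sentences]
    vocab = sorted({w for toks in tokens for w in toks})
    row = {w: i for i, w in enumerate(vocab)}

    # counts[i][j] = number of times word i occurs in sentence j
    counts = [[0.0] * len(sentences) for _ in vocab]
    for j, toks in enumerate(tokens):
        for w in toks:
            counts[row[w]][j] += 1.0

    # Power iteration: project sentence scores onto words, then back onto
    # sentences, renormalizing each round so the values stay bounded.
    sent = [1.0] * len(sentences)
    word = [0.0] * len(vocab)
    for _ in range(iterations):
        word = [sum(c * s for c, s in zip(r, sent)) for r in counts]
        sent = [sum(counts[i][j] * word[i] for i in range(len(vocab)))
                for j in range(len(sentences))]
        norm = math.sqrt(sum(x * x for x in sent)) or 1.0
        sent = [x / norm for x in sent]

    sentence_order = sorted(range(len(sentences)), key=lambda j: -sent[j])
    word_order = sorted(vocab, key=lambda w: -word[row[w]])
    return sentence_order, word_order
```

A sentence scores highly when it contains words that occur in other 
high-scoring sentences, and a word scores highly when it occurs in 
high-scoring sentences, so a sentence that shares no vocabulary with the rest 
of the document sinks to the bottom.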



