h t wrote:
Where can I find an introduction to the algorithm below? Thanks.

I can't recall where I picked it up, but something like this:

Score terms by count and distribution. A term occurring 20 times in the same paragraph is not as important as a term occurring 20 times over 10 paragraphs. Similar terms affect each other's score (I used ngrams and I also detected abbreviations). Be sure to remove stop words using some term reduction algorithm. Language detection makes sense. Rank sentences by length-normalized term score. The top n sentences are your summary. If enough of these sentences are part of the same paragraph and the paragraph is small enough, that paragraph is your summary instead.
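A rough Python sketch of the steps above might look like the following. It is not the code I actually wrote back then, and several things are assumptions made for illustration: the tiny STOP_WORDS list, the function names, scoring a term as its count multiplied by the number of paragraphs it occurs in, and reading "enough of these sentences" as "all of the top n". Ngram similarity, abbreviation detection and language detection are left out.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop list; a real one would be per language.
STOP_WORDS = frozenset(
    "a an and are as at be by for from in is it of on or that the this to was with".split()
)


def tokenize(text):
    """Lowercase the text and return its words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def split_sentences(paragraph):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def term_scores(paragraphs):
    """Score each term by count and distribution.

    Assumption: score = total count * number of paragraphs containing the term,
    so 20 hits spread over 10 paragraphs outrank 20 hits in one paragraph.
    """
    counts, spread = Counter(), Counter()
    for paragraph in paragraphs:
        tokens = tokenize(paragraph)
        counts.update(tokens)
        spread.update(set(tokens))
    return {term: counts[term] * spread[term] for term in counts}


def summarize(text, n=2, max_paragraph_sentences=4):
    """Return the top n sentences, or their shared paragraph if it is small."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    scores = term_scores(paragraphs)

    ranked = []
    position = 0  # running sentence index, used to restore document order
    for p_idx, paragraph in enumerate(paragraphs):
        for sentence in split_sentences(paragraph):
            tokens = tokenize(sentence)
            if tokens:
                # Length-normalized score: mean term score of the sentence.
                score = sum(scores[t] for t in tokens) / len(tokens)
                ranked.append((score, position, p_idx, sentence))
            position += 1

    ranked.sort(key=lambda item: item[0], reverse=True)
    top = ranked[:n]

    # Assumption: "enough" means every top sentence sits in the same paragraph.
    home = {p_idx for _, _, p_idx, _ in top}
    if len(top) > 1 and len(home) == 1:
        paragraph = paragraphs[home.pop()]
        if len(split_sentences(paragraph)) <= max_paragraph_sentences:
            return paragraph

    return " ".join(s for _, _, _, s in sorted(top, key=lambda item: item[1]))
```

Calling summarize(document_text, n=3) returns either the three best sentences in document order or, when they all come from one short paragraph, that whole paragraph. The multiplication in term_scores is just the simplest weighting that rewards spread; something closer to tf-idf would probably do as well.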

I hope this helps.


    karl


"Very simple algorithmic solutions usually involve ranking top sentences
by looking at distribution of terms in sentences, paragraphs and the
whole document. I implemented something like this a couple of years back
that worked fairly well."



2008/2/29, Karl Wettin <[EMAIL PROTECTED]>:
[EMAIL PROTECTED] wrote:

If you want something from an index it has to be IN the index. So, store a
summary field in each document and make sure that field is part of the
query.
And how could one automatically create such a summary?
Taking the first 2 lines of a document does not always make much sense.
How does Google do this?

Google doesn't summarize; it highlights the parts that match the query. See
previous responses.

If you really want to summarize there are a number of more and less
scientific ways to figure out what's important and what's not.

Very simple algorithmic solutions usually involve ranking top sentences
by looking at distribution of terms in sentences, paragraphs and the
whole document. I implemented something like this a couple of years back
that worked fairly well.

Citeseer is a great source for papers on pretty much any IR related
subject: <http://citeseer.ist.psu.edu/cs?cs=1&q=text+summarization>



    karl


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




