> Dawid Weiss wrote:
> > You could also try splitting the document into paragraphs and use Carrot2's
> > Lingo algorithm (www.carrot2.org) on a paragraph-level to extract clusters.
> > Labelling routine in Lingo should extract 'key' phrases; this analysis is
> > heavily frequency-based, but... you know, you may want to try it.
>
> Just to make sure I'm following...
>
> So you're suggesting splitting the document into paragraphs, then
> treating each paragraph as if it were a Carrot2 search result,
> performing the clustering, then looking at the label Lingo chooses for
> each cluster, and treating that label as the "key phrase"?

I tried it.  The results were not great, but perhaps I'm doing it wrong.
Here's my code.  The input is a text file with one paragraph per (long)
line, each line starting with an ID number -- standard textual paragraphs.
I'm running it on a corpus of technical papers.

Bill

-----------------------------------------------------------------
import org.carrot2.filter.lingo.common.*;
import org.carrot2.filter.lingo.lsicluster.*;
import java.io.*;

public class test {

    public static void main (String[] argv) {

        if (argv.length < 1) {
            System.err.println("usage: java test <paragraph-file>");
            return;
        }

        DefaultClusteringContext context = new DefaultClusteringContext();

        // Read one "<id> <paragraph text>" record per line and add each
        // paragraph to the clustering context as if it were a search result.
        try (BufferedReader r = new BufferedReader(new FileReader(argv[0]))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                // there must be an easier way to split off the first token
                // of a line...
                if (parts.length > 1) {
                    String id = parts[0];
                    // and to glue the other parts together again...
                    String body = parts[1];
                    for (int i = 2;  i < parts.length;  i++) {
                        body = body + " " + parts[i];
                    }
                    context.addSnippet(new Snippet(id, "", body));
                }
            }
        } catch (Exception x) {
            x.printStackTrace(System.err);
        }

        // No query: the paragraphs are clustered on their own text.
        context.setQuery("");
        Cluster[] clusters = context.cluster();

        // Print each cluster's labels (the candidate key phrases),
        // followed by the paragraphs assigned to it.
        for (int i = 0;  i < clusters.length;  i++) {
            System.out.println("Cluster --");
            String[] labels = clusters[i].getLabels();
            for (int j = 0;  j < labels.length;  j++) {
                System.out.println("    Label: " + labels[j]);
            }
            Snippet[] snippets = clusters[i].getSnippets();
            System.out.println("    " + snippets.length + " snippets:");
            for (int j = 0;  j < snippets.length;  j++) {
                System.out.println("        " + snippets[j].getSnippetId()
                                   + " -- " + snippets[j].getText());
            }
        }
    }
}
----------------------------------------------------------------
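
On the question in the code comments about an easier way to split off the
first token of a line: a minimal sketch, assuming the ID never contains
whitespace. It uses the two-argument String.split(regex, limit) from the
standard library; with a limit of 2 the line is split at most once, so the
rest of the paragraph comes back in one piece and nothing needs to be glued
together again. The ParagraphLine class and its parse method are hypothetical
names used only for illustration; they are not part of Carrot2.

```java
// Hypothetical helper, not part of Carrot2: holds one "<id> <body>" record.
final class ParagraphLine {
    final String id;
    final String body;

    private ParagraphLine(String id, String body) {
        this.id = id;
        this.body = body;
    }

    // Splits a line into its leading ID token and the remaining text.
    // Returns null when the line is blank or has an ID but no body.
    static ParagraphLine parse(String line) {
        // Limit 2: split only at the first run of whitespace.
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            return null;
        }
        return new ParagraphLine(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        ParagraphLine p = ParagraphLine.parse("42 Some paragraph text.");
        System.out.println(p.id + " -- " + p.body);
    }
}
```

With a helper like this, the body of the read loop would shrink to a parse
call, a null check, and context.addSnippet(new Snippet(p.id, "", p.body)).
Unlike the loop above, whitespace inside the paragraph is left as it was in
the file instead of being collapsed to single spaces.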