OK, now I know count.pl is concerned about the order of the tokens in
the n-gram. That helps a lot, but I'm still getting a discrepancy
which probably reflects that (1) I can't get n-gram to do what I want
yet; or (2) n-gram isn't the right tool for the job.

There is this text file I created from the Gnome User Guide; the most
frequent bigram is:
this<>option<>134 283 187

AFAIK, it means the bigram occurs 134 times in the file. But it
doesn't represent the occurrences of "this option", because:

$ grep "this option" file | wc --lines
127
$ grep this file | grep option | wc --lines
134

I need a list of /some[^\w]+stuff/ (the two words next to each other),
not /some.*stuff/ (both words anywhere on the line). (Did I write it right?)
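
To make the difference concrete, here is a minimal Python sketch. It is not part of NSP, and the function name and sample text are made up for illustration. It counts three things separately: lines that contain the adjacent phrase (roughly what the first grep does), every adjacent occurrence even when a line break falls between the two words, and lines that merely contain both words (the second grep). grep counts lines rather than occurrences and never matches across a line break, which may account for part of the gap.

```python
import re

def phrase_counts(text, first, second, ignore_case=False):
    """Compare three ways of counting a two-word phrase (illustrative only)."""
    flags = re.IGNORECASE if ignore_case else 0
    # Adjacent: the two words separated only by non-word characters
    # (spaces, punctuation, and also newlines when run on the whole text).
    adjacent = re.compile(
        r"\b%s\W+%s\b" % (re.escape(first), re.escape(second)), flags)
    word1 = re.compile(r"\b%s\b" % re.escape(first), flags)
    word2 = re.compile(r"\b%s\b" % re.escape(second), flags)

    lines = text.splitlines()
    return {
        # Lines in which the phrase appears with both words on that line.
        "lines_adjacent": sum(1 for ln in lines if adjacent.search(ln)),
        # Every adjacent occurrence in the whole text, across line breaks.
        "occurrences": len(adjacent.findall(text)),
        # Lines containing both words anywhere, in any order.
        "lines_both": sum(
            1 for ln in lines if word1.search(ln) and word2.search(ln)),
    }

SAMPLE = (
    "Select this option to save.\n"
    "Use this\n"
    "option with care; this option, this option.\n"
    "The option in this dialog.\n"
    "This option is new.\n"
)

if __name__ == "__main__":
    print(phrase_counts(SAMPLE, "this", "option"))
    print(phrase_counts(SAMPLE, "this", "option", ignore_case=True))
```

Passing ignore_case=True also shows how much of the difference comes from capitalised forms such as "This option".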

BTW, is there any way to make NSP case-insensitive?

These are the other 7 lines including both words:



This option is only available for DocBook documentation. Legal notices
and documentation contributors are usually listed in this section.

The options that are available in this dialog have the following functions:

To execute actions other than the default action for a file, select
the file that you want to perform an action on. In the File menu you
will either have "Open with" choices, or an Open With submenu. Select
the desired option from this list.

The options shown in this tabbed section depend on the X windowing
system you are using. Not all the following options might be listed on
your system, and not all the options shown might work on your system.

Use these options to add the Euro currency symbol to a key as a
third-level character. To access this symbol, you must assign a third
level chooser.

Use this group of options to set the location of the Ctrl key to match
the layout on older keyboards.

To reverse the handedness of your mouse device, start the Mouse
Preferences, then select the options that you require. If you do
reverse the handedness of your mouse device, then you must reverse the
mouse button conventions used in this manual. See  for more
information about setting your mouse preferences.



Thanks!

Leonardo

2006/5/1, Saiyam Kohli <[EMAIL PROTECTED]>:
> Hi,
>
> You can use count.pl to get a list of the most frequent expressions
> (bigrams) in a text, sorted in descending order of frequency. To use
> count.pl you need to give the following command:
>
> count.pl [OPTIONS] DESTINATION SOURCE
>
> Where <SOURCE> is the text file from which you want to obtain the list
> of frequent expressions. The <DESTINATION> file will contain the
> output generated by count.pl. Here is what the output of count.pl
> generally looks like:
>
> 1922
> and<>conveniences<>9 63 11
> luxuries<>and<>9 12 63
> to<>be<>8 63 16
> conveniences<>of<>7 11 44
> ,<>the<>7 72 48
> the<>luxuries<>6 48 12
> of<>the<>6 44 48
> strong<>and<>6 8 63
> and<>independent<>6 63 8
> contemporary<>life<>5 6 25
> independent<>individuals<>5 8 6
> from<>developing<>5 6 5
> .<>In<>5 79 5
> people<>from<>5 23 6
> into<>truly<>5 6 6
> prevent<>people<>5 6 23
> ....
> ...
> ...
>
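> As you can see, the output contains the most frequent bigrams and the
> number of times each occurs (the first number after each bigram). The
> second and third numbers are the marginal counts used by the
> statistical measures: how often the first word occurs as the first
> word of any bigram, and how often the second word occurs as the second
> word of any bigram. The first line of this file gives the total number
> of bigrams in the file.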
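
A short Python sketch of reading this format, in case it helps with post-processing. It assumes bigram output exactly as in the listing above (a total on the first line, then token<>token<>three counts); the function name and the n11/n1p/np1 labels are illustrative choices, not anything defined by count.pl.

```python
def parse_count_output(text, sep="<>"):
    """Parse bigram output shaped like the count.pl listing quoted above.

    Returns (total_bigrams, rows) where each row is
    (tokens, n11, n1p, np1): the bigram's own count followed by the
    two marginal counts. Assumes exactly three counts per line.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    total = int(lines[0])  # first line: total number of bigrams
    rows = []
    for ln in lines[1:]:
        # Everything before the last separator is tokens; the rest is counts.
        *tokens, counts = ln.split(sep)
        n11, n1p, np1 = (int(x) for x in counts.split())
        rows.append((tuple(tokens), n11, n1p, np1))
    return total, rows

SAMPLE_COUNTS = "1922\nand<>conveniences<>9 63 11\n,<>the<>7 72 48\n"

if __name__ == "__main__":
    total, rows = parse_count_output(SAMPLE_COUNTS)
    print(total)
    for tokens, n11, n1p, np1 in rows:
        print(" ".join(tokens), n11, n1p, np1)
```
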
>
> You can also use one of the statistical measures in statistic.pl to
> get a list of bigrams sorted in the order of their statistical
> importance. I believe for your purpose the output from statistic.pl
> would be more relevant. To use statistic.pl you need to give the
> following command:
>
> statistic.pl STATISTIC_LIBRARY DESTINATION SOURCE
>
> Here <STATISTIC_LIBRARY> is the statistical measure you want to
> use to compute the relevance of a bigram, and <SOURCE> is the output
> file generated by count.pl (so you need the output of count.pl to run
> statistic.pl). The output of statistic.pl looks like this:
>
> 1922
> from<>developing<>1 64.0971 5 6 5
> developing<>into<>1 64.0971 5 5 6
> into<>truly<>2 58.6914 5 6 6
> independent<>individuals<>3 53.5152 5 8 6
> truly<>strong<>3 53.5152 5 6 8
> and<>conveniences<>4 52.5168 9 63 11
> luxuries<>and<>5 49.5092 9 12 63
> entirely<>harmless<>6 44.7704 3 3 3
> as<>well<>7 41.3420 4 13 4
> people<>from<>8 40.0310 5 23 6
> prevent<>people<>8 40.0310 5 6 23
> conveniences<>of<>9 39.7651 7 11 44
> contemporary<>life<>10 39.0979 5 6 25
> human<>beings<>11 38.0403 3 5 3
> ...
> ...
>
> This file contains the same expressions, but sorted in order of their
> significance. The first number after each bigram is its rank and the
> second is the score given by the statistical measure used, which in
> this case is the log-likelihood (ll) measure. The remaining three
> numbers are the counts from count.pl.
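
To see how a score like this relates to the counts, here is a hedged Python sketch of the standard log-likelihood ratio (G^2) for a 2x2 contingency table. It is an assumption that this matches what the ll measure computes internally, and the function and variable names are illustrative. Worked by hand for from<>developing (5 6 5, with 1922 bigrams in total) it gives about 64.097, which agrees with the listing above.

```python
import math

def log_likelihood(n11, n1p, np1, npp):
    """Log-likelihood ratio (G^2) for a bigram's 2x2 contingency table.

    n11: count of the bigram itself
    n1p: count of bigrams with the first word in first position
    np1: count of bigrams with the second word in second position
    npp: total number of bigrams
    """
    # Fill in the remaining observed cells and marginals.
    n12 = n1p - n11
    n21 = np1 - n11
    n22 = npp - n1p - np1 + n11
    n2p = npp - n1p
    np2 = npp - np1

    observed = [n11, n12, n21, n22]
    # Expected counts if the two words were independent.
    expected = [
        n1p * np1 / npp,
        n1p * np2 / npp,
        n2p * np1 / npp,
        n2p * np2 / npp,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

if __name__ == "__main__":
    print(round(log_likelihood(5, 6, 5, 1922), 4))  # from<>developing
    print(round(log_likelihood(5, 6, 6, 1922), 4))  # into<>truly
```
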
>
>
> You can further improve the results by removing stop words such as
> 'is', 'the', 'for', 'of', 'a', etc., by using a stoplist with
> count.pl. If you need to use this option please feel free to contact
> me and I will mail you a list of stop words.
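
For a rough idea of the effect, the sketch below filters an existing listing after the fact. This is not how count.pl applies a stoplist; it is a post-processing illustration in Python, the five-word stoplist is a hypothetical stand-in taken from the examples above, and dropping a bigram when either word is a stop word is just one possible policy.

```python
# Hypothetical, tiny stoplist built from the examples mentioned above.
STOPWORDS = {"is", "the", "for", "of", "a"}

def drop_stopword_bigrams(lines, stopwords=STOPWORDS, sep="<>"):
    """Keep only output lines whose tokens contain no stop word."""
    kept = []
    for ln in lines:
        # The last field holds the counts; everything before it is tokens.
        *tokens, _counts = ln.split(sep)
        if not any(tok.lower() in stopwords for tok in tokens):
            kept.append(ln)
    return kept

SAMPLE_LINES = [
    "of<>the<>6 44 48",
    "contemporary<>life<>5 6 25",
    "conveniences<>of<>7 11 44",
]

if __name__ == "__main__":
    for ln in drop_stopword_bigrams(SAMPLE_LINES):
        print(ln)
```
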
>
> If you think that this is what you need and plan to use NSP, I
> strongly suggest that you read the README and other documentation for
> count.pl and statistic.pl to get a better understanding of how NSP
> works.
>
> If you have any other questions or comments, please feel free to contact me directly.
>
>
> Saiyam Kohli,
> Graduate Student,
> University of Minnesota Duluth.
>
>
>
> --
> -----------------------------------------------------------------------------------------------------------------------
> Yesterday is gone. Tomorrow has not yet come. We have only today. Let
> us begin. -- Mother Teresa
>

