Re: [ngram] Re: Another "Can you do this with NSP/Ngram"

ted pedersen Mon, 01 May 2006 15:06:27 -0700

Greetings Leonardo,

A few misc comments...

> There is this text file I created from the Gnome User Guide; the most
> frequent bigram is:
> this<>option<>134 283 187
>
> AFAIK, it means the bigram occurs 134 times in the file.

Correct! It also means that 'this' occurs 283 times as the first word in
any bigram, and 'option' occurs 187 times as the second word in any
bigram.
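
As a rough illustration of how to read such a line, here is a minimal Python sketch. It is not part of NSP; the function name and the dictionary keys are made up for this example, and it assumes that no token contains the "<>" separator itself.

```python
def parse_bigram_line(line):
    """Split one count.pl bigram line into its words and its three counts."""
    # Words are separated by '<>'; the last field holds the space-separated counts.
    *words, counts = line.strip().split("<>")
    joint, as_first, as_second = (int(n) for n in counts.split())
    return {
        "words": tuple(words),   # ('this', 'option')
        "joint": joint,          # times the bigram itself occurs
        "as_first": as_first,    # times 'this' is the first word of any bigram
        "as_second": as_second,  # times 'option' is the second word of any bigram
    }

print(parse_bigram_line("this<>option<>134 283 187"))
```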

> But it doesn't represent the occurrences of "this option", because:
>
> $ grep "this option" file | wc --lines
> 127

I think your grep command and NSP are doing two different things.
Unless you have used the --newLine option with count.pl, NSP will
count across line boundaries. In other words, it would find

I will get this
option installed
today

as one occurrence of 'this option', while the grep command does not.
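
To make the difference concrete, here is a small Python sketch (not NSP code). It uses a simplified \w+ tokenizer as an assumption, and the function names are invented for this example. It counts the bigram once with line breaks treated as plain whitespace, once with each line tokenized separately, and once the way grep piped into wc --lines does it.

```python
import re

text = "I will get this\noption installed\ntoday\n"

def count_bigram_across_lines(text, w1, w2):
    # Tokenize the whole text at once, so a line break is just whitespace.
    tokens = re.findall(r"\w+", text)
    return sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (w1, w2))

def count_bigram_within_lines(text, w1, w2):
    # Tokenize each line separately, so no bigram can span a line break.
    total = 0
    for line in text.splitlines():
        tokens = re.findall(r"\w+", line)
        total += sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (w1, w2))
    return total

def count_matching_lines(text, phrase):
    # Counts lines that contain the phrase, not occurrences of the phrase.
    return sum(1 for line in text.splitlines() if phrase in line)

print(count_bigram_across_lines(text, "this", "option"))  # 1
print(count_bigram_within_lines(text, "this", "option"))  # 0
print(count_matching_lines(text, "this option"))          # 0
```

Note also that counting matching lines reports a line only once even if the phrase appears in it twice, and that a substring match would also accept something like 'this options'.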

> $ grep this file | grep option | wc --lines
> 134
>
> I need a list of /some^\wstuff/, not /some.*stuff/. (did I write it right?)

So you want only contiguous occurrences of 'this option', that is
without any intermediate or intervening words? That is what you should
be getting. The default window size is 2, and that means that all
words that are being counted in a bigram must occur consecutively.

> BTW, is there any way to make nsp case-insensitive?

No, NSP pretty much takes the text 'as is' with respect to case. So
you would just need to modify the text to be all lower or upper
case. Same thing if you were interested in stemming, that is something
you would need to do prior to count.pl taking over. NSP does support
the removal of stop words though...
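
For the case question, one way to do that preprocessing is sketched below in Python. This is only an illustration, not something NSP provides; the function names and the file names in the comment are hypothetical.

```python
def fold_case(text):
    """Return text lowercased, so 'This option' and 'this option' count together."""
    return text.lower()

def lowercase_file(src_path, dst_path, encoding="utf-8"):
    """Write a lowercased copy of src_path to dst_path."""
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            dst.write(fold_case(line))

# Example with hypothetical file names:
# lowercase_file("guide.txt", "guide.lower.txt")
# Then give guide.lower.txt to count.pl instead of guide.txt.
```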

>
> These are the other 7 lines including both words:

This is interesting, but I really think this might be a coincidence. I
think if you run count.pl with --newLine you will find that the
count agrees with grep above.

I hope this helps a bit!

Cordially,
Ted

> This option is only available for DocBook documentation. Legal notices
> and documentation contributors are usually listed in this section.
>
> The options that are available in this dialog have the following functions:
>
> To execute actions other than the default action for a file, select
> the file that you want to perform an action on. In the File menu you
> will either have "Open with" choices, or an Open With submenu. Select
> the desired option from this list.
>
> The options shown in this tabbed section depend on the X windowing
> system you are using. Not all the following options might be listed on
> your system, and not all the options shown might work on your system.
>
> Use these options to add the Euro currency symbol to a key as a
> third-level character. To access this symbol, you must assign a third
> level chooser.
>
> Use this group of options to set the location of the Ctrl key to match
> the layout on older keyboards.
>
> To reverse the handedness of your mouse device, start the Mouse
> Preferences, then select the options that you require. If you do
> reverse the handedness of your mouse device, then you must reverse the
> mouse button conventions used in this manual. See for more
> information about setting your mouse preferences.
>
>
>
> Thanks!
>
> Leonardo
>
> 2006/5/1, Saiyam Kohli <[EMAIL PROTECTED]>:
> > Hi,
> >
> > You can use count.pl to get a list of the most frequent
> > expressions (bigrams) in a text, sorted in descending order. To use
> > count.pl you need to give the following command:
> >
> > count.pl [OPTIONS] DESTINATION SOURCE
> >
> > Where <SOURCE> is the text file from which you want to obtain the list
> > of frequent expressions. The <DESTINATION> file will contain the
> > output generated by count.pl. Here is what the output of count.pl
> > generally looks like:
> >
> > 1922
> > and<>conveniences<>9 63 11
> > luxuries<>and<>9 12 63
> > to<>be<>8 63 16
> > conveniences<>of<>7 11 44
> > ,<>the<>7 72 48
> > the<>luxuries<>6 48 12
> > of<>the<>6 44 48
> > strong<>and<>6 8 63
> > and<>independent<>6 63 8
> > contemporary<>life<>5 6 25
> > independent<>individuals<>5 8 6
> > from<>developing<>5 6 5
> > .<>In<>5 79 5
> > people<>from<>5 23 6
> > into<>truly<>5 6 6
> > prevent<>people<>5 6 23
> > ....
> > ...
> > ...
> >
> > As you can see, the output contains the most frequent bigrams and
> > the number of times each occurs (the first number after each bigram);
> > the other numbers are for statistical purposes. The first line of this
> > file gives the total number of bigrams in the file.
> >
> > You can also use one of the statistical measures in statistic.pl to
> > get a list of bigrams sorted in the order of their statistical
> > importance. I believe for your purpose the output from statistic.pl
> > would be more relevant. To use statistic.pl you need to give the
> > following command:
> >
> > statistic.pl STATISTIC_LIBRARY DESTINATION SOURCE
> >
> > Here <STATISTIC_LIBRARY> is the statistical measure you want to
> > use to compute the relevance of a bigram, and <SOURCE> is the output
> > file generated by count.pl (so you need the output of count.pl to run
> > statistic.pl). The output of statistic.pl looks as follows:
> >
> > 1922
> > from<>developing<>1 64.0971 5 6 5
> > developing<>into<>1 64.0971 5 5 6
> > into<>truly<>2 58.6914 5 6 6
> > independent<>individuals<>3 53.5152 5 8 6
> > truly<>strong<>3 53.5152 5 6 8
> > and<>conveniences<>4 52.5168 9 63 11
> > luxuries<>and<>5 49.5092 9 12 63
> > entirely<>harmless<>6 44.7704 3 3 3
> > as<>well<>7 41.3420 4 13 4
> > people<>from<>8 40.0310 5 23 6
> > prevent<>people<>8 40.0310 5 6 23
> > conveniences<>of<>9 39.7651 7 11 44
> > contemporary<>life<>10 39.0979 5 6 25
> > human<>beings<>11 38.0403 3 5 3
> > ...
> > ...
> >
> > This file also contains the expressions, but they are sorted in the
> > order of their significance. The first number after each bigram is
> > its rank and the second is the score given by the statistical measure
> > used, which in this case is the Log Likelihood (ll) measure.
> >
> >
> > You can further improve the results generated by removing stop words
> > like 'is', 'the', 'for', 'of', 'a', etc. by using a stoplist with
> > count.pl. If you need to use this option please feel free to contact
> > me and I will mail you a list of stop words.
> >
> > If you think that this is what you need and plan to use NSP, I
> > strongly suggest that you read the README and other documentation for
> > count.pl and statistic.pl to get a better understanding of how NSP
> > works.
> >
> > If you have any other questions/comments please feel free to contact me directly.
> >
> >
> > Saiyam Kohli,
> > Graduate Student,
> > University of Minnesota Duluth.
> >
> >
> >
> > --
> > -----------------------------------------------------------------------------------------------------------------------
> > Yesterday is gone. Tomorrow has not yet come. We have only today. Let
> > us begin. -- Mother Teresa
> >

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
